Data Visualization

A short data visualization exercise. The dataset contains the prices and other attributes of almost 54,000 diamonds, which makes it a great dataset for learning data analysis and visualization.

Iván López Torres
2022-10-22

Diamonds Dataset

We will analyze diamonds by their cut, color, clarity, price, and other attributes.

Dataset Columns

Column   Description
carat    Weight of the diamond (0.2 - 5.01)
cut      Quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color    Diamond color, from J (worst) to D (best)
clarity  A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
depth    Total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43 - 79)
table    Width of top of diamond relative to widest point (43 - 95)
price    Price in US dollars ($326 - $18,823)
x        Length in mm (0 - 10.74)
y        Width in mm (0 - 58.9)
z        Depth in mm (0 - 31.8)

Diamonds Dataset Analysis

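First, we will check the dimensions of this dataset. The diamonds data ships with the ggplot2 package, so we load that package first:

library(ggplot2)
dim(diamonds)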
[1] 53940    10

We will get an overview of the dataset with summary, head, and tail:

summary(diamonds)
     carat               cut        color        clarity     
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
 Max.   :5.0100                     I: 5422   VVS1   : 3655  
                                    J: 2808   (Other): 2531  
     depth           table           price             x         
 Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
 1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
 Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
 Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
 3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
 Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
                                                                 
       y                z         
 Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 4.720   1st Qu.: 2.910  
 Median : 5.710   Median : 3.530  
 Mean   : 5.735   Mean   : 3.539  
 3rd Qu.: 6.540   3rd Qu.: 4.040  
 Max.   :58.900   Max.   :31.800  
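The clarity column of the summary output folds its least frequent grades into an (Other) row. As a small sketch, table can list every grade separately. The variable names clarity_counts and hidden are ours, chosen only for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# Count diamonds at every clarity grade; summary() folds the rarest into "(Other)"
clarity_counts <- table(diamonds$clarity)
print(clarity_counts)

# The two grades that do not appear by name in the summary output
hidden <- clarity_counts[c("I1", "IF")]
print(hidden)
print(sum(hidden))  # should equal the "(Other)" count reported by summary()
```

The sum of the two hidden grades should match the 2531 shown next to (Other).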
                                  

Here’s what the individual records in the diamonds dataset look like. We start with the first 6 rows:

head(diamonds)
carat  cut        color  clarity  depth  table  price  x     y     z
0.23   Ideal      E      SI2      61.5   55     326    3.95  3.98  2.43
0.21   Premium    E      SI1      59.8   61     326    3.89  3.84  2.31
0.23   Good       E      VS1      56.9   65     327    4.05  4.07  2.31
0.29   Premium    I      VS2      62.4   58     334    4.20  4.23  2.63
0.31   Good       J      SI2      63.3   58     335    4.34  4.35  2.75
0.24   Very Good  J      VVS2     62.8   57     336    3.94  3.96  2.48

Then the last 6 rows:

tail(diamonds)
carat  cut        color  clarity  depth  table  price  x     y     z
0.72   Premium    D      SI1      62.7   59     2757   5.69  5.73  3.58
0.72   Ideal      D      SI1      60.8   57     2757   5.75  5.76  3.50
0.72   Good       D      SI1      63.1   55     2757   5.69  5.75  3.61
0.70   Very Good  D      SI1      62.8   60     2757   5.66  5.68  3.56
0.86   Premium    H      SI2      61.0   58     2757   6.15  6.12  3.74
0.75   Ideal      D      SI2      62.2   55     2757   5.83  5.87  3.64

Diamonds Dataset Data Visualization

Histogram that represents the depth of diamonds

ggplot(diamonds, aes(x=depth)) + geom_histogram(fill="blue")

The diamond depth percentage is calculated by dividing total depth by the average diameter, then multiplying by 100.

The histogram above shows that the diamonds are concentrated between 61% and 62.5% depth, which are the first and third quartiles reported by summary.
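To make the depth formula concrete, here is a minimal sketch that recomputes the percentage from the x, y and z columns and sets it next to the recorded depth. We assume that rows with a zero dimension (the summary shows minimums of 0 for x, y and z) should be left out. The names measured and computed_depth are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# Keep only rows with positive dimensions, so the division is well defined
measured <- subset(diamonds, x > 0 & y > 0 & z > 0)

# Depth percentage: z divided by the mean of x and y, times 100
computed_depth <- 100 * 2 * measured$z / (measured$x + measured$y)

# Put the recorded and the recomputed values side by side for the first rows
comparison <- data.frame(
  recorded = measured$depth,
  computed = round(computed_depth, 1)
)
print(head(comparison))
```

The two columns are close but not always identical. For the first diamond, the formula gives about 61.3 while the dataset records 61.5, presumably because the published dimensions are rounded.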

Histogram of price vs. cut of diamonds

ggplot( diamonds, aes(x=price, fill=cut)) +
        geom_histogram() +
        labs(y="Count", x="Price", fill="Quality", title="Histogram of Price vs. Cut (Quality) of Diamonds")

This histogram shows how the quality relates to the price of the diamonds, and how many diamonds fall into each price range.
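The numbers behind this histogram can also be summarized in a small table. The following is only a sketch using base R, and the names cut_counts, median_price and cut_summary are ours.

```r
library(ggplot2)  # provides the diamonds dataset

# Number of diamonds in each cut category
cut_counts <- table(diamonds$cut)

# Median price within each cut category
median_price <- tapply(diamonds$price, diamonds$cut, median)

# Combine both into one table, one row per cut category
cut_summary <- data.frame(
  cut = names(cut_counts),
  count = as.vector(cut_counts),
  median_price = as.vector(median_price)
)
print(cut_summary)
```

The count column reproduces the cut counts from the summary output, and the median column gives one typical price per quality level to read alongside the histogram.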

Histogram of price vs. color of diamonds

ggplot( diamonds, aes(x=price, fill=color)) +
        geom_histogram() +
        labs(y="Count", x="Price", fill="Color", title="Histogram of Price vs. Color of Diamonds")

This histogram shows how the color relates to the price of the diamonds, and how many diamonds fall into each price range.
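In a stacked histogram the tall bars at low prices dominate the picture. As a possible variation (not part of the original analysis), the bars can be rescaled so that each price range shows the share of every color. The bin width of 500 dollars is an arbitrary choice for this sketch.

```r
library(ggplot2)

# Same data as the stacked histogram, but each 500-dollar bin is rescaled to sum to 1,
# so a bar shows the share of each color grade within that price range
p <- ggplot(diamonds, aes(x = price, fill = color)) +
  geom_histogram(binwidth = 500, position = "fill") +
  labs(y = "Share of diamonds", x = "Price", fill = "Color",
       title = "Share of Each Color by Price Range")
print(p)
```

This view gives up the information about how many diamonds are in each bin, so it complements the stacked histogram rather than replacing it.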

Polar graph of the cut of diamonds

ggplot( diamonds, aes(x=cut, fill=cut)) +
        theme_bw() +
        geom_bar() +
        coord_polar() +
        labs(x="Quality", y="Diamonds Count", title="Quality of the Diamonds")

Scatter plot of price vs. carat using quality as color variable

ggplot( diamonds, aes(x=carat, y=price, color=cut)) + 
        geom_point() + 
        labs(x="Carat", y="Price")

This graph shows that carat does not have a direct relationship with quality, since we can find Fair diamonds of more than 4 carats.
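The heaviest diamonds can be listed directly to check this observation. A minimal sketch, where heavy is just an illustrative name:

```r
library(ggplot2)  # provides the diamonds dataset

# Diamonds heavier than 4 carats, with the columns relevant to the observation
heavy <- subset(diamonds, carat > 4,
                select = c(carat, cut, color, clarity, price))
print(heavy)

# Cut categories represented among these heavy diamonds
print(table(droplevels(heavy$cut)))
```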

Scatter plot of price vs. carat using color as variable

ggplot( diamonds, aes(x=carat, y=price, color=color)) + 
        geom_point() + 
        labs(x="Carat", y="Price")

In this case we see a more visible relationship between carat and color: a smaller, lighter diamond is more likely to have a better color.
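One way to put a number on this impression is the median carat weight per color grade. This is only a sketch, and median_carat_by_color is an illustrative name.

```r
library(ggplot2)  # provides the diamonds dataset

# Median carat weight for each color grade, from D (best) to J (worst)
median_carat_by_color <- tapply(diamonds$carat, diamonds$color, median)
print(round(median_carat_by_color, 2))
```

If the pattern in the scatter plot holds, the median weight should be lower for the best grade (D) than for the worst (J).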

Scatter plot of price vs. carat using clarity as variable

# Build the plot directly so that it is displayed, as in the previous examples
ggplot( diamonds, aes(x=carat, y=price, color=clarity)) + 
        geom_point() + 
        labs(x="Carat", y="Price")

This case is very similar to the previous one: there is a visible relationship between carat and clarity, since a smaller, lighter diamond is more likely to have a better clarity.
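To check this with numbers, we can compare how often each clarity grade appears among lighter and heavier diamonds. The 1 carat threshold is an arbitrary choice for this sketch, and the names weight_class and clarity_share are ours.

```r
library(ggplot2)  # provides the diamonds dataset

# Split the diamonds into two weight classes at 1 carat (arbitrary threshold)
weight_class <- ifelse(diamonds$carat < 1, "under 1 carat", "1 carat or more")

# Share of each clarity grade within each weight class (each row sums to 1)
clarity_share <- prop.table(table(weight_class, diamonds$clarity), margin = 1)
print(round(clarity_share, 3))
```

Reading across a row shows the clarity mix for one weight class. The best grade, IF, should take a larger share of the row for diamonds under 1 carat.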

Grouping by cut and color

ggplot(diamonds) +
geom_bar(
  mapping = aes(x=cut, fill=color),
  position = "dodge" 
) +
labs(x="Quality", y="Count", title="Grouping by Quality and Color")
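Each bar in this grouped chart is one cell of a cut by color cross-tabulation. As a sketch, the same counts can be printed as a table, where cut_by_color is an illustrative name:

```r
library(ggplot2)  # provides the diamonds dataset

# Cross-tabulate cut (rows) against color (columns); each cell is one bar in the chart
cut_by_color <- table(diamonds$cut, diamonds$color)

# addmargins() appends row and column totals
print(addmargins(cut_by_color))
```

The row totals reproduce the cut counts from the summary output, and the column totals reproduce the color counts.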

Price grouped by quality

ggplot( diamonds, aes(x=price)) +
        geom_density(aes(fill=factor(cut)), alpha=0.5) +
        labs(x="Price", fill="Quality")

We can use a box plot to visualize the distribution of prices within each quality level.

By adding points to the box plot, we get a better idea of the number of measurements and of their distribution:

ggplot( diamonds, aes(x=cut, y=price)) + 
        theme_bw() +
        geom_boxplot(fill="blue") +
        geom_jitter(alpha = 0.2, color = "red") +
        labs(x="Cut", y="Price", title="Box Plot of Price, Grouped by Quality")
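With almost 54,000 diamonds, the jittered points can cover the boxes. One possible variation, sketched here under our own assumptions, draws the boxes from all rows but plots only a random sample of points. The sample size of 2,000, the seed, and the lighter fill color are arbitrary choices.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed so the sample is reproducible

# Draw 2,000 rows at random so the jittered points do not bury the boxes
sampled <- diamonds[sample(nrow(diamonds), 2000), ]

p <- ggplot(diamonds, aes(x = cut, y = price)) +
  theme_bw() +
  # Boxes are computed from all rows; outlier dots are hidden to avoid clutter
  geom_boxplot(fill = "lightblue", outlier.shape = NA) +
  # Points come from the random sample only
  geom_jitter(data = sampled, alpha = 0.2, color = "red", width = 0.25) +
  labs(x = "Cut", y = "Price", title = "Box Plot of Price, Grouped by Quality")
print(p)
```

Because the points are a sample, they indicate the shape of each distribution rather than the exact number of measurements.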