UCWV: Data Visualization

Diamonds Dataset

We will analyze diamonds by their cut, color, clarity, price, and other attributes.

Dataset Columns

Column	Description
carat	Weight of the diamond (0.2 - 5.01)
cut	Quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color	Diamond color, from J (worst) to D (best)
clarity	A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
depth	Total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43 - 79)
table	Width of top of diamond relative to widest point (43 - 95)
price	Price in US dollars ($326 - $18,823)
x	Length in mm (0 - 10.74)
y	Width in mm (0 - 58.9)
z	Depth in mm (0 - 31.8)

Diamonds Dataset Analysis

First, we check the dimensions of this dataset:

library(ggplot2)  # provides ggplot() and the diamonds dataset
dim(diamonds)

[1] 53940    10

Next, we get an overview of the dataset with summary, head, and tail:

summary(diamonds)

     carat               cut        color        clarity     
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
 Max.   :5.0100                     I: 5422   VVS1   : 3655  
                                    J: 2808   (Other): 2531  
     depth           table           price             x         
 Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
 1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
 Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
 Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
 3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
 Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
                                                                 
       y                z         
 Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 4.720   1st Qu.: 2.910  
 Median : 5.710   Median : 3.530  
 Mean   : 5.735   Mean   : 3.539  
 3rd Qu.: 6.540   3rd Qu.: 4.040  
 Max.   :58.900   Max.   :31.800

Here’s what we know about the diamonds dataset:

This dataset contains information about 53,940 round-cut diamonds. Each row of data represents a different diamond.
There are 10 variables measuring various pieces of information about the diamonds.
There are 3 variables with an ordered factor structure: cut, color, and clarity. An ordered factor arranges the categorical values in a low-to-high rank order.
There are 6 variables with a numeric structure: carat, depth, table, x, y, and z.
There is 1 variable with an integer structure: price.
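
The variable structures listed above can be checked directly in R. The following is a minimal sketch, assuming the ggplot2 package is installed (it ships the diamonds dataset); the name col_classes is only illustrative.

```r
library(ggplot2)  # assumption: ggplot2 is installed and provides `diamonds`

# Take the first class of each column; ordered factors report "ordered" first
col_classes <- sapply(diamonds, function(col) class(col)[1])
print(col_classes)

# Count how many columns fall into each structure
print(table(col_classes))
```

For a more detailed view of the same information, str(diamonds) also lists the factor levels of cut, color, and clarity.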

Let's look at the first 6 rows:

head(diamonds)

carat	cut	color	clarity	depth	table	price	x	y	z
0.23	Ideal	E	SI2	61.5	55	326	3.95	3.98	2.43
0.21	Premium	E	SI1	59.8	61	326	3.89	3.84	2.31
0.23	Good	E	VS1	56.9	65	327	4.05	4.07	2.31
0.29	Premium	I	VS2	62.4	58	334	4.20	4.23	2.63
0.31	Good	J	SI2	63.3	58	335	4.34	4.35	2.75
0.24	Very Good	J	VVS2	62.8	57	336	3.94	3.96	2.48

Then the last 6 rows:

tail(diamonds)

carat	cut	color	clarity	depth	table	price	x	y	z
0.72	Premium	D	SI1	62.7	59	2757	5.69	5.73	3.58
0.72	Ideal	D	SI1	60.8	57	2757	5.75	5.76	3.50
0.72	Good	D	SI1	63.1	55	2757	5.69	5.75	3.61
0.70	Very Good	D	SI1	62.8	60	2757	5.66	5.68	3.56
0.86	Premium	H	SI2	61.0	58	2757	6.15	6.12	3.74
0.75	Ideal	D	SI2	62.2	55	2757	5.83	5.87	3.64

Diamonds Dataset Data Visualization

Histogram that represents the depth of diamonds

ggplot(diamonds, aes(x=depth)) + geom_histogram(fill="blue")

The diamond depth percentage is calculated by dividing total depth by the average diameter, then multiplying by 100.

The histogram above shows that most of the diamonds have a depth between 61% and 62.5%.
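
To make the depth formula concrete, the sketch below recomputes the depth percentage for three diamonds whose x, y, z, and depth values are copied from rows 2 to 4 of the head(diamonds) output. The data frame sample_rows and the column depth_computed are illustrative names, not part of the dataset. Not every row reproduces exactly: applying the same formula to the first row of head(diamonds) gives about 61.3 against a recorded 61.5.

```r
# x, y, z and depth copied from rows 2 to 4 of the head(diamonds) output
sample_rows <- data.frame(
  x     = c(3.89, 4.05, 4.20),
  y     = c(3.84, 4.07, 4.23),
  z     = c(2.31, 2.31, 2.63),
  depth = c(59.8, 56.9, 62.4)  # depth as recorded in the dataset
)

# depth % = z / mean(x, y) * 100 = 2 * z / (x + y) * 100
sample_rows$depth_computed <- round(
  100 * 2 * sample_rows$z / (sample_rows$x + sample_rows$y), 1
)

print(sample_rows)
```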

Histogram of price vs. cut of diamonds

ggplot( diamonds, aes(x=price, fill=cut)) +
        geom_histogram() +
        labs(y="Count", x="Price", fill="Quality", title="Histogram of Price vs. Cut (Quality) of Diamonds")

This histogram shows how quality relates to the price of the diamonds and how many diamonds fall into each price range.
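
Because the bars are stacked, differences between cuts can be hard to read from the histogram alone. As a rough numeric complement, the sketch below computes the median price and the number of diamonds per cut. It assumes ggplot2 is installed; median_price_by_cut and count_by_cut are illustrative names.

```r
library(ggplot2)  # assumption: ggplot2 is installed and provides `diamonds`

# Median price within each cut level
median_price_by_cut <- tapply(diamonds$price, diamonds$cut, median)
print(median_price_by_cut)

# Number of diamonds within each cut level
count_by_cut <- table(diamonds$cut)
print(count_by_cut)
```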

Histogram of price vs. color of diamonds

ggplot( diamonds, aes(x=price, fill=color)) +
        geom_histogram() +
        labs(y="Count", x="Price", fill="Color", title="Histogram of Price vs. Color of Diamonds")

This histogram shows how color relates to the price of the diamonds and how many diamonds fall into each price range.

Polar graph of the cut of diamonds

ggplot( diamonds, aes(x=cut, fill=cut)) +
        theme_bw() +
        geom_bar() +
        coord_polar() +
        labs(x="Quality", y="Diamonds Count", title="Quality of the Diamonds")

Scatter plot of price vs. carat using quality as color variable

ggplot( diamonds, aes(x=carat, y=price, color=cut)) + 
        geom_point() + 
        labs(x="Carat", y="Price")

This graph shows that carat does not have a direct relationship with quality, since there are Fair diamonds of more than 4 carats.
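
The observation about large Fair diamonds can also be checked without the plot. The sketch below is one possible way to do it, assuming ggplot2 is installed; big_diamonds is an illustrative name.

```r
library(ggplot2)  # assumption: ggplot2 is installed and provides `diamonds`

# Keep only diamonds heavier than 4 carats, with the columns of interest
big_diamonds <- diamonds[diamonds$carat > 4, c("carat", "cut", "price")]
print(big_diamonds)

# How many of these large diamonds fall into each cut level
print(table(big_diamonds$cut))
```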

Scatter plot of price vs. carat using color as variable

ggplot( diamonds, aes(x=carat, y=price, color=color)) + 
        geom_point() + 
        labs(x="Carat", y="Price")

In this case the relationship between carat and color is more visible, since a smaller, lighter diamond is more likely to have a better color.

Scatter plot of price vs. carat using clarity as variable

ggplot( diamonds, aes(x=carat, y=price, color=clarity)) +
        geom_point() +
        labs(x="Carat", y="Price")

This case is very similar to the previous one: there is a visible relationship between carat and clarity, since a smaller, lighter diamond is more likely to have a better clarity.
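
One way to look at these two relationships numerically is to compare the typical weight within each clarity and color level. The sketch below assumes ggplot2 is installed and uses illustrative variable names; it only summarizes the data and does not by itself establish the pattern described above.

```r
library(ggplot2)  # assumption: ggplot2 is installed and provides `diamonds`

# Median carat within each clarity level, in the factor's own level order
median_carat_by_clarity <- tapply(diamonds$carat, diamonds$clarity, median)
print(median_carat_by_clarity)

# Median carat within each color level
median_carat_by_color <- tapply(diamonds$carat, diamonds$color, median)
print(median_carat_by_color)
```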

Grouping by cut and color

ggplot(diamonds) +
geom_bar(
  mapping = aes(x=cut, fill=color),
  position = "dodge" 
) +
labs(x="Quality", y="Count", title="Grouping by Quality and Color")

Price grouped by quality

ggplot( diamonds, aes(x=price)) +
        geom_density(aes(fill=factor(cut)), alpha=0.5) +
        labs(x="Price", fill="Quality")

We can use a box plot to visualize the distribution of prices within each quality level.

By adding points to the box plot, we get a better idea of the number of measurements and of their distribution.

ggplot( diamonds, aes(x=cut, y=price)) +
        theme_bw() +
        geom_boxplot(fill="blue") +
        geom_jitter(alpha = 0.2, color = "red") +
        labs(x="Cut", y="Price", title="Box Plot of Price, Grouped by Quality")