This is a short data visualization exercise using the diamonds dataset, which contains the prices and other attributes of almost 54,000 diamonds. It’s a great dataset for learning data analysis and visualization.
We will analyze diamonds by their cut, color, clarity, price, and other attributes.
Dataset Columns
Column | Description |
---|---|
carat | Weight of the diamond (0.2 - 5.01) |
cut | Quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
color | Diamond color, from J (worst) to D (best) |
clarity | A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) |
depth | Total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43 - 79) |
table | Width of top of diamond relative to widest point (43 - 95) |
price | Price in US dollars ($326 - $18,823) |
x | Length in mm (0 - 10.74) |
y | Width in mm (0 - 58.9) |
z | Depth in mm (0 - 31.8) |
Diamonds Dataset Analysis
First, we will check the dimensions of the dataset. The diamonds data ships with ggplot2, so we load that package first:
library(ggplot2)
dim(diamonds)
[1] 53940 10
We will get an overview of the dataset with summary, head and tail
summary(diamonds)
carat cut color clarity
Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
Max. :5.0100 I: 5422 VVS1 : 3655
J: 2808 (Other): 2531
depth table price x
Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
Median :61.80 Median :57.00 Median : 2401 Median : 5.700
Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
y z
Min. : 0.000 Min. : 0.000
1st Qu.: 4.720 1st Qu.: 2.910
Median : 5.710 Median : 3.530
Mean : 5.735 Mean : 3.539
3rd Qu.: 6.540 3rd Qu.: 4.040
Max. :58.900 Max. :31.800
Here’s what we know about the diamonds dataset:
This dataset contains information about 53,940 round-cut diamonds, one diamond per row.
There are 10 variables, each measuring a different attribute of the diamonds.
There are 3 variables with an ordered factor structure: cut, color and clarity. An ordered factor arranges the categorical values in a low-to-high rank order.
There are 6 variables with a numeric structure: carat, depth, table, x, y and z.
There is 1 variable that has an integer structure: price
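The type counts listed above can be checked directly. This is a minimal sketch that assumes the diamonds data is the copy shipped with ggplot2; the name col_types is only illustrative.

```r
library(ggplot2)  # the diamonds dataset ships with ggplot2

# Compact structure listing: one line per column with its type
str(diamonds)

# First class of each column ("ordered", "numeric" or "integer")
col_types <- sapply(diamonds, function(col) class(col)[1])

# Number of columns of each type
table(col_types)
```

Ordered factors report two classes ("ordered" and "factor"), which is why only the first class of each column is kept.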
Here are the first 6 rows:
head(diamonds)
carat | cut | color | clarity | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|---|---|
0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
0.23 | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 |
0.29 | Premium | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 |
0.31 | Good | J | SI2 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 |
0.24 | Very Good | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 |
Then the last 6 rows:
tail(diamonds)
carat | cut | color | clarity | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|---|---|
0.72 | Premium | D | SI1 | 62.7 | 59 | 2757 | 5.69 | 5.73 | 3.58 |
0.72 | Ideal | D | SI1 | 60.8 | 57 | 2757 | 5.75 | 5.76 | 3.50 |
0.72 | Good | D | SI1 | 63.1 | 55 | 2757 | 5.69 | 5.75 | 3.61 |
0.70 | Very Good | D | SI1 | 62.8 | 60 | 2757 | 5.66 | 5.68 | 3.56 |
0.86 | Premium | H | SI2 | 61.0 | 58 | 2757 | 6.15 | 6.12 | 3.74 |
0.75 | Ideal | D | SI2 | 62.2 | 55 | 2757 | 5.83 | 5.87 | 3.64 |
Diamonds Dataset Data Visualization
Histogram that represents the depth of diamonds
ggplot(diamonds, aes(x=depth)) + geom_histogram(fill="blue")
The diamond depth percentage is calculated by dividing total depth by the average diameter, then multiplying by 100.
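The histogram above shows that depth is tightly concentrated: at least half of the diamonds have a depth between 61% and 62.5%, which matches the first and third quartiles in the summary.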
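To make the depth formula concrete, the sketch below recomputes the percentage from x, y and z and compares it with the recorded depth column. It assumes the ggplot2 copy of diamonds. The names d, depth_calc, gap and share are illustrative, and rows with a zero dimension are dropped because they would give a zero or undefined ratio.

```r
library(ggplot2)

# Keep only rows where all three dimensions are positive
d <- subset(diamonds, x > 0 & y > 0 & z > 0)

# Depth percentage: z divided by the mean of x and y, times 100
depth_calc <- 100 * 2 * d$z / (d$x + d$y)

# Typical gap between the recomputed and the recorded depth
gap <- median(abs(depth_calc - d$depth))
gap

# Share of all diamonds with a recorded depth between 61% and 62.5%
share <- mean(diamonds$depth >= 61 & diamonds$depth <= 62.5)
share
```

Small gaps are expected, since the dimensions are recorded to two decimals and depth to one.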
Histogram of price vs. cut of diamonds
ggplot( diamonds, aes(x=price, fill=cut)) +
geom_histogram() +
labs(y="Count", x="Price", fill="Quality", title="Histogram of Price vs. Cut (Quality) of Diamonds")
This histogram shows how the price distribution and the number of diamonds vary with cut quality.
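A stacked histogram can be hard to read precisely, so a small numeric summary may help alongside it. The sketch below, assuming the ggplot2 copy of diamonds, counts the diamonds in each cut and computes their median price. The names n_by_cut, median_price and price_summary are illustrative.

```r
library(ggplot2)

# Number of diamonds in each cut category
n_by_cut <- table(diamonds$cut)

# Median price within each cut category
median_price <- tapply(diamonds$price, diamonds$cut, median)

# Both summaries side by side, one row per cut
price_summary <- data.frame(
  cut = names(median_price),
  n = as.vector(n_by_cut),
  median_price = as.vector(median_price)
)
price_summary
```

The median is used rather than the mean because the price distribution is strongly right-skewed.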
Histogram of price vs. color of diamonds
ggplot( diamonds, aes(x=price, fill=color)) +
geom_histogram() +
labs(y="Count", x="Price", fill="Color", title="Histogram of Price vs. Color of Diamonds")
This histogram shows how the price distribution and the number of diamonds vary with color.
Polar graph of the cut of diamonds
ggplot( diamonds, aes(x=cut, fill=cut)) +
theme_bw() +
geom_bar() +
coord_polar() +
labs(x="Quality", y="Diamonds Count", title="Quality of the Diamonds")
Scatter plot of price vs. carat using quality as color variable
ggplot( diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
labs(x="Carat", y="Price")
This graph shows that carat has no direct relationship with cut quality: some diamonds of more than 4 carats have only a Fair cut.
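The largest diamonds can also be listed directly to confirm what the scatter plot suggests. This is a minimal sketch assuming the ggplot2 copy of diamonds; the name big and the 4-carat threshold simply mirror the observation above.

```r
library(ggplot2)

# Diamonds heavier than 4 carats
big <- subset(diamonds, carat > 4)

# Largest first, with the columns relevant to the plot
big[order(-big$carat), c("carat", "cut", "color", "clarity", "price")]

# How these large diamonds are distributed across cut categories
table(big$cut)
```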
Scatter plot of price vs. carat using color as variable
ggplot( diamonds, aes(x=carat, y=price, color=color)) +
geom_point() +
labs(x="Carat", y="Price")
Here the relationship is more visible: better colors (closer to D) are more likely among smaller, lighter diamonds.
Scatter plot of price vs. carat using clarity as variable
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) +
geom_point() +
labs(x="Carat", y="Price")
This case is very similar to the previous one: clarity is related to carat, since better clarity grades are more likely among smaller, lighter diamonds.
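The link between carat and the color and clarity grades can be checked numerically with the median carat per grade. The sketch below assumes the ggplot2 copy of diamonds; med_color and med_clarity are illustrative names.

```r
library(ggplot2)

# Median carat per color grade (levels run from D, best, to J, worst)
med_color <- tapply(diamonds$carat, diamonds$color, median)
med_color

# Median carat per clarity grade (levels run from I1, worst, to IF, best)
med_clarity <- tapply(diamonds$carat, diamonds$clarity, median)
med_clarity
```

If smaller diamonds do tend to have better grades, the medians should be lower for the best grades (D, IF) than for the worst (J, I1).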
Grouping by cut and color
ggplot(diamonds) +
geom_bar(
mapping = aes(x=cut, fill=color),
position = "dodge"
) +
labs(x="Quality", y="Count", title="Grouping by Quality and Color")
Price grouped by quality
ggplot( diamonds, aes(x=price)) +
geom_density(aes(fill=factor(cut)), alpha=0.5) +
labs(x="Price", fill="Quality")
We can use a box plot to visualize the distribution of prices within each cut quality.
By adding points to the box plot, we get a better idea of the number of measurements and of their distribution:
ggplot( diamonds, aes(x=cut, y=price)) +
theme_bw() +
geom_boxplot(fill="blue") +
geom_jitter(alpha = 0.2, color = "red") +
labs(x="Cut", y="Price", title="Box Plot of Price, Grouped by Quality")
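The quantities a box plot draws can also be computed as numbers. This sketch, assuming the ggplot2 copy of diamonds, returns the first quartile, median and third quartile of price for each cut; price_quartiles is an illustrative name.

```r
library(ggplot2)

# Split prices by cut, then compute the three quartiles for each group
price_quartiles <- sapply(
  split(diamonds$price, diamonds$cut),
  quantile,
  probs = c(0.25, 0.5, 0.75)
)

# Transpose so each row is a cut and each column a quartile
t(price_quartiles)
```

These three values correspond to the bottom edge, middle line and top edge of each box.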