ggplot2 is an R package for producing statistical, or data, graphics. Unlike most other graphics packages, ggplot2 has an underlying grammar, based on the Grammar of Graphics (Wilkinson 2005) Wilkinson, Leland. 2005. The Grammar of Graphics. 2nd ed. Statistics and Computing. Springer. , that allows you to compose graphs by combining independent components. More on the structure and use of ggplot2 here: https://ggplot2-book.org/introduction
Here, we will use ggplot2 to visualize and explore the normal distribution. Material and ideas for this chapter are taken from https://ggplot2tutor.com/tutorials/sampling_distributions. Loading the tidyverse library, which includes ggplot2 and many more packages that are needed for the following plots:
# R
library(tidyverse)
We use stat_function to visualize arbitrary functions. For example, we draw a simple polynomial on an empty plot.
# R
ggplot(data.frame(x = c(-20, 20)), aes(x)) +
stat_function(fun = function(x) { x**3 },
geom = "line")
We use this way to draw curves without data. We create a dataframe that defines the limits of the visualization. The first argument of stat_function is fun, which stands for function. We can make up functions from scratch (as in the example above) or use predefined functions like dnorm to visualize the normal distribution.
# R
ggplot(data.frame(x = c(-3, 3)), aes(x)) +
stat_function(fun = dnorm,
geom = "line")
With geom we tell ggplot how to visualize our function. The most common visualizations are an area, a line or points. We are interested in areas under the curve, as we have seen in theory. Let’s do that, by just changing the geom parameter to area.
# R
ggplot(data.frame(x = c(-3, 3)), aes(x)) +
stat_function(fun = dnorm,
geom = "area")
We can change the color of the area.
# R
ggplot(data.frame(x = c(-3, 3)), aes(x)) +
stat_function(fun = dnorm,
geom = "area",
fill = "darkseagreen")
We are interested to draw curves with certain function arguments. In stat_function we can add the args argument.
# R
ggplot(data.frame(x = c(20, 180)), aes(x)) +
stat_function(fun = dnorm,
geom = "area",
fill = "steelblue",
args = list(
mean = 100,
sd = 20
))
Since ggplot2 follows the grammar of graphics, we can easily stack functions on top of each other. We can display two normal distributions with different mean values. We add an alpha level to the upper distribution to make the lower distribution visible.
# R
ggplot(data.frame(x = c(0, 180)), aes(x)) +
stat_function(fun = dnorm,
geom = "area",
fill = "darkgrey",
args = list(
mean = 100,
sd = 20
)) +
stat_function(fun = dnorm,
geom = "area",
fill = "steelblue",
alpha = .8,
args = list(
mean = 70,
sd = 20
))
We can also draw two normal distrbutions with the same mean and different variance / standard deviation
# R
ggplot(data.frame(x = c(-30, 230)), aes(x)) +
stat_function(fun = dnorm,
geom = "area",
fill = "darkgrey",
args = list(
mean = 100,
sd = 20
)) +
stat_function(fun = dnorm,
geom = "area",
fill = "steelblue",
alpha = .8,
args = list(
mean = 100,
sd = 45
))
Often, we are only interested in the area under a curve for certain limits on the x-axis. For this we need the xlim argument. We will shade the the right half of the normal distribution.
# R
ggplot(data.frame(x = c(-3, 3)), aes(x)) +
stat_function(fun = dnorm,
geom = "line",
xlim = c(-3, 3)) +
stat_function(fun = dnorm,
geom = "area",
fill = "steelblue",
xlim = c(0, 3)) +
xlim(-3, 3)
We can draw the limits according to the confidence intervals, for example the 95% CI from the right side of the curve, using the qnorm(.95) expression.
# R
ggplot(data.frame(x = c(-3, 3)), aes(x)) +
stat_function(fun = dnorm,
geom = "line",
xlim = c(-3, 3)) +
stat_function(fun = dnorm,
geom = "area",
fill = "steelblue",
xlim = c(qnorm(.95), 3)) +
xlim(-3, 3)
We can do the same with the 95% CI for both sides of the curve, using qnorm(.025) and qnorm(.975) with xlim. Note that this is the standard normal distribution, with mean=0 and SD=1 and for this reason the qnorm(.975)=1,96 expression marks the x-value where the CI begins, since it is multiplied by 1(SD) and added to the mean(0).
# R
ggplot(data.frame(x = c(-3, 3)), aes(x)) +
stat_function(fun = dnorm,
geom = "line",
xlim = c(-3, 3)) +
stat_function(fun = dnorm,
geom = "area",
fill = "steelblue",
xlim = c(qnorm(.975), 3)) +
stat_function(fun = dnorm,
geom = "area",
fill = "steelblue",
xlim = c(-3, qnorm(.025))) +
xlim(-3, 3)
Generate data from 1000 observations from a normal distribution with mean 10 and standard deviation 1
# R
generated_data <- data.frame(normal_data = rnorm(1000, 10, 1))
Compare to histogram
# R
ggplot(generated_data, aes(x = normal_data)) +
geom_histogram(colour = "black", aes(y = after_stat(density))) +
geom_function(fun = dnorm, colour = "red", linewidth = 2,
args = list(mean = mean(generated_data$normal_data),
sd = sd(generated_data$normal_data)))
Do the same with 100,000 onservations. See that the distribution fits better the normal distribution curve
# R
generated_data2 <- data.frame(normal_data = rnorm(100000, 10, 1))
ggplot(generated_data2, aes(x = normal_data)) +
geom_histogram(colour = "black", aes(y = after_stat(density))) +
geom_function(fun = dnorm, colour = "red", linewidth = 2,
args = list(mean = mean(generated_data2$normal_data),
sd = sd(generated_data2$normal_data)))
To draw a Q-Q plot for a normal distribution, we use the geoms geom_qq and geom_qq_line. First for 1000 observations.
# R
ggplot(generated_data, aes(sample = normal_data)) +
geom_qq() + geom_qq_line()
Then we do the same for 100.000 observations. Which case shows a better fit with normality?
# R
ggplot(generated_data2, aes(sample = normal_data)) +
geom_qq() + geom_qq_line()
We will use the iris dataset. Numerous guides have been written on the exploration of this widely known dataset. Iris, introduced by Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems, contains three plant species (setosa, virginica, versicolor) and four features measured for each sample. These quantify the morphologic variation of the iris flower in its three species, all measurements given in centimeters.
# R
head(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Compare to the normal curve with the histogram of the actual data for Sepal.Width. There are some deviations from the normal distribution, what do you think?
# R
ggplot(iris, aes(x = Sepal.Width)) +
geom_histogram(colour = "black", bins = 22, aes(y = after_stat(density))) +
geom_function(fun = dnorm, colour = "red", linewidth = 2,
args = list(mean = mean(iris$Sepal.Width),
sd = sd(iris$Sepal.Width)))
We draw a Q-Q plot for a comparison with the normal distribution.
# R
ggplot(iris, aes(sample = Sepal.Width)) +
geom_qq() + geom_qq_line()
Now we will explore the distribution of the same trait for each species separately, to see if our data are differentiated among species. The histogram is for all data and the curves are the normal distribution curves for each species. Does this figure explain the shape of the histogram?
# R
ggplot(iris, aes(x = Sepal.Width)) +
geom_histogram(colour = "black", bins = 22, aes(y = after_stat(density))) +
geom_function(fun = dnorm, colour = "red", linewidth = 2,
args = list(mean = mean(iris$Sepal.Width[1:50]),
sd = sd(iris$Sepal.Width[1:50]))) +
geom_function(fun = dnorm, colour = "green", linewidth = 2,
args = list(mean = mean(iris$Sepal.Width[51:100]),
sd = sd(iris$Sepal.Width[51:100]))) +
geom_function(fun = dnorm, colour = "blue", linewidth = 2,
args = list(mean = mean(iris$Sepal.Width[101:150]),
sd = sd(iris$Sepal.Width[101:150])))
The above figure needs a legend to describe which species is represented by which curve. For this reason we need to replace geom_function with stat_function and use aes(colour = …) inside stat_function to map the colors. Then we adjust scale_colour_manual() at the end of the command to to use named colors corresponding to the species.
# R
ggplot(iris, aes(x = Sepal.Width)) +
geom_histogram(colour = "black", bins = 22, aes(y = after_stat(density))) +
stat_function(fun = dnorm, aes(colour = "setosa"), linewidth = 2,
args = list(mean = mean(iris$Sepal.Width[1:50]),
sd = sd(iris$Sepal.Width[1:50]))) +
stat_function(fun = dnorm, aes(colour = "versicolor"), linewidth = 2,
args = list(mean = mean(iris$Sepal.Width[51:100]),
sd = sd(iris$Sepal.Width[51:100]))) +
stat_function(fun = dnorm, aes(colour = "virginica"), linewidth = 2,
args = list(mean = mean(iris$Sepal.Width[101:150]),
sd = sd(iris$Sepal.Width[101:150]))) +
scale_colour_manual("Species", values = c(setosa = "red", versicolor = "green", virginica = "blue"))
In ggplot2 we can view separately the attributes of the distribution of different species in histograms.
# R
ggplot(iris, aes(x= Sepal.Width, fill = Species)) +
geom_histogram(bins=22) +
facet_wrap(~Species)
We can do the same and draw the normal curve for each species.
# R
ggplot(iris, aes(x = Sepal.Width, fill = Species)) +
geom_histogram(aes(y = after_stat(density)), bins = 22) +
facet_wrap(~Species) +
geom_line(aes(y = dnorm(Sepal.Width,
mean = tapply(Sepal.Width, Species, mean, na.rm = TRUE)[PANEL],
sd = tapply(Sepal.Width, Species, sd, na.rm = TRUE)[PANEL])))