We will explore the iris
dataset that comes with R, based on data collected by Edgar Anderson (1935) and analyzed by R. A. Fisher (1936).
We start by loading the data set:
data(iris)
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
We see that the dataset contains four numerical variables and one categorical variable:
Sepal.Length
: The sepal length, in centimeters,Sepal.Width
: The sepal width, in centimeters,Petal.Length
: The petal length, in centimeters,Petal.Width
: The petal width, in centimeters,Species
: The three different species of iris considered: setosa, versicolor and virginicaThe summary
command gave us five-number summaries for the numerical variables and a frequency table for the categorical variable.
We can also create these numerical summaries on their own:
favstats(~Sepal.Length, data=iris)
## min Q1 median Q3 max mean sd n missing
## 4.3 5.1 5.8 6.4 7.9 5.843333 0.8280661 150 0
tally(~Species, data=iris)
## Species
## setosa versicolor virginica
## 50 50 50
We can also produce summaries within each Species:
favstats(~Petal.Length|Species, data=iris)
## Species min Q1 median Q3 max mean sd n missing
## 1 setosa 1.0 1.4 1.50 1.575 1.9 1.462 0.1736640 50 0
## 2 versicolor 3.0 4.0 4.35 4.600 5.1 4.260 0.4699110 50 0
## 3 virginica 4.5 5.1 5.55 5.875 6.9 5.552 0.5518947 50 0
We can easily generate histograms:
histogram(~Petal.Length, breaks=20, col="purple", data=iris)
histogram(~Petal.Length|Species, data=iris, layout=c(1, 3))
And some boxplots:
bwplot(Species~Petal.Length, data=iris)
A scatterplot:
xyplot(Sepal.Length~Petal.Length, data=iris, groups=Species,
fill=brewer.pal(3, "Dark2"), pch=21:23, lwd=2, col="black",
main="Iris data (green=setosa, orange=versicolor, purple=virginica)",
xlab="Petal Length (cm)",
ylab="Sepal Length (cm)",
type=c("p", "smooth"))
A labeled dotplot of the mean Petal Length for each species:
mean(~Petal.Length|Species, data=iris) %>% sort() %>% dotplot()
setosaFit <- lm(Sepal.Length~Petal.Length,
data=iris %>% filter(Species == "setosa"))
summary(setosaFit)
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = iris %>% filter(Species ==
## "setosa"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57238 -0.20671 -0.03084 0.17339 0.93608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2132 0.4156 10.138 1.61e-13 ***
## Petal.Length 0.5423 0.2823 1.921 0.0607 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3432 on 48 degrees of freedom
## Multiple R-squared: 0.07138, Adjusted R-squared: 0.05204
## F-statistic: 3.69 on 1 and 48 DF, p-value: 0.0607
As anticipated, the linear model for setosa is weak.
Residual plot:
residPlot <- xyplot(resid(setosaFit)~fitted(setosaFit),
xlab="Predicted Values", ylab="Residuals")
ladd(panel.abline(h=0, lwd=2, col="black"), plot=residPlot)