To briefly switch things up, I want to look at a new dataset giving details of nearly 2000 bird species.
set.seed(1)
x <- read.csv("../../data/avian_ssd_jan07.txt", as.is=TRUE, sep="\t")
x <- x[x$Clutch_size > 0,]
x <- x[x$M_mass > 0,]
x <- x[x$Egg_mass > 0,]
dim(x)
## [1] 1685 44
As a sanity check, do the birds with the largest egg mass make sense to us?
x$English_name[order(x$Egg_mass,decreasing=TRUE)[1:10]]
## [1] "Emu" "Wandering Albatross" "Royal Albatross"
## [4] "Brown Kiwi" "Great Spotted Kiwi" "Emperor Penguin"
## [7] "Mute Swan" "Whooper Swan" "King Penguin"
## [10] "Little Spotted Kiwi"
I am interested in understanding what features make a bird have only one egg in a clutch. We can see that the mass (male) and mass of the egg both seem to influnce this set:
cl <- as.numeric(x$Clutch_size == 1)
y <- cl*2 - 1
plot(x$M_mass[cl == 0], x$Egg_mass[cl == 0], log="xy",
pch=19, cex=0.5, col=grey(0.5,0.5))
points(x$M_mass[cl == 1], x$Egg_mass[cl == 1], col="blue", pch=22)
We can use these two variables to fit both a linear and logistic model for classification:
outLm <- lm(y ~ log(M_mass) + log(Egg_mass), data=x)
summary(outLm)
##
## Call:
## lm(formula = y ~ log(M_mass) + log(Egg_mass), data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76647 -0.18238 -0.04966 0.01163 2.12043
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.64548 0.05272 -12.244 < 2e-16 ***
## log(M_mass) -0.19753 0.02401 -8.229 3.8e-16 ***
## log(Egg_mass) 0.31721 0.02914 10.884 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4303 on 1650 degrees of freedom
## (32 observations deleted due to missingness)
## Multiple R-squared: 0.1117, Adjusted R-squared: 0.1106
## F-statistic: 103.7 on 2 and 1650 DF, p-value: < 2.2e-16
outGlm <- glm(cl ~ log(M_mass) + log(Egg_mass), data=x, family="binomial")
summary(outGlm)
##
## Call:
## glm(formula = cl ~ log(M_mass) + log(Egg_mass), family = "binomial",
## data = x)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9842 -0.3172 -0.1687 -0.1301 3.3474
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3054 0.5082 -4.536 5.73e-06 ***
## log(M_mass) -1.4149 0.2247 -6.296 3.05e-10 ***
## log(Egg_mass) 2.4646 0.2817 8.748 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 704.6 on 1652 degrees of freedom
## Residual deviance: 544.4 on 1650 degrees of freedom
## (32 observations deleted due to missingness)
## AIC: 550.4
##
## Number of Fisher Scoring iterations: 7
Which we can visualize as the following for the linear model:
plot(log(x$M_mass[cl == 0]), log(x$Egg_mass[cl == 0]),
pch=19, cex=0.5, col=grey(0.5,0.5))
points(log(x$M_mass[cl == 1]), log(x$Egg_mass[cl == 1]), col="blue", pch=22)
abline(-1 * outLm$coef[1] / outLm$coef[3], -1 * outLm$coef[2] / outLm$coef[3],
col="#6E3179", lty="dashed", lwd=1.5)
But wait, this doesn’t seem very helpful, right? The problem is that the classes are unbalanced so we need to actually move the plane by some amount:
plot(log(x$M_mass[cl == 0]), log(x$Egg_mass[cl == 0]),
pch=19, cex=0.5, col=grey(0.5,0.5))
points(log(x$M_mass[cl == 1]), log(x$Egg_mass[cl == 1]), col="blue", pch=22)
abline(-1 * outLm$coef[1] / outLm$coef[3] - 2, -1 * outLm$coef[2] / outLm$coef[3],
col="#6E3179", lty="dashed", lwd=1.5)
And the GLM looks like this:
plot(log(x$M_mass[cl == 0]), log(x$Egg_mass[cl == 0]),
pch=19, cex=0.5, col=grey(0.5,0.5))
points(log(x$M_mass[cl == 1]), log(x$Egg_mass[cl == 1]), col="blue", pch=22)
abline(-1 * outLm$coef[1] / outLm$coef[3] - 2, -1 * outLm$coef[2] / outLm$coef[3],
col="#000000", lty="dashed", lwd=1.5)
abline(-1 * outGlm$coef[1] / outGlm$coef[3] - 0.7, -1 * outGlm$coef[2] / outGlm$coef[3],
col="#6E3179", lty="dashed", lwd=1.5)
Which both visually separate the space in a sensible way.