Logistic regression

Bird dataset

To briefly switch things up, I want to look at a new dataset giving details of nearly 2000 bird species.

set.seed(1)
x <- read.csv("../../data/avian_ssd_jan07.txt", as.is=TRUE, sep="\t")
x <- x[x$Clutch_size > 0,]
x <- x[x$M_mass > 0,]
x <- x[x$Egg_mass > 0,]
dim(x)
## [1] 1685   44

As a sanity check, do the birds with the largest egg mass make sense to us?

x$English_name[order(x$Egg_mass,decreasing=TRUE)[1:10]]
##  [1] "Emu"                 "Wandering Albatross" "Royal Albatross"    
##  [4] "Brown Kiwi"          "Great Spotted Kiwi"  "Emperor Penguin"    
##  [7] "Mute Swan"           "Whooper Swan"        "King Penguin"       
## [10] "Little Spotted Kiwi"

Emu

Wandering Albatross

Royal Albatross

Brown Kiwi

Emperor Penguin

I am interested in understanding what features make a bird have only one egg in a clutch. We can see that the mass (male) and mass of the egg both seem to influnce this set:

cl <- as.numeric(x$Clutch_size == 1)
y <- cl*2 - 1
plot(x$M_mass[cl == 0], x$Egg_mass[cl == 0], log="xy",
     pch=19, cex=0.5, col=grey(0.5,0.5))
points(x$M_mass[cl == 1], x$Egg_mass[cl == 1], col="blue", pch=22)

We can use these two variables to fit both a linear and logistic model for classification:

outLm <- lm(y ~ log(M_mass) + log(Egg_mass), data=x)
summary(outLm)
## 
## Call:
## lm(formula = y ~ log(M_mass) + log(Egg_mass), data = x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.76647 -0.18238 -0.04966  0.01163  2.12043 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.64548    0.05272 -12.244  < 2e-16 ***
## log(M_mass)   -0.19753    0.02401  -8.229  3.8e-16 ***
## log(Egg_mass)  0.31721    0.02914  10.884  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4303 on 1650 degrees of freedom
##   (32 observations deleted due to missingness)
## Multiple R-squared:  0.1117, Adjusted R-squared:  0.1106 
## F-statistic: 103.7 on 2 and 1650 DF,  p-value: < 2.2e-16
outGlm <- glm(cl ~ log(M_mass) + log(Egg_mass), data=x, family="binomial")
summary(outGlm)
## 
## Call:
## glm(formula = cl ~ log(M_mass) + log(Egg_mass), family = "binomial", 
##     data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9842  -0.3172  -0.1687  -0.1301   3.3474  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -2.3054     0.5082  -4.536 5.73e-06 ***
## log(M_mass)    -1.4149     0.2247  -6.296 3.05e-10 ***
## log(Egg_mass)   2.4646     0.2817   8.748  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 704.6  on 1652  degrees of freedom
## Residual deviance: 544.4  on 1650  degrees of freedom
##   (32 observations deleted due to missingness)
## AIC: 550.4
## 
## Number of Fisher Scoring iterations: 7

Which we can visualize as the following for the linear model:

plot(log(x$M_mass[cl == 0]), log(x$Egg_mass[cl == 0]),
     pch=19, cex=0.5, col=grey(0.5,0.5))
points(log(x$M_mass[cl == 1]), log(x$Egg_mass[cl == 1]), col="blue", pch=22)
abline(-1 * outLm$coef[1] / outLm$coef[3], -1 * outLm$coef[2] / outLm$coef[3],
         col="#6E3179", lty="dashed", lwd=1.5)

But wait, this doesn’t seem very helpful, right? The problem is that the classes are unbalanced so we need to actually move the plane by some amount:

plot(log(x$M_mass[cl == 0]), log(x$Egg_mass[cl == 0]),
     pch=19, cex=0.5, col=grey(0.5,0.5))
points(log(x$M_mass[cl == 1]), log(x$Egg_mass[cl == 1]), col="blue", pch=22)
abline(-1 * outLm$coef[1] / outLm$coef[3] - 2, -1 * outLm$coef[2] / outLm$coef[3],
         col="#6E3179", lty="dashed", lwd=1.5)

And the GLM looks like this:

plot(log(x$M_mass[cl == 0]), log(x$Egg_mass[cl == 0]),
     pch=19, cex=0.5, col=grey(0.5,0.5))
points(log(x$M_mass[cl == 1]), log(x$Egg_mass[cl == 1]), col="blue", pch=22)
abline(-1 * outLm$coef[1] / outLm$coef[3] - 2, -1 * outLm$coef[2] / outLm$coef[3],
         col="#000000", lty="dashed", lwd=1.5)
abline(-1 * outGlm$coef[1] / outGlm$coef[3] - 0.7, -1 * outGlm$coef[2] / outGlm$coef[3],
         col="#6E3179", lty="dashed", lwd=1.5)

Which both visually separate the space in a sensible way.