When is the due date?

Posted by Mike Birdgeneau

Category: Data

Firstly, it's a little surprising I haven't done this yet - using some simple statistics to help figure out when the little one is likely to show up! This could help with some life decisions, for example: Should I go skiing today? How far from town am I willing to drive?

Of course, in order to use these stats, I'd have to figure out what threshold I'd be comfortable with... a more difficult problem. Let's start with the easier one! The data I'm using comes from spacefem.com; a big thanks for making this data available.

Let's start by loading up a few packages, the data & defining a due date:

suppressPackageStartupMessages({
    library(data.table)
    library(ggplot2)
    library(gridExtra)
    library(fitdistrplus)
    library(scales)
    library(xtable)
})

due_date <- as.Date("2016-02-29")
birth_data <- data.table(read.csv("data/cdf_spacefem_all.csv"))

Here's a quick sample of what the data looks like at this point:

head(birth_data, 5)
##    days num_births
## 1:  245         13
## 2:  246         15
## 3:  247          9
## 4:  248         21
## 5:  249         18

Now we'll take the table that has days, load it up & make a few calculations; this allows us to calculate a probability distribution, and a cumulative distribution function relatively easily!

birth_data[, `:=`(Date, due_date + (days - 280 + 1)), ]  # Assumes 280d from last menstrual period for due date calc.

birth_data[, `:=`(frac, num_births/sum(num_births)), ]
birth_data[, `:=`(cdf, cumsum(frac)), ]

Let's take a quick look at the data (always good practice):

p1 <- ggplot(birth_data, aes(x = Date, y = frac)) + geom_area(fill = "skyblue") + 
    theme_bw() + ylab("Probability of Delivery") + ggtitle("Probability of Delivery by Date") + 
    scale_y_continuous(labels = percent)

p2 <- ggplot(birth_data, aes(x = Date, y = cdf)) + geom_area(fill = "skyblue") + 
    theme_bw() + ylab("Cumulative Probabilty of Delivery") + ggtitle("Cumulative Probability of Delivery by Date") + 
    scale_y_continuous(labels = percent)

grid.arrange(p1, p2, ncol = 2)

Birth Date Distributions

The sample size of this dataset is: (n=7493), and the resolution is only by 'days', so we see that the distribution certainly isn't smooth. Interestingly enough, the distribution doesn't look normal. This could be due to several factors, including induced labour, etc.

Let's take a quick look at the distribution (e.g. skew & kurtosis). The Cullen & Frey graph works well to show where our data falls relative to other distributions:

descdist(as.numeric(sample(birth_data$Date, size = 1e+06, replace = TRUE, prob = birth_data$frac)))

Distribution Info

## summary statistics
## ------
## min:  16826   max:  16882 
## median:  16860 
## mean:  16859 
## estimated sd:  9.657033 
## estimated skewness:  -0.7695845 
## estimated kurtosis:  3.591334

Now, just a word of caution: You shouldn't use a distribution without understanding where the distribution comes from, and understanding if the true population may actually be described by that distribution... I don't really need to fit a distribution here - but I could select a Weibull distribution - it should fit the data reasonably well & is often used in time to failure type analysis... which has some similarities. Here's how this would fit:

fweib <- fitdist(as.numeric(sample(birth_data$Date, size = 1e+06, replace = TRUE, 
    prob = birth_data$frac)), "weibull")
dt_weib <- data.table(cdf = seq(0.001, 0.999, by = 0.001), Date = as.Date(qweibull(seq(0.001, 
    0.999, by = 0.001), shape = fweib$estimate["shape"], scale = fweib$estimate["scale"]), 
    origin = as.Date("1970-01-01")))

p2 + geom_line(data = dt_weib, colour = "red", lwd = 0.8)

Weibull Distributino Fit

So the weibull fits pretty nicely. The fit results are here, for someone who might be interested:

print(fweib)
## Fitting of the distribution ' weibull ' by maximum likelihood 
## Parameters:
##        estimate  Std. Error
## shape  2112.648 1.602996600
## scale 16863.485 0.008414135

Now, of course because it's currently 2016-02-05, we've shown that delivery on dates before today didn't occur, and therefore the distribution starts to become truncated. Here's what's happening:

birth_data_limited <- birth_data[Date >= Sys.Date(), ]
birth_data_limited[, `:=`(frac, num_births/sum(num_births)), ]
birth_data_limited[, `:=`(cdf, cumsum(frac)), ]

# Plot Today's CDF over the Original
p2 + geom_area(aes(fill = "Original")) + geom_area(data = birth_data_limited, 
    aes(fill = "Today")) + scale_fill_manual(name = NULL, values = c(Original = "skyblue", 
    Today = "seagreen")) + theme(legend.position = "bottom")

Today's CDF

Am I going to miss that meeting?

Birdgeneau Baby Predictor: Shiny App

Winning a Baby Pool

If you're looking to win a baby pool, there's no sure way - that's the nature of probabilities, but you can at least increase your chances of winning.

Here's the updated histogram, and the top 15 most likely days from the filtered distribution shown above:

Date Percent
2016-03-01 5.5%
2016-03-02 5.2%
2016-02-29 4.9%
2016-03-03 4.5%
2016-03-04 4.3%
2016-03-07 4.3%
2016-03-06 4.2%
2016-03-08 4.2%
2016-02-27 4.2%
2016-02-28 4.2%
2016-03-05 4.1%
2016-02-26 3.8%
2016-02-23 3.2%
2016-02-25 3.2%
2016-03-09 3.1%

Comments

Python Development with Docker

I've been trying to brush up my skills in Python after spending a lot of time in R. One of the challenges I've run into is maintaining a common development environment on my Mac. Enter Docker.

Force Carbonation Charts

Force carbonation charts are readily available in either metric or imperial units, mostly the latter. The problem was, I use Celcius, and my regulator is in psi. This was a quick solution to the problem.

Leap Year Birthday?

If you're born on Feb 29th, when's your Birthday? Good question.

Other ways to reach me