## When is the due date?

Posted by Mike Birdgeneau

##### Category: Data

Firstly, it's a little surprising I haven't done this yet - using some simple statistics to help figure out when the little one is likely to show up! This could help with some life decisions, for example: Should I go skiing today? How far from town am I willing to drive?

Of course, in order to use these stats, I'd have to figure out what threshold I'd be comfortable with... a more difficult problem. Let's start with the easier one! The data I'm using comes from spacefem.com; a big thanks for making this data available.

Let's start by loading up a few packages, the data & defining a due date:

```
suppressPackageStartupMessages({
library(data.table)
library(ggplot2)
library(gridExtra)
library(fitdistrplus)
library(scales)
library(xtable)
})
due_date <- as.Date("2016-02-29")
birth_data <- data.table(read.csv("data/cdf_spacefem_all.csv"))
```

Here's a quick sample of what the data looks like at this point:

```
head(birth_data, 5)
```

```
## days num_births
## 1: 245 13
## 2: 246 15
## 3: 247 9
## 4: 248 21
## 5: 249 18
```

Now we'll take the table that has days, load it up & make a few calculations; this allows us to calculate a probability distribution, and a cumulative distribution function relatively easily!

```
birth_data[, `:=`(Date, due_date + (days - 280 + 1)), ] # Assumes 280d from last menstrual period for due date calc.
birth_data[, `:=`(frac, num_births/sum(num_births)), ]
birth_data[, `:=`(cdf, cumsum(frac)), ]
```

Let's take a quick look at the data (always good practice):

```
p1 <- ggplot(birth_data, aes(x = Date, y = frac)) + geom_area(fill = "skyblue") +
theme_bw() + ylab("Probability of Delivery") + ggtitle("Probability of Delivery by Date") +
scale_y_continuous(labels = percent)
p2 <- ggplot(birth_data, aes(x = Date, y = cdf)) + geom_area(fill = "skyblue") +
theme_bw() + ylab("Cumulative Probabilty of Delivery") + ggtitle("Cumulative Probability of Delivery by Date") +
scale_y_continuous(labels = percent)
grid.arrange(p1, p2, ncol = 2)
```

The sample size of this dataset is: (n=7493), and the resolution is only by 'days', so we see that the distribution certainly isn't smooth. Interestingly enough, the distribution doesn't look normal. This could be due to several factors, including induced labour, etc.

Let's take a quick look at the distribution (e.g. skew & kurtosis). The Cullen & Frey graph works well to show where our data falls relative to other distributions:

```
descdist(as.numeric(sample(birth_data$Date, size = 1e+06, replace = TRUE, prob = birth_data$frac)))
```

```
## summary statistics
## ------
## min: 16826 max: 16882
## median: 16860
## mean: 16859
## estimated sd: 9.657033
## estimated skewness: -0.7695845
## estimated kurtosis: 3.591334
```

Now, just a word of caution: You shouldn't use a distribution without understanding where the distribution comes from, and understanding if the true population may actually be described by that distribution... I don't really need to fit a distribution here - but I could select a Weibull distribution - it should fit the data reasonably well & is often used in time to failure type analysis... which has some similarities. Here's how this would fit:

```
fweib <- fitdist(as.numeric(sample(birth_data$Date, size = 1e+06, replace = TRUE,
prob = birth_data$frac)), "weibull")
dt_weib <- data.table(cdf = seq(0.001, 0.999, by = 0.001), Date = as.Date(qweibull(seq(0.001,
0.999, by = 0.001), shape = fweib$estimate["shape"], scale = fweib$estimate["scale"]),
origin = as.Date("1970-01-01")))
p2 + geom_line(data = dt_weib, colour = "red", lwd = 0.8)
```

So the weibull fits pretty nicely. The fit results are here, for someone who might be interested:

```
print(fweib)
```

```
## Fitting of the distribution ' weibull ' by maximum likelihood
## Parameters:
## estimate Std. Error
## shape 2112.648 1.602996600
## scale 16863.485 0.008414135
```

Now, of course because it's currently 2016-02-05, we've shown that delivery on dates before today didn't occur, and therefore the distribution starts to become truncated. Here's what's happening:

```
birth_data_limited <- birth_data[Date >= Sys.Date(), ]
birth_data_limited[, `:=`(frac, num_births/sum(num_births)), ]
birth_data_limited[, `:=`(cdf, cumsum(frac)), ]
# Plot Today's CDF over the Original
p2 + geom_area(aes(fill = "Original")) + geom_area(data = birth_data_limited,
aes(fill = "Today")) + scale_fill_manual(name = NULL, values = c(Original = "skyblue",
Today = "seagreen")) + theme(legend.position = "bottom")
```

## Am I going to miss that meeting?

Birdgeneau Baby Predictor: Shiny App

## Winning a Baby Pool

If you're looking to win a baby pool, there's no sure way - that's the nature of probabilities, but you can at least increase your chances of winning.

Here's the updated histogram, and the top 15 most likely days from the filtered distribution shown above:

Date | Percent |
---|---|

2016-03-01 | 5.5% |

2016-03-02 | 5.2% |

2016-02-29 | 4.9% |

2016-03-03 | 4.5% |

2016-03-04 | 4.3% |

2016-03-07 | 4.3% |

2016-03-06 | 4.2% |

2016-03-08 | 4.2% |

2016-02-27 | 4.2% |

2016-02-28 | 4.2% |

2016-03-05 | 4.1% |

2016-02-26 | 3.8% |

2016-02-23 | 3.2% |

2016-02-25 | 3.2% |

2016-03-09 | 3.1% |

## Comments