Categories
#rstats Data analysis Sports

Double check your work (Kaggle Women’s NCAA tournament 2019)

I’m writing about an attention-to-detail error immediately after realizing it.  It probably won’t matter, but if it ends up costing me a prize worth thousands of dollars, I’ll feel salty.  I thought I’d grouse in advance just in case.

The last few years I’ve entered Kaggle’s March Madness data science prediction contests.  I had a good handle on the women’s tournament last year, finishing in the top 10%.  But my prior data source – which I felt set me apart, as I scraped it myself – wasn’t available this year.  So, living my open-source values, I made a quick submission by forking a repo that a past winner shared on Kaggle and adding some noise.

Now, to win these contests – with a $25k prize purse – you need to make some bets, coding individual games as 1 or 0 to indicate 100% confidence that a team will win.  If you get it right, your prediction is perfect, generating no penalty (“log-loss”).  Get it completely wrong and the scoring rule generates a near-infinite penalty for the magnitude of your mistake – your entry is toast.
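For reference, here’s a minimal sketch of the metric in R – the clipping constant is my assumption about how Kaggle keeps a wrong 100% bet finite, but the shape of the penalty is the point:

```r
# per-game log-loss: y = 1 if the team won, p = your predicted probability
log_loss <- function(y, p) {
  p <- pmin(pmax(p, 1e-15), 1 - 1e-15)  # assumed Kaggle-style clipping away from exact 0/1
  -(y * log(p) + (1 - y) * log(1 - p))
}

log_loss(y = 1, p = 1)  # nailed the 100% bet: penalty of ~0
log_loss(y = 0, p = 1)  # completely wrong: ~34.5 per game -- entry is toast
```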

You can make two submissions, so I entered one with plain predictions – “vanilla” – and one where I spiced it up with a few hard-coded bets.  In my augmented Women’s tournament entry, I wagered that Michigan, Michigan State, and Buffalo would each win their first-round games.  The odds of all three winning were only about 10%, but if it happened, I thought that might be enough for me to finish in the money.
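(That ~10% is just the product of the three individual win probabilities – made-up numbers here, since the real ones came from the forked model:)

```r
# hypothetical per-game win probabilities for the three underdog bets
prod(c(0.50, 0.47, 0.43))  # ~0.10, i.e. roughly a 1-in-10 parlay
```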

Michigan and Buffalo both won today!  And yet I found myself in the middle of the leaderboard.  I had a sinking feeling.  And indeed, Kaggle showed the same log-loss score for both entries, and I was horrified when I confirmed:

[Image: a comparison of my vanilla and spiced-up predictions]
These should not be identical.

In case Michigan State wins tomorrow and this error ends up costing me a thousand bucks in early April, the commit in question will be my proof that I had a winning ticket and blew it.

Comment if you see the simple mistake that did me in:

[Image: the offending code]
Where is an AI code reviewer to suggest this doesn’t do what I thought it did?
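I won’t spoil it, but one plausible miniature of the genre – invented names and data, not my literal commit – is forgetting that dplyr’s mutate() returns a modified copy:

```r
library(dplyr)
library(readr)

# invented example, not the actual commit
preds <- tibble(ID   = c("game_mich", "game_msu", "game_buff"),
                Pred = c(0.55, 0.52, 0.48))
upset_ids <- c("game_mich", "game_msu", "game_buff")

# bug: mutate() returns a new data frame, and nothing captures it...
preds %>%
  mutate(Pred = if_else(ID %in% upset_ids, 1, Pred))

# ...so this writes the untouched vanilla predictions
write_csv(preds, "submission_spiced.csv")
```

The fix is a one-token assignment (preds <- preds %>% mutate(...)) – exactly the kind of thing a basic output check would have caught.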

As of this writing – 9 games in – I’m in 294th place out of 505 with a log-loss of 0.35880.  With the predictions above, I’d be in 15th place with a log-loss of 0.1953801, and ready to benefit further from my MSU prediction tomorrow.

The lesson is obvious: check my work!  I consider myself strong in that regard, which makes this especially painful.  I could have looked closely at my code, sure, but the fundamental check would have been to plot the two prediction sets against each other.
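Something like this, with invented file names, would have taken thirty seconds:

```r
library(readr)
library(ggplot2)

vanilla <- read_csv("submission_vanilla.csv")
spiced  <- read_csv("submission_spiced.csv")

# with real spice, a few points should sit at 0 or 1, far off the diagonal;
# a perfect 45-degree line means the two entries are identical
ggplot(data.frame(vanilla = vanilla$Pred, spiced = spiced$Pred),
       aes(x = vanilla, y = spiced)) +
  geom_point(alpha = 0.4) +
  geom_abline(linetype = "dashed")
```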

That lesson stands, even if the Michigan State women fall tomorrow and render my daring entry, and this post, irrelevant.  I’m not sure I’ll make time for entering these competitions next year; this would be a sour note to end on.

Categories
#rstats Data analysis survivor pool

Calculating likelihood of X% of entrants advancing in an NFL Survivor Pool

or: Yes, Week 2 of the 2015 NFL season probably was the toughest week for a survivor pool, ever.

Week 2 of the 2015 NFL season was rife with upsets, with 9 of 16 underdogs winning their games.  This wreaked havoc on survivor pools (aka eliminator pools), where the object is to pick a single team to win each week.  The six most popular teams (according to Yahoo! sports) all lost:

[Image: Yahoo! survivor pool pick distribution, week 2 of 2015]

(image from Yahoo! Sports, backed up here as it looks like the URL will not be stable into next year)

About 4.6% of Yahoo! participants survived the week (I looked only at the top 11 picks due to data availability; see the GitHub file below for more details).  This week left me wondering: was this the greatest % of survivor pool entrants to lose in a single week, ever?  And what were the odds of this happening going into this week?

I wrote some quick code to run a million simulations of the 2nd week of the 2015 NFL season (available here on GitHub).
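The gist of the approach, with illustrative inputs (the real win probabilities and pick shares are in the repo):

```r
set.seed(2015)
n_sims <- 1e6

# illustrative numbers only -- the real Vegas-implied win probabilities
# and Yahoo! pick shares live in the GitHub file
win_prob   <- c(0.83, 0.74, 0.70, 0.66, 0.63, 0.60)
pick_share <- c(0.50, 0.15, 0.12, 0.10, 0.08, 0.05)

# one simulated week per row: did each popular pick win its game?
wins <- matrix(rbinom(n_sims * length(win_prob), 1, win_prob),
               ncol = length(win_prob), byrow = TRUE)

# fraction of the pool whose pick won, in each simulated week
surv_rate <- as.vector(wins %*% pick_share)

mean(surv_rate < 0.046)  # share of simulated weeks below the actual 4.6%
```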

Results

Given the projected win probabilities (based on Vegas odds) and the pick distributions, only 684 of the 1,000,000 simulations yielded a survival rate below the actual 4.6% figure.  Thus the likelihood that only 4.6% (or fewer) of entrants would make it through the week was 0.0684%, less than a tenth of one percent.  Or to put it another way, this event had a 1-in-1,462 chance of occurring.
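(The conversion, for the record:)

```r
684 / 1e6  # 0.000684, i.e. 0.0684%
1e6 / 684  # ~1462, i.e. a 1-in-1,462 chance
```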

Here are the results of the simulation:

[Image: distribution of simulated survival rates]

  1. Blue line: median expected result, 80.6% winners
  2. Yellow line: 1st percentile result, 13.8% winners (to give you a sense of how rare a result this week was)
  3. Red line: actual result, 4.6% winners
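Those three reference lines fall straight out of the simulated draws – using surv_rate from the sketch above:

```r
library(ggplot2)

# rough reproduction of the plot above (values will differ with made-up inputs)
ggplot(data.frame(surv_rate), aes(x = surv_rate)) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(xintercept = median(surv_rate), colour = "blue") +
  geom_vline(xintercept = quantile(surv_rate, 0.01), colour = "yellow") +
  geom_vline(xintercept = 0.046, colour = "red")
```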

So was it the greatest week for survivor pool carnage ever?  Probably.  You might never see a week like it again in your lifetime.

P.S. This distribution is pretty cool, with the sudden drop-off and gradual climb starting at x = 0.50.  This is caused by 50% of the pool picking the Saints, the most likely team to win.  I wouldn’t say this is a bimodal distribution, exactly – is there a term for this?