Category Archives: Data analysis

Strava traffic on William St. Bikeway

The William St. Bikeway officially opened last weekend, though it is not yet finished and is in fact entirely closed in segments as construction is finished.

I realized I should grab a “before” shot of the Strava cycling heatmap so I can eventually compare it to “after.” [the hardest part of data analysis is collecting the right data].  I took this November 1st, 2019, though a week ago would have been better:

heatmap showing cycling traffic on the Strava app

In it we see that William is less popular for East/West travel than either Liberty or Washington. This might have been due to its more peripheral location at the south edge of downtown, confusing lane changes, and higher traffic speeds.  The latter two are mitigated by the protected bike lane.

Will we see traffic spike? The biggest increase in ridership will likely be in the non-Strava-using crowd, i.e., regular people.  And that will be my explanation if this heatmap looks the same a year from now.  I’m not sure if those cycling for sport will find the protected lane more appealing.  The data service Strava Metro would allow for better analysis of this question, including  looking at those rides tagged as “commutes”, but I don’t have access to that data.

Incidentally, I’m curious about the “advisory bike lane” unprotected segment between First and Fourth.  With winter approaching, I don’t expect that segment to be painted anytime soon.  An informational poster on William St. describes how it will work and it doesn’t sound like anything I have seen around Ann Arbor.

Double check your work (Kaggle Women’s NCAA tournament 2019)

I’m writing about an attention-to-detail error immediately after realizing it.  It probably won’t matter, but if it ends up costing me a thousands-of-dollars prize, I’ll feel salty.  I thought I’d grouse in advance just in case.

The last few years I’ve entered Kaggle’s March Madness data science prediction contests.  I had a good handle on the women’s tournament last year, finishing in the top 10%.  But my prior data source – which I felt set me apart, as I scraped it myself – wasn’t available this year.  So, living my open-source values, I made a quick submission by forking a repo that a past winner shared on Kaggle and adding some noise.

Now, to win these contests – with a $25k prize purse – you need to make some bets, coding individual games as 1 or 0 to indicate 100% confidence that a team will win.  If you get it right, your prediction is perfect, generating no penalty (“log-loss”).  Get it completely wrong and the scoring rule generates a near-infinite penalty for the magnitude of your mistake – your entry is toast.

You can make two submissions, so I entered one with plain predictions – “vanilla” – and one where I spiced it up with a few hard-coded bets.  In my augmented Women’s tournament entry, I wagered that Michigan, Michigan State, and Buffalo would each win their first round games.  The odds of all three winning was was only about 10%, but if it happened, I thought that might be enough for me to finish in the money.

Michigan and Buffalo both won today!  And yet I found myself in the middle of the leaderboard.  I had a sinking feeling.  And indeed, Kaggle showed the same log-loss score for both entries, and I was horrified when I confirmed:

A comparison of my vanilla and spiced-up predictions
These should not be identical.

In case Michigan State wins tomorrow and this error ends up costing me a thousand bucks in early April, the commit in question will be my proof that I had a winning ticket and blew it.

Comment if you see the simple mistake that did me in:

Where is an AI code reviewer to suggest this doesn’t do what I thought it did?

As of this writing – 9 games in – I’m in 294th place out of 505 with a log-loss of 0.35880.  With the predictions above, I’d be in 15th place with a log-loss of 0.1953801, and ready to benefit further from my MSU prediction tomorrow.

The lesson is obvious: check my work!  I consider myself to be strong in that regard which makes this especially painful.  I could have looked closely at my code, sure, but the fundamental check would have been to plot the two prediction sets against each other.

That lesson stands, even if the Michigan State women fall tomorrow and render my daring entry, and this post, irrelevant.  I’m not sure I’ll make time for entering these competitions next year; this would be a sour note to end on.

Generating unique IDs using R

Here’s a function that generates a specified number of unique random IDs, of a certain length, in a reproducible way.

There are many reasons you might want a vector of unique random IDs.  In this case, I embed my unique IDs in SurveyMonkey links that I send via mail merge. This way I can control the emailing process, rather than having messages come from SurveyMonkey, but I can still identify the respondents.  If you are doing this for the same purpose, note that you first need to enable a custom variable in SurveyMonkey!  I call mine for simplicity.

The function

Here’s what you get:

This function could get stuck in the while-loop if your N exceeds the number of unique permutations of alphanumeric strings of length char_len.  There are length(pool) ^ char_len permutations available.  Under the default value of char_len = 5, that’s 62^5 combinations or 916,132,832.  This should not be a problem for most users.

On reproducible randomization

The ability to set the randomization seed is to aid in reproducing the ID vector.  If you’re careful, and using version control, you should be able to retrace what you did even without setting seed.  There are downsides to setting the same seed each time too, for instance, if your input list gets shuffled and you’re now assigning already-used codes to different users.

No matter how you use this function, think carefully about how to record and reuse values such that IDs stay consistent over time.

Exporting results for mail merging

Here’s what this might look like in practice if you want to generate these IDs, then merge them into SurveyMonkey links and export for sending in a mail merge.  In the example below, I generate both English- and Spanish-language links.

Note that I have created the custom variable a in SurveyMonkey, which is why I can say a= in the URL.

How to Teach Yourself R

(Or, “how to teach professionals to teach themselves R”).

Background: I taught myself R in 2014 from public web resources, and since then have steered several cohorts of data analysts at my organization through various R curricula, adapting based on their feedback.

This is geared toward people teaching themselves R outside of graduate school (I perceive graduate students to have more built-in applications and more time for learning, though I don’t speak from experience).  I say “students” below but I am referring to professionals.  This advice assumes little or no programming experience in other languages, e.g., people making the shift from Excel to R (I maintain that Excel is one of R’s chief competitors).  If you already work in say, Stata, you may face fewer frustrations (and might consider DataCamp’s modules geared specifically to folks in your situation). 

I’ve tried combinations of Coursera’s Data Science Specialization, DataCamp’s R courses, and the “R for Data Science” textbook.  Here’s what I’ve learned about learning and teaching R and what I recommend.

I see three big things that will help you learn R:

  1. A problem you really want to solve
  2. A self-study resource
  3. A coach/community to help you

Continue reading How to Teach Yourself R

Can a Twitter bot increase voter turnout?

Summary: in 2015 I created a Twitter bot, @AnnArborVotes (code on GitHub).  (2018 Sam says: after this project ceased I gave the Twitter handle to local civics hero Mary Morgan at A2CivCity).  I searched Twitter for 52,000 unique voter names, matching names from the Ann Arbor, MI voter rolls to Twitter accounts based nearby.  The bot then tweeted messages to a randomly-selected half of those 2,091 matched individuals, encouraging them to vote in a local primary election that is ordinarily very low-turnout.

I then examined who actually voted (a matter of public record).  There was no overall difference between the treatment and control groups. I observed a promising difference in the voting rate when looking only at active Twitter users, i.e., those who had tweeted in the month before I visited their profile. These active users only comprised 7% of my matched voters, however, and the difference in this small subgroup was not statistically significant (n = 150, voting rates of 23% vs 15%, p = 0.28).

I gave a talk summarizing the experiment at Nerd Nite Ann Arbor that is accessible to laypeople (it was at a bar and meant to be entertainment):

This video is hosted by the amazing Ann Arbor District Library – here is their page with multiple formats of this video and a summary of the talk.  Here are the slides from the talk (PDF), but they’ll make more sense with the video’s voiceover.

The full write-up:

I love the R programming language (#rstats) and wanted a side project.  I’d been curious about Twitter bots.  And I’m vexed by how low voter turnout is in local elections.  Thus, this experiment.
Continue reading Can a Twitter bot increase voter turnout?

Calculating likelihood of X% of entrants advancing in an NFL Survivor Pool

or: Yes, Week 2 of the 2015 NFL season probably was the toughest week for a survivor pool, ever.

Week 2 of the 2015 NFL season was rife with upsets, with 9 of 16 underdogs winning their games.  This wreaked havoc on survivor pools (aka eliminator pools), where the object is to pick a single team to win each week.  The six most popular teams (according to Yahoo! sports) all lost:

yahoo wk 2 picks 2015

(image from Yahoo! Sports, backed up here as it looks like the URL will not be stable into next year)

About 4.6% of Yahoo! participants survived the week (I looked only at the top 11 picks due to data availability, see the GitHub file below for more details).  This week left me wondering: was this the greatest % of survivor pool entrants to lose in a single week, ever?  And what were the odds of this happening going into this week?

I wrote some quick code to run a million simulations of the 2nd week of the 2015 NFL season (available here on GitHub).

Results

Given the projected win probabilities (based on Vegas odds) and the pick distributions, only 684 of the 1,000,000 simulations yielded a win rate below the 4.6% actual figure.  Thus the likelihood that only 4.6% of entrants would make it through the week was 0.0684%, less than a tenth of one percent.  Or to put it another way, this event had a 1-in-1,462 chance of occurring.

Here are the results of the simulation:

simulation results

  1. Blue line: median expected result, 80.6% winners
  2. Yellow line: 1st percentile result, 13.8% winners (to give you a sense of how rare a result this week was)
  3. Red line: actual result, 4.6% winners

So was it the greatest week for survivor pool carnage ever?  Probably.  You might never see a week like it again in your lifetime.

P.S. This distribution is pretty cool, with the sudden drop off and gradual climb starting at x = 0.50.  This is caused by 50% of the pool picking the Saints, the most likely team to win.  I wouldn’t say this is a bimodal distribution, exactly – is there a term for this?