# Advent of Code 2017 in #rstats: Day 2

After reading this puzzle, I was excited and concerned to see “spreadsheet”:

1. Excited: R is good at rectangular numeric data, and I work with a lot of spreadsheets
2. Concerned: Will I be starting with a single long string?  How do I get that into rectangular form?

After considering trying to chop up one long string, I copy-pasted my input to a text file and `read.delim()` worked on the first try.

Part 1 was simple; Part 2 was tricky.  The idea of row-wise iteration got in my head and I spent a lot of time with `apply` and `lapply`… I still hold a grudge against them from battles as a new R user, but having (mostly) bent them to my will, I turned to them here.  Maybe I should spend more time with the `purrr` package.

### Part 1

Straightforward:
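For reference, here's a minimal sketch of Part 1 using the puzzle's published sample rows rather than my real input (which came from the pasted text file):

```r
# Read the sample spreadsheet; fill = TRUE pads the ragged second row with NA.
input <- read.delim(text = "5\t1\t9\t5\n7\t5\t3\n2\t4\t6\t8",
                    header = FALSE)

# Checksum: for each row, largest value minus smallest, then sum the differences.
checksum <- sum(apply(input, 1, function(row) {
  max(row, na.rm = TRUE) - min(row, na.rm = TRUE)
}))
checksum  # 18 for the sample input
```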

### Part 2

Got the idea quickly, but executing took a bit longer.  I decided right away that I would:

1. Write a function that tackled one row (as a vector) at a time
2. Have that function divide every value in the vector by every other value in the vector
3. Have it pick out the integer quotient that wasn't 1, along with its reciprocal, and divide them
4. Run this function row-wise on the input
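A sketch of that row-wise function, run on the puzzle's sample rows (the function I actually wrote differed in its details):

```r
# For one row: divide every value by every other value, then keep the
# quotient that's a whole number, excluding the 1s on the diagonal.
divide_row <- function(row) {
  row <- row[!is.na(row)]
  quotients <- outer(row, row, "/")
  quotients[quotients %% 1 == 0 & quotients != 1][1]
}

# Sample rows from the puzzle; the answer is the row-wise sum, 4 + 3 + 2 = 9.
m <- rbind(c(5, 9, 2, 8),
           c(9, 4, 7, 3),
           c(3, 8, 6, 5))
sum(apply(m, 1, divide_row))  # 9
```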

Actually building up that function took ~20 minutes.  I felt confident the approach would work, so I didn't stop to consider anything else… I'm excited to see other solutions after I publish this, as maybe there's something elegant I missed.

Along the way I learned: `is.integer` doesn’t do what I thought it would.  Come on, base R.
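Specifically, `is.integer()` tests how a number is stored, not whether it's mathematically whole:

```r
# is.integer() checks the storage type, not the value:
is.integer(4)    # FALSE -- 4 is stored as a double
is.integer(4L)   # TRUE  -- the L suffix makes an integer literal
# To test whether a value is a whole number, use modulo instead:
4 %% 1 == 0      # TRUE
4.5 %% 1 == 0    # FALSE
```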

I don’t work with matrices much (not a mathematician, don’t have big data) and was somewhat vexed by the matrix -> data.frame conversion.
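For anyone hitting the same wall, the basic conversion is one call (the columns pick up default names):

```r
# A small matrix -> data.frame conversion; columns become V1, V2, V3.
m  <- matrix(1:6, nrow = 2)
df <- as.data.frame(m)
```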

# Advent of Code 2017 in #rstats: Day 1

Today’s puzzle is a good example of thinking “R-ishly”.   In R, I find it easier to compare vectors than to compare strings, and easier to work with vectorized functions than to loop.  So my code starts by splitting the input into a vector.

Part 1:

Split the input into a vector, add the first value to the end, and use dplyr’s lead() function to compare each value to the following value:
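A sketch of that approach, using the puzzle's sample input "1122" (whose answer is 3):

```r
library(dplyr)

# Split the input into a numeric vector and append the first value to the
# end so the final digit has a "following" digit to compare against.
digits  <- as.numeric(strsplit("1122", "")[[1]])
wrapped <- c(digits, digits[1])

# Sum the values that match the value immediately after them:
answer <- sum(wrapped[wrapped == lead(wrapped)], na.rm = TRUE)
answer  # 3
```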

Part 2:

This challenge is simpler, in my opinion.  You can halve the vector, then compare the first half to the second half, summing the matches (don’t forget to double that result).
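A sketch with the sample input "12131415" (whose answer is 4):

```r
digits <- as.numeric(strsplit("12131415", "")[[1]])
half   <- length(digits) / 2

# Compare the first half to the second half, sum the matches, and double:
first  <- digits[1:half]
second <- digits[(half + 1):length(digits)]
answer <- 2 * sum(first[first == second])
answer  # 4
```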

# How to Teach Yourself R

(Or, “how to teach professionals to teach themselves R”).

Background: I taught myself R in 2014 from public web resources, and since then have steered several cohorts of data analysts at my organization through various R curricula, adapting based on their feedback.

This is geared toward people teaching themselves R outside of graduate school (I perceive graduate students to have more built-in applications and more time for learning, though I don’t speak from experience).  I say “students” below but I am referring to professionals.  This advice assumes little or no programming experience in other languages, e.g., people making the shift from Excel to R (I maintain that Excel is one of R’s chief competitors).  If you already work in, say, Stata, you may face fewer frustrations (and might consider DataCamp’s modules geared specifically to folks in your situation).

I’ve tried combinations of Coursera’s Data Science Specialization, DataCamp’s R courses, and the “R for Data Science” textbook.  Here’s what I’ve learned about learning and teaching R and what I recommend.  To teach yourself R, you need two things:

1. A problem you really want to solve
2. A self-study resource

# Can a Twitter bot increase voter turnout?

Summary: in 2015 I created a Twitter bot, @AnnArborVotes (code on GitHub).  (2018 Sam says: after this project ceased I gave the Twitter handle to local civics hero Mary Morgan at A2CivCity).  I searched Twitter for 52,000 unique voter names, matching names from the Ann Arbor, MI voter rolls to Twitter accounts based nearby.  The bot then tweeted messages to a randomly-selected half of those 2,091 matched individuals, encouraging them to vote in a local primary election that is ordinarily very low-turnout.

I then examined who actually voted (a matter of public record).  There was no overall difference between the treatment and control groups.  I observed a promising difference in the voting rate when looking only at active Twitter users, i.e., those who had tweeted in the month before I visited their profile.  These active users comprised only 7% of my matched voters, however, and the difference in this small subgroup was not statistically significant (n = 150, voting rates of 23% vs. 15%, p = 0.28).
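For the curious, a subgroup comparison like that can be sketched with base R's `prop.test()`. The counts below are hypothetical -- I'm only assuming an even split of the ~150 active users, with voting counts chosen to land near the reported rates:

```r
# Hypothetical counts: ~150 active users split evenly between groups,
# with voting rates near the reported 23% (treatment) and 15% (control).
voted <- c(17, 11)   # voters in treatment and control (assumed counts)
n     <- c(75, 75)   # group sizes (assumed)

# Two-sample test for equality of proportions:
result <- prop.test(voted, n)
result
```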

I gave a talk summarizing the experiment at Nerd Nite Ann Arbor that is accessible to laypeople (it was at a bar and meant to be entertainment):

This video is hosted by the amazing Ann Arbor District Library – here is their page with multiple formats of this video and a summary of the talk.  Here are the slides from the talk (PDF), but they’ll make more sense with the video’s voiceover.

The full write-up:

I love the R programming language (#rstats) and wanted a side project.  I’d been curious about Twitter bots.  And I’m vexed by how low voter turnout is in local elections.  Thus, this experiment.

# Calculating likelihood of X% of entrants advancing in an NFL Survivor Pool

or: Yes, Week 2 of the 2015 NFL season probably was the toughest week for a survivor pool, ever.

Week 2 of the 2015 NFL season was rife with upsets, with 9 of 16 underdogs winning their games.  This wreaked havoc on survivor pools (aka eliminator pools), where the object is to pick a single team to win each week.  The six most popular teams (according to Yahoo! sports) all lost:

(image from Yahoo! Sports, backed up here as it looks like the URL will not be stable into next year)

About 4.6% of Yahoo! participants survived the week (I looked only at the top 11 picks due to data availability, see the GitHub file below for more details).  This week left me wondering: was this the greatest % of survivor pool entrants to lose in a single week, ever?  And what were the odds of this happening going into this week?

I wrote some quick code to run a million simulations of the 2nd week of the 2015 NFL season (available here on GitHub).
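The core of that simulation looks roughly like this -- a sketch with made-up win probabilities and pick shares, not the actual Vegas odds or Yahoo! pick distributions used in the real script:

```r
set.seed(2015)

# Hypothetical inputs: each favorite's win probability and the share of
# the pool that picked it (the real script uses 11 teams' actual numbers).
win_prob   <- c(0.75, 0.70, 0.65, 0.60, 0.55, 0.55)
pick_share <- c(0.50, 0.20, 0.12, 0.08, 0.06, 0.04)

n_sims <- 10000   # the real run used 1,000,000
survive_rate <- replicate(n_sims, {
  wins <- runif(length(win_prob)) < win_prob  # which favorites win this week
  sum(pick_share[wins])                       # fraction of entrants surviving
})

# Estimated probability of a week at least as brutal as the observed 4.6%:
mean(survive_rate < 0.046)
```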

Results

Given the projected win probabilities (based on Vegas odds) and the pick distributions, only 684 of the 1,000,000 simulations yielded a win rate below the 4.6% actual figure.  Thus the likelihood that only 4.6% of entrants would make it through the week was 0.0684%, less than a tenth of one percent.  Or to put it another way, this event had a 1-in-1,462 chance of occurring.

Here are the results of the simulation:

1. Blue line: median expected result, 80.6% winners
2. Yellow line: 1st percentile result, 13.8% winners (to give you a sense of how rare a result this week was)
3. Red line: actual result, 4.6% winners

So was it the greatest week for survivor pool carnage ever?  Probably.  You might never see a week like it again in your lifetime.

P.S. This distribution is pretty cool, with the sudden drop-off and gradual climb starting at x = 0.50.  This is caused by 50% of the pool picking the Saints, the most likely team to win.  I wouldn’t say this is a bimodal distribution, exactly – is there a term for this?