How to Teach Yourself R

(Or, “how to teach professionals to teach themselves R”).

Background: I taught myself R in 2014 from public web resources, and since then have steered several cohorts of data analysts at my organization through various R curricula, adapting based on their feedback.  This is geared toward people teaching themselves R outside of graduate school (I perceive graduate students to have more built-in applications and more time for learning, though I don’t speak from experience).  I say “students” below but I am referring to professionals.

I’ve tried combinations of Coursera’s Data Science Specialization, DataCamp’s R courses, and the “R for Data Science” textbook.  Here’s what I’ve learned about learning and teaching R and what I recommend.

The Reward vs. The Pain

R can do many valuable things, once you’re past the basics. It is also frustrating to learn.  A few learners are interested in R itself (they may eventually join the #rstats Twitter community), but most people have problems to solve.  For them, R is not a destination, but a means to an end.

Those students will stick with R so long as the value-to-frustration ratio is sufficiently high.  Their path to learning R must maximize the reward and minimize the pain.

(I’m one of those interested-in-R-itself people, and realized this divide the hard way when the learning resources that worked for me failed for people with a more practical motivation).

Maximizing The Reward

1. Start with the juiciest topics

Hadley Wickham explains his unconventional (and good, in my opinion) sequencing of topics in R for Data Science:

Starting with data ingest and tidying is sub-optimal because 80% of the time it’s routine and boring, and the other 20% of the time it’s weird and frustrating. That’s a bad place to start learning a new subject! Instead, we’ll start with visualisation and transformation of data that’s already been imported and tidied. That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth it.

Maximize the reward by starting with the highest-yield, most-compelling stuff.  Begin by teaching students to count and chart, or to do the things that got them interested in R – not by lecturing on R’s object classes!
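As a concrete example, a first session might be nothing more than counting and charting with dplyr and ggplot2 (a sketch using the built-in mpg data set, not a full lesson):

```r
library(dplyr)
library(ggplot2)

# Count cars by class in the built-in mpg data set...
mpg %>%
  count(class, sort = TRUE)

# ...then chart those counts
mpg %>%
  count(class) %>%
  ggplot(aes(x = reorder(class, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Vehicle class", y = "Number of models")
```

Two commands, one readable chart – a much juicier first taste of R than a lecture on object classes.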

2. Have a headache

Math teacher Dan Meyer asks, “if math is the aspirin, what’s the headache?” The same goes for learning R (and maybe anything): students will learn more when they have a problem they want to solve with R.  This is a powerful reward!  If a student already has a headache, good.  Go solve that problem and you’ll learn R in the process.  (Ideally your first problem will not be too complicated).

I first tried R when I wanted to calculate the median of a variable, for each combination of three other variables.  I wanted hundreds of medians and I could not do it in Excel.  The reward of solving my headache made it worthwhile to take the plunge.
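In dplyr terms, that headache comes down to a few lines (a sketch with hypothetical data and column names – `df`, `region`, `year`, `product`, and `price` are stand-ins):

```r
library(dplyr)

# Median of one variable for every combination of three grouping variables
df %>%
  group_by(region, year, product) %>%
  summarise(median_price = median(price, na.rm = TRUE)) %>%
  ungroup()
```

Hundreds of medians, one for each group combination, in a single pipeline – the kind of result Excel makes painful.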

If you don’t have a headache, you should get one.

Minimizing The Pain

It is so annoying to be a beginning R user.  Especially if you’re used to software like Excel or Tableau, where things just work.  There are a thousand stupid hurdles to trip new useRs:  installing packages, working directory problems, data ingest, obscure error messages, bad documentation…

As an advanced user, sometimes I forget about this.  Then a student shares a cryptic error message that seems to bear no relation to the problem they want to solve and I shudder.

To this end, I teach the tidyverse first.  Compared to base R, you’ll get usable results faster, there’s a flatter learning curve, and there are many demos online to draw from.  The apply functions have a place, sure, but throwing them at beginners is a good way to turn people off.  (Go read the documentation for ?tapply).  The [professional] students I work with do not care about lists or vector types right now; they have problems to solve.
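To see the difference, compare a grouped mean in base R and in the tidyverse (using the built-in iris data set):

```r
# Base R: returns a named array whose shape changes as you add grouping variables
tapply(iris$Sepal.Length, iris$Species, mean)

# tidyverse: always returns a data frame, and reads left to right
library(dplyr)
iris %>%
  group_by(Species) %>%
  summarise(mean_sepal_length = mean(Sepal.Length))
```

Both work, but the dplyr version is easier to read aloud, and its output drops straight into the next step of a pipeline.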

Ways to implement this approach

The best way to learn R is to do your work in R.  Having your own problem to solve provides not only motivation, but practice, which is the key to learning anything.

If you have a skill gap between the R you know and the R you need to solve your own headache, you may need to bridge the gap with some artificial practice.  The resources below are ways to build enough scaffolding through artificial problem solving that you can start doing your own meaningful work in R.

Here’s how I’ve found some popular resources to stack up, in the order I’ve tried them with students.

Coursera’s JHU Data Science Specialization

I started this sequence in 2014 as a new R user.  Since then I’ve sent others through it.  I enjoyed it, but most others have not.  A true beginner who is not extremely self-motivated will need support through the rough patches in these courses.

Summary of feedback from beginners: “confusing and frustrating, I don’t see how this is relevant to my work.”


Pros:

  • Practice problems and homework.  If you don’t have your own headaches, the homework and quizzes will provide them.
  • Accountability.  The weekly deadlines kept me successfully prioritizing professional development.  Without them it would have fallen by the wayside.
  • Swirl.  Swirl is a simple, interactive way to learn basic aspects of R, like a pared-down version of the DataCamp interface (below).  Swirl modules are nicely embedded into the early courses here, with accountability!


Cons:

  • Poor sequencing of topics/curriculum.  Last I checked, these courses still started off with R’s object classes, the apply functions, loops, etc., instead of higher-yield approaches like the tidyverse.  Some courses have wide, shallow scopes, leading to time spent dealing with OAuth tokens and parsing JSON – but just once, enough to get frustrated but not to learn.
    • They teach different R graphics approaches: ggplot2, base R, and lattice are all crammed into a few weeks.  Pick one – preferably ggplot2 – and teach it deeply.
  • Unconvincing headaches.  The potential for fun, motivating practice is so great with these topics, which made some of the lackluster Coursera assignments particularly disappointing.
    • An early assignment in R Programming  – this is the first challenging course and a gateway to the rest – asks students to write their first function: it will invert a matrix and cache the result.  This task is arcane, unmotivating, and difficult for the wrong reasons; its lack of similarity to the real-world problems my students want to solve with R was immensely frustrating.  (The hospital data wrangling functions were much better).
    • The course project for linear regression has students create an explanatory model using a low-quality, tiny data set (the 32-row mtcars data.frame).  This was limited and inauthentic.
    • The machine learning course project had a row ID variable present in both the training and evaluation set, essentially giving away the answers when trying to predict what group a record came from.
  • Poor alignment between assignments and lectures.  Homework and quiz problems were often unrelated to the week’s lectures.  This was justified by, “you should be a hacker and go figure it out.”  That’s reasonable for some level of disconnect, but when there’s a total mismatch it just adds pain for R learners and renders the lectures irrelevant.
  • Inconsistent lectures.  Lecture quality varies from compelling to near-useless, e.g., when a lecturer goes into detailed derivations of statistical formulas, in a course focused on applications.
  • Weak student engagement.  The peer feedback I received was rarely valuable, and plagiarism was rampant.  I googled snippets of code and comments from the projects I graded and found about half of them (!) to be obviously plagiarized.  Too bad: the community is one potential strength of the Coursera courses, but isn’t yet being realized effectively.

Overall, the Coursera JHU specialization is moderate reward, high pain.


DataCamp

I’ve worked through parts of different DataCamp courses to explore them, and have sent others through them.

Summary of feedback from beginners: “approachable, and I felt like I was getting it, but I wasn’t prepared to apply what we covered.”


Pros:

  • Good videos.  The combination of code and narration is easy to follow and the explanations are detailed.
  • Bite-sized pieces with interactive practice.  The in-browser programming interface is nice, as is the ability to get feedback and hints.


Cons:

  • Poor sequencing of topics/curriculum.  Look at the intro to R and intermediate R courses.  Starting beginners off with matrices and lists?  Loops, control flow, lapply?  Woof.  To get around this, I tried to cherry-pick chapters of the different courses, which proved unsatisfactory.  I ended up sending people to the introduction course, then courses in dplyr, data cleaning, ggplot2, and R Markdown.
  • Limited practice opportunities.  Students told me they’d play along with the videos successfully, but then go to work in R weeks later and find they’d remembered little and had few examples to look back on.  Independent practice is missing from DataCamp.  (Coursera’s homework, quizzes, and projects are a major differentiator in this respect.)
    • I ended up creating private practice projects to augment the DataCamp materials.
  • Heavily reliant on videos.  Not a problem if video works for you, but maybe you prefer other modes.

Previously, DataCamp had the additional “pro” of offering a generous non-profit discount, but this has unfortunately been discontinued.

Overall, DataCamp is low reward, moderately-low pain.

R For Data Science

Released in 2017, this textbook is freely available online.  I’ve browsed the book, and recommended sections to others, but haven’t worked through the complete text.  It’s too new for me to have feedback from users.


Pros:

  • Coherent topics and sequencing.  There’s a clear, opinionated philosophy that front-loads the highest-yield approaches (e.g., charting before data ingest).
  • Uses state-of-the-art packages.  It’s based in the tidyverse, and the nature of its main author (Hadley Wickham, who maintains the tidyverse) and scheme for staying updated (an open-source project where improvements are accepted as pull requests) means that it will stay current with the rapidly-developing world of R tools.
  • It’s free and open.  No logging in or paying.


Cons:

  • Practice problems seem thin and lack supports.  Unlike the curricula above, there are no hints, forums, or accountability.  This makes sense for a textbook, but it will be critical that students obtain their own additional, deeper practice.
  • No credential.  Some learners find it rewarding to earn points, grades, and certificates.
  • Narrow focus on the essentials.  This is intentional, and stated in the book.  Data science encompasses a vast array of topics, and the book is better for staying in its lane.  But if you want to learn more about say, inferential statistics, you’ll need to go elsewhere.

Overall, R for Data Science is moderate reward, low pain.

The Secret Sauce: Coaching

Looking at the [professional] students who have gotten the most out of the trainings above, they have something in common: they had someone to guide them.  In some cases this was more intensive, with weekly pair-coding sessions and code reviews, but at a minimum most students who successfully climbed the learning curve had someone they were comfortable emailing or IM’ing with questions.  Sometimes the response would be “look at this StackOverflow post,” other times it would be hopping on a screenshare to debug.

Coaching helps both sides of the reward/pain ratio: you get guided to the most fruitful approaches and achieve your results faster, while capping the time you’re stuck on any one obstacle.  Our data analysts have increased exposure to coaching through regular code reviews (for feedback and improvement, not just an accuracy check) and through a chat channel dedicated to R, where users can share ideas and crowd-source answers.

Unfortunately, coaching doesn’t scale.  It’s time-intensive and many people teaching themselves R won’t have access to a more advanced user – though ask around your company or program.  But there are proxies for coaching:

  • the Coursera course forums and TAs;
  • R friends you make at your local R meetup;
  • the online R community, including #rstats on Twitter;
  • StackOverflow (read existing questions and post new ones).

In closing: How I suggest learning R

  1. Get your own headache to solve – why do you want to learn R?
  2. Start trying to solve it, using tidyverse packages.
  3. Lean on your coach, peers, or online community when you need help.
  4. If you find you need to build your skill base before you can solve your own headaches:
    1. Work through R for Data Science with a friend.  Then:
    2. If you like videos and bite-sized practice, tell a more advanced R user about the headache you want to solve and ask them what chapters of DataCamp courses they think would be relevant.  You should pick and choose based on what’s useful in your personal work.
    3. If you want more in-depth practice problems, use the skills you learned from R for Data Science to work through the Coursera courses.
    4. Seek out more practice.  Follow along with worked problems explained by bloggers, work through the dplyr package vignettes, try an easy Kaggle competition.

This is an exciting time for teaching yourself R.  None of the curricular options I review here existed 5 years ago, nor did the suite of R packages I recommend starting with.  While they’re imperfect, I’m glad these options are available to the community, with low barriers to entry.  I appreciate the hard work that their creators put in and appreciate even more their continued improvement.

I hope this blog post becomes obsolete in the next 5 years, as the options for self-teaching R expand and mature along with the R ecosystem and community.

Can a Twitter bot increase voter turnout?

Summary: in 2015 I created a Twitter bot, @AnnArborVotes (code on GitHub).  I searched Twitter for 52,000 unique names from the Ann Arbor, MI voter rolls and matched 2,091 of them to Twitter accounts based nearby.  The bot then tweeted messages to a randomly-selected half of those matched individuals, encouraging them to vote in a local primary election that is ordinarily very low-turnout.

I then examined who actually voted (a matter of public record).  There was no overall difference between the treatment and control groups. I observed a promising difference in the voting rate when looking only at active Twitter users, i.e., those who had tweeted in the month before I visited their profile. These active users only comprised 7% of my matched voters, however, and the difference in this small subgroup was not statistically significant (n = 150, voting rates of 23% vs 15%, p = 0.28).
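For the curious, the subgroup comparison can be run as a two-sample proportion test in R (a sketch – the counts below are back-calculated from the rates reported above, assuming roughly equal arms of ~75 active users each, so they are approximate):

```r
# Approximate counts of voters among active Twitter users in each arm
voted <- c(17, 11)   # treatment, control (back-calculated from 23% and 15%)
total <- c(75, 75)   # approximate group sizes

# Two-sample test of equal proportions
prop.test(voted, total)
```

With groups this small, even an 8-point difference in voting rates doesn’t reach statistical significance – which is why the result reads as promising rather than conclusive.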

I gave a talk summarizing the experiment at Nerd Nite Ann Arbor that is accessible to laypeople (it was at a bar and meant to be entertainment):

This video is hosted by the amazing Ann Arbor District Library – here is their page with multiple formats of this video and a summary of the talk.  Here are the slides from the talk (PDF), but they’ll make more sense with the video’s voiceover.

The full write-up:

I love the R programming language (#rstats) and wanted a side project.  I’d been curious about Twitter bots.  And I’m vexed by how low voter turnout is in local elections.  Thus, this experiment.

Calculating likelihood of X% of entrants advancing in an NFL Survivor Pool

or: Yes, Week 2 of the 2015 NFL season probably was the toughest week for a survivor pool, ever.

Week 2 of the 2015 NFL season was rife with upsets, with 9 of 16 underdogs winning their games.  This wreaked havoc on survivor pools (aka eliminator pools), where the object is to pick a single team to win each week.  The six most popular teams (according to Yahoo! sports) all lost:

[Image: Yahoo! Sports Week 2 survivor pick distribution, 2015]

(image from Yahoo! Sports, backed up here as it looks like the URL will not be stable into next year)

About 4.6% of Yahoo! participants survived the week (I looked only at the top 11 picks due to data availability, see the GitHub file below for more details).  This week left me wondering: was this the greatest % of survivor pool entrants to lose in a single week, ever?  And what were the odds of this happening going into this week?

I wrote some quick code to run a million simulations of the 2nd week of the 2015 NFL season (available here on GitHub).


Given the projected win probabilities (based on Vegas odds) and the pick distributions, only 684 of the 1,000,000 simulations yielded a win rate below the 4.6% actual figure.  Thus the likelihood that only 4.6% of entrants would make it through the week was 0.0684%, less than a tenth of one percent.  Or to put it another way, this event had a 1-in-1,462 chance of occurring.
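The simulation itself is simple in outline: draw each game’s outcome from its Vegas-implied win probability, then add up the share of the pool whose pick won.  The actual code is in the GitHub repo linked above; here is a minimal sketch of the approach with made-up win probabilities and pick shares, not the real 2015 inputs:

```r
set.seed(1)

# Hypothetical inputs: win probability and share of the pool for each popular pick
win_prob   <- c(0.80, 0.75, 0.70, 0.65, 0.60, 0.55)
pick_share <- c(0.50, 0.15, 0.12, 0.10, 0.08, 0.05)

n_sims <- 1e5  # 1e6 in the original run
survive_rate <- replicate(n_sims, {
  wins <- runif(length(win_prob)) < win_prob  # simulate each game once
  sum(pick_share[wins])                       # share of the pool whose pick won
})

# How often does the simulated survival rate fall below the observed 4.6%?
mean(survive_rate < 0.046)
```

That final proportion is the estimated likelihood of a week at least as brutal as the one that actually happened.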

Here are the results of the simulation:

[Image: histogram of simulated survivor pool win rates, with median, 1st-percentile, and actual results marked]

  1. Blue line: median expected result, 80.6% winners
  2. Yellow line: 1st percentile result, 13.8% winners (to give you a sense of how rare a result this week was)
  3. Red line: actual result, 4.6% winners

So was it the greatest week for survivor pool carnage ever?  Probably.  You might never see a week like it again in your lifetime.

P.S. This distribution is pretty cool, with the sudden drop-off and gradual climb starting at x = 0.50.  This is caused by 50% of the pool picking the Saints, the most likely team to win.  I wouldn’t say this is a bimodal distribution, exactly – is there a term for this?