How to Teach Yourself R

(Or, “how to teach professionals to teach themselves R”).

Background: I taught myself R in 2014 from public web resources, and since then have steered several cohorts of data analysts at my organization through various R curricula, adapting based on their feedback.  This is geared toward people teaching themselves R outside of graduate school (I perceive graduate students to have more built-in applications and more time for learning, though I don’t speak from experience).  I say “students” below but I am referring to professionals.

I’ve tried combinations of Coursera’s Data Science Specialization, DataCamp’s R courses, and the “R for Data Science” textbook.  Here’s what I’ve learned about learning and teaching R and what I recommend.

The Reward vs. The Pain

R can do many valuable things, once you’re past the basics. It is also frustrating to learn.  A few learners are interested in R itself (they may eventually join the #rstats Twitter community), but most people have problems to solve.  For them, R is not a destination, but a means to an end.

Those students will stick with R so long as the value-to-frustration ratio is sufficiently high.  Their path to learning R must maximize the reward and minimize the pain.

(I’m one of those interested-in-R-itself people, and realized this divide the hard way when the learning resources that worked for me failed for people with a more practical motivation).

Maximizing The Reward

1. Start with the juiciest topics

Hadley Wickham explains his unconventional (and good, in my opinion) sequencing of topics in R for Data Science:

Starting with data ingest and tidying is sub-optimal because 80% of the time it’s routine and boring, and the other 20% of the time it’s weird and frustrating. That’s a bad place to start learning a new subject! Instead, we’ll start with visualisation and transformation of data that’s already been imported and tidied. That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth it.

Maximize the reward by starting with the highest-yield, most-compelling stuff.  Begin by teaching students to count and chart, or to do the things that got them interested in R – not by lecturing on R’s object classes!
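For instance, a first session can have students counting and charting within minutes.  Here’s a minimal sketch of that kind of opener, using the mpg data that ships with ggplot2 (not taken from any particular curriculum):

    library(dplyr)
    library(ggplot2)

    # Count vehicles by class in ggplot2's built-in mpg data
    class_counts <- mpg %>%
      count(class, sort = TRUE)

    # Chart those counts as a horizontal bar chart
    ggplot(class_counts, aes(x = reorder(class, n), y = n)) +
      geom_col() +
      coord_flip() +
      labs(x = "Vehicle class", y = "Number of models")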

2. Have a headache

Math teacher Dan Meyer asks, “if math is the aspirin, what’s the headache?” The same goes for learning R (and maybe anything): students will learn more when they have a problem they want to solve with R.  This is a powerful reward!  If a student already has a headache, good.  Go solve that problem and you’ll learn R in the process.  (Ideally your first problem will not be too complicated).

I first tried R when I wanted to calculate the median of a variable, for each combination of three other variables.  I wanted hundreds of medians and I could not do it in Excel.  The reward of solving my headache made it worthwhile to take the plunge.
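Today that headache is a few lines of tidyverse code – a sketch, with the built-in mtcars data standing in for my actual data:

    library(dplyr)

    # One median of mpg for every combination of three other variables --
    # the "hundreds of medians" problem that stumped me in Excel
    mtcars %>%
      group_by(cyl, gear, am) %>%
      summarise(median_mpg = median(mpg))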

If you don’t have a headache, you should get one.

Minimize the pain

It is so annoying to be a beginning R user.  Especially if you’re used to software like Excel or Tableau, where things just work.  There are a thousand stupid hurdles to trip new useRs:  installing packages, working directory problems, data ingest, obscure error messages, bad documentation…

As an advanced user, sometimes I forget about this.  Then a student shares a cryptic error message that seems to bear no relation to the problem they want to solve and I shudder.

To this end, I teach the tidyverse first.  Compared to base R, you’ll get usable results faster, there’s a flatter learning curve, and there are many demos online to draw from.  The apply functions have a place, sure, but throwing them at beginners is a good way to turn people off.  (Go read the documentation for ?tapply).  The [professional] students I work with do not care about lists or vector types right now; they have problems to solve.
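To make the comparison concrete, here’s the same grouped summary in base R and in the tidyverse (a quick sketch using the built-in mtcars data):

    # Base R: tapply returns a named matrix, with NA for empty combinations
    tapply(mtcars$mpg, list(mtcars$cyl, mtcars$gear), median)

    # Tidyverse: the same summary reads like a sentence and returns a tidy data frame
    library(dplyr)
    mtcars %>%
      group_by(cyl, gear) %>%
      summarise(median_mpg = median(mpg))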

Ways to implement this approach

The best way to learn R is to do your work in R.  Having your own problem to solve provides not only motivation, but practice, which is the key to learning anything.

If you have a skill gap between the R you know and the R you need to solve your own headache, you may need to bridge the gap with some artificial practice.  The resources below are ways to build enough scaffolding through artificial problem solving that you can start doing your own meaningful work in R.

Here’s how I’ve found some popular resources to stack up, in the order I’ve tried them with students.

Coursera’s JHU Data Science Specialization

I started this sequence in 2014 as a new R user.  Since then I’ve sent others through it.  I enjoyed it, but most others have not.  A true beginner who is not extremely self-motivated will need support through the rough patches in these courses.

Summary of feedback from beginners: “confusing and frustrating, I don’t see how this is relevant to my work.”

Pros:

  • Practice problems and homework.  If you don’t have your own headaches, the homework and quizzes will provide them.
  • Accountability.  The weekly deadlines kept professional development a priority for me; without them it would have fallen by the wayside.
  • Swirl.  Swirl is a simple, interactive way to learn basic aspects of R, like a pared-down version of the DataCamp interface (below).  Swirl modules are nicely embedded into the early courses here, with accountability!
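If you want to try Swirl on its own, outside of Coursera, getting started takes just a few lines at the R console:

    # Install and launch Swirl's interactive lessons
    install.packages("swirl")
    library(swirl)
    swirl()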

Cons:

  • Poor sequencing of topics/curriculum.  Last I checked, these courses still started off with R’s object classes, the apply functions, loops, etc. instead of higher-yield approaches like the tidyverse.  Some courses have wide, shallow scopes, leading to time spent dealing with OAuth tokens and parsing JSON – but just once, enough to get frustrated but not learn.
    • They teach different R graphics approaches: ggplot2, base R, and lattice are all crammed into a few weeks.  Pick one – preferably ggplot2 – and teach it deeply.
  • Unconvincing headaches.  The potential for fun, motivating practice is so great with these topics, which made some of the lackluster Coursera assignments particularly disappointing.
    • An early assignment in R Programming – this is the first challenging course and a gateway to the rest – asks students to write their first function: it will invert a matrix and cache the result.  This task is arcane, unmotivating, and difficult for the wrong reasons; its lack of similarity to the real-world problems my students want to solve with R was immensely frustrating.  (The hospital data wrangling functions were much better).
    • The course project for linear regression has students create an explanatory model using a low-quality, tiny data set (the 32-row mtcars data.frame).  This was limited and inauthentic.
    • The machine learning course project had a row ID variable present in both the training and evaluation set, essentially giving away the answers when trying to predict what group a record came from.
  • Poor alignment between assignments and lectures.  Homework and quiz problems were often unrelated to the week’s lectures.  This was justified by, “you should be a hacker and go figure it out.”  That’s reasonable for some level of disconnect, but when there’s a total mismatch it just adds pain for R learners and renders the lectures irrelevant.
  • Inconsistent lectures.  Lecture quality varies from compelling to near-useless, e.g., when a lecturer goes into detailed derivations of statistical formulas, in a course focused on applications.
  • Weak student engagement.  The peer feedback I received was rarely valuable, and plagiarism was rampant.  I googled snippets of code and comments from the projects I graded and found about half of them (!) to be obviously plagiarized.  Too bad: the community is one potential strength of the Coursera courses, but isn’t yet being realized effectively.

Overall, the Coursera JHU specialization is moderate reward, high pain.

DataCamp

I’ve worked through parts of different courses to explore them, and have sent others through them.

Summary of feedback from beginners: “approachable, and I felt like I was getting it, but it didn’t prepare us to apply what we covered.”

Pros:

  • Good videos.  The combination of code and narration is easy to follow and the explanations are detailed.
  • Bite-sized pieces with interactive practice.  The in-browser programming interface is nice, as is the ability to get feedback and hints.

Cons:

  • Poor sequencing of topics/curriculum. Look at the intro to R and intermediate R courses.  Starting beginners off with matrices and lists?  Loops, control flow, lapply?  Woof.  To get around this, I tried to cherry-pick chapters of the different courses, which proved unsatisfactory.  I ended up sending people to the introduction course, then courses in dplyr, data cleaning, ggplot2, and R Markdown.
  • Limited practice opportunities.  Students told me they’d play along with the videos successfully, but then go to work in R weeks later and find they’d retained little and had few examples to look back on.  Independent practice is missing from DataCamp.  (Coursera’s homework, quizzes, and projects are a major differentiator in this respect).
    • I ended up creating private practice projects to augment the DataCamp materials.
  • Heavily reliant on videos.  Not a problem if video works for you, but maybe you prefer other modes.

Previously, DataCamp had the additional “pro” of offering a generous non-profit discount, but this has unfortunately been discontinued.

Overall, DataCamp is low reward, moderately-low pain.

R For Data Science

Released in 2017, this textbook is freely available online.  I’ve browsed the book, and recommended sections to others, but haven’t worked through the complete text.  It’s too new for me to have feedback from users.

Pros:

  • Coherent topics and sequencing.  There’s a clear, opinionated philosophy that front-loads the highest-yield approaches (e.g., charting before data ingest).
  • Uses state-of-the-art packages.  It’s based in the tidyverse, and its main author (Hadley Wickham, who maintains the tidyverse) and its scheme for staying updated (an open-source project where improvements are accepted as pull requests) mean that it will stay current with the rapidly-developing world of R tools.
  • It’s free and open.  No logging in or paying.

Cons:

  • Practice problems seem thin and lack supports.  Unlike the curricula above, there are no hints, forums, or accountability.  This makes sense for a textbook, but it will be critical that students obtain their own additional, deeper practice.
  • No credential.  Some learners find it rewarding to earn points, grades, and certificates.
  • Narrow focus on the essentials.  This is intentional, and stated in the book.  Data science encompasses a vast array of topics, and the book is better for staying in its lane.  But if you want to learn more about, say, inferential statistics, you’ll need to go elsewhere.

Overall, R for Data Science is moderate reward, low pain.

The Secret Sauce: Coaching

The [professional] students who have gotten the most out of the trainings above have something in common: they had someone to guide them.  In some cases this was more intensive, with weekly pair-coding sessions and code reviews, but at a minimum most students who successfully climbed the learning curve had someone they were comfortable emailing or IM’ing with questions.  Sometimes the response would be “look at this StackOverflow post,” other times it would be hopping on a screenshare to debug.

Coaching helps both sides of the reward/pain ratio: you get guided to the most fruitful approaches and achieve your results faster, while capping the time you’re stuck on any one obstacle.  Our data analysts have increased exposure to coaching through regular code reviews (for feedback and improvement, not just an accuracy check) and through a chat channel dedicated to R, where users can share ideas and crowd-source answers.

Unfortunately, coaching doesn’t scale.  It’s time-intensive and many people teaching themselves R won’t have access to a more advanced user – though ask around your company or program.  But there are proxies for coaching:

  • the Coursera course forums and TAs;
  • R friends you make at your local R meetup;
  • the online R community, including #rstats on Twitter;
  • StackOverflow (read existing questions and post new ones).

In closing: How I suggest learning R

  1. Get your own headache to solve – why do you want to learn R?
  2. Start trying to solve it, using tidyverse packages.
  3. Lean on your coach, peers, or online community when you need help.
  4. If you find you need to build your skill base before you can solve your own headaches:
    1. Work through R for Data Science with a friend.  Then:
    2. If you like videos and bite-sized practice, tell a more advanced R user about the headache you want to solve and ask them what chapters of DataCamp courses they think would be relevant.  You should pick and choose based on what’s useful in your personal work.
    3. If you want more in-depth practice problems, use the skills you learned from R for Data Science to work through the Coursera courses.
    4. Seek out more practice.  Follow along with worked problems explained by bloggers, work through the dplyr package vignettes, try an easy Kaggle competition.

This is an exciting time for teaching yourself R.  None of the curricular options I review here existed 5 years ago, nor did the suite of R packages I recommend starting with.  While they’re imperfect, I’m glad these options are available to the community, with low barriers to entry.  I appreciate the hard work that their creators put in and appreciate even more their continued improvement.

I hope this blog post becomes obsolete in the next 5 years, as the options for self-teaching R expand and mature along with the R ecosystem and community.

Batch 71: Scio Pilsner

There’s a closet in my basement that hovers near 50 degrees in the winter.  So before spring arrives, I wanted to take advantage of my natural “temperature control” and brew a lager.  I don’t brew many lagers, but the provocative Brulosophy experiments on lager yeast fermentation temperature gave me peace of mind that if a warm spell comes through and the room gets a little warmer, it’ll be fine.

This was a convenience recipe in other regards.  I used dry yeast to avoid needing a massive starter and I used up half a bag of leftover pils malt.  And I ran off 5 gallons of wort before adding flame-out hops, to ferment with an ale yeast and use to top up the 53 gallon barrel at my house that is mostly full of funky dark saison.  It’s nice to get rid of the headspace in the barrel, and we wager no one will notice 10% of hoppy Belgian ale blended in.


Batch 70: Eclipse Lodi Ranch 11 Cab

I’ve made country wine with Concord grapes, and wine from a kit that cost $2/bottle but tasted like $8/bottle.  But I’d rather drink beer than wine I can get for $8/bottle.  So I thought I’d try a kit that costs $6/bottle and see if it makes wine I actually want to drink.

In late 2016 I purchased the Winexpert Eclipse Lodi Ranch 11 Cabernet Sauvignon kit.  Not sure what year that makes the grapes.  I’m not going to write much about ingredients or process since I followed the kit directions, unless otherwise noted.

2017-02-04: “Brewed” this with my 2 y/o son.  Despite the helper, we kept good sanitation.  OG was around 1.093, though perhaps more sugars dissolved in later from the grape skins.

Fermentation temperature bounced around from the minimum (72F) up to the mid-80s, as I crudely warmed it in a Michigan basement in winter.

I stirred the grape skin bag back down into the must, near-daily, for the first week.

2017-02-13 – 9 days: racked to a glass carboy.  Gravity is about 0.992 – roughly (1.093 - 0.992) x 131 ≈ 13% ABV!  Wine yeasts don’t play.

Batch 68: Zingibier V (Spiced Ginger Belgian-style ale)

As a beginning homebrewer, I got lucky and struck gold.  Literally: my improvised recipe for an imperial spiced witbier won a gold medal at the 2010 National Homebrew Competition.

This is my fifth rebrewing of that recipe, tinkering with it each time.  I don’t often repeat in my brewing schedule, but I enjoy learning from iterations of this recipe.


Batch 67: Tart of Darkness clone sour stout

A friend came into a 53 gallon Knob Creek Single Barrel Reserve 9 year whiskey barrel that now lives in my basement and houses homebrew.  We’ve done a Russian Imperial Stout, a Scotch Ale, and an Oud Bruin.  The fresh barrel contributed a massive oak character, but over 3 batches and 1.5 years, the oak faded.  When the barrel naturally went sour during the Scotch Ale, we switched to intentionally soured beers and added 8 sachets of the Flemish Ale F4 blend from Blackman Yeast.

Next up is a sour stout, very low on hops (<10 IBUs), inspired by The Bruery’s Tart of Darkness.  The low IBUs are friendlier to souring microbes and also avoid the clashing of bitterness and acidity.


Batch 64: Barrel Aged Strong Scotch Ale on Black Raspberries

The 2nd beer in the Knob Creek barrel collaboration.

I brewed two 5-gallon shares of this beer, in collaboration with a fellow barrel member.  Others used different recipes.  We brewed in summer 2015, then aged the beer in the barrel for about 6 months, pulling it January 2016.  It went naturally sour in the barrel, making it lambic-esque in one respect: it soured spontaneously, from organisms present in the surrounding environment.

I aged my ~4.5 gallons on 1.5 lbs of wild black raspberries for another 4 months in a secondary carboy.  It took a while to carb up, the result of having aged for over a year; I’ll add yeast at bottling for future barrel-aged sours.  But it carbed up eventually, and it’s good.

Summer 2016: this beer placed 2nd in the American Wild Ale category at the 2016 Michigan Beer Cup.

March 2017: funny that I originally worried about whether this would carbonate; it has continued to ferment in the bottle and now gushes upon opening if not very cold.  Not coincidentally, it’s developing a more prominent Brett funk.  If I had a 4th slot for this year’s Nat’l Homebrew Competition, I might enter it – which also means it’s not one of my top 3 beers right now.  But it’s still quite nice.


Can a Twitter bot increase voter turnout?

Summary: in 2015 I created a Twitter bot, @AnnArborVotes (code on GitHub).  I searched Twitter for 52,000 unique voter names, matching names from the Ann Arbor, MI voter rolls to Twitter accounts based nearby.  The bot then tweeted messages to a randomly-selected half of those 2,091 matched individuals, encouraging them to vote in a local primary election that is ordinarily very low-turnout.
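The random split at the heart of the experiment takes only a few lines of R.  A sketch, with a stand-in data frame (the real code is in the GitHub repo):

    # Stand-in for the matched voter file; see the GitHub repo for the real version
    matched_voters <- data.frame(voter_id = 1:2091)

    set.seed(2015)  # make the random assignment reproducible
    n <- nrow(matched_voters)
    treated_rows <- sample(n, size = floor(n / 2))
    matched_voters$treated <- seq_len(n) %in% treated_rows

    table(matched_voters$treated)  # ~half treatment, ~half control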

I then examined who actually voted (a matter of public record).  There was no overall difference between the treatment and control groups. I observed a promising difference in the voting rate when looking only at active Twitter users, i.e., those who had tweeted in the month before I visited their profile. These active users only comprised 7% of my matched voters, however, and the difference in this small subgroup was not statistically significant (n = 150, voting rates of 23% vs 15%, p = 0.28).
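For the statistically curious, a two-sample proportion test is one way to run that subgroup comparison.  The counts below are hypothetical – chosen to be consistent with the rates above, since I’m not showing the exact split here – so the p-value will only roughly match mine:

    # Hypothetical counts consistent with n = 150 and voting rates of ~23% vs. ~15%
    voted  <- c(17, 11)  # treatment group, control group
    totals <- c(75, 75)

    prop.test(voted, totals)  # two-sample test of equal proportions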

I gave a talk summarizing the experiment at Nerd Nite Ann Arbor that is accessible to laypeople (it was at a bar and meant to be entertainment):

This video is hosted by the amazing Ann Arbor District Library – here is their page with multiple formats of this video and a summary of the talk.  Here are the slides from the talk (PDF), but they’ll make more sense with the video’s voiceover.

The full write-up:

I love the R programming language (#rstats) and wanted a side project.  I’d been curious about Twitter bots.  And I’m vexed by how low voter turnout is in local elections.  Thus, this experiment.

Expertise vs. Emotion at Ann Arbor City Council

Removed from scientific context, vaccinating your kid sounds crazy.  Let’s stick a needle in their arm and put disease and chemicals into their body.  To prevent an illness nobody you know has ever gotten.  And on top of your kid crying, and your own lack of experience with the disease, you have neighbors whispering in your ear (or posting loudly on social media) how dangerous vaccines are.

Instead of putting it to a popular vote, though, or listening to the loudest voices on your Facebook feed, you listen to your child’s pediatrician (I hope) and bodies of experts like the AMA and CDC, who unanimously cite overwhelming evidence in favor of vaccinations.

For every decision, there are gut feelings and personal opinions about the issue, and then there are the scientific arguments – what does the evidence say?  Most often, these come from experts in the field, who have devoted years to mastering the topic.

Would #a2council vaccinate?

The greatest conflicts in Ann Arbor politics are often driven by clashes between gut feelings (either voiced by citizens or held by council members) and expert opinions.


Simple Grocery Store Cider 2014

November 2014: Bought 5 gallons of Kapnick Orchards Cider at the grocery store, pitched a packet of Vintner’s Harvest MA33 yeast.  O.G. 1.046.

Mid-2015: I added a full 10″ cinnamon stick, on recommendation from several AABG club members.  The idea is to get hints of cinnamon that trigger associations with apple flavor (think apple pie), but stay below the threshold of identifiable cinnamon.  SG 0.993 (7% abv).

January 2016: kegged and added 2.5 tsp each of 10% K-Meta solution and potassium sorbate, along with table sugar (beet) to taste.  Over the year+ in a single vessel, the cider had dropped clear.

Measuring sugar to backsweeten:

  1. Pull a full hydrometer sample, measure (0.993), taste – way too dry, no balance to acidity
  2. Stir in a little sugar, taste, repeat; when it gets in the ballpark, measure gravity (1.009) – almost there
  3. Keep going – 1.015 was sweet, but balanced.  Maybe a little too sweet while still, but carbonation should even that out.  Target 1.013.
  4. Calculate the amount of sweetener needed.  I need a 20-point increase in gravity (from 0.993 -> 1.013) across 5 gallons, so 20 x 5 = 100 point-gallons.  One pound of table sugar yields 46 ppg (gravity points per pound of sugar, per gallon), so I need 100/46 ≈ 2.17 lbs of sugar.  See the snippet below.
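The same arithmetic as a few lines of R, to reuse next time (a small sketch; 46 ppg is the standard yield for table sugar):

    current_sg <- 0.993
    target_sg  <- 1.013
    volume_gal <- 5
    sugar_ppg  <- 46  # gravity points per pound of sugar, per gallon

    points_needed <- (target_sg - current_sg) * 1000 * volume_gal  # 20 x 5 = 100
    pounds_sugar  <- points_needed / sugar_ppg                     # 100 / 46 ~ 2.17 lbs
    pounds_sugar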

For reference, apparently Woodchuck Cider has a final gravity of 1.029!