How to Teach Yourself R

(Or, “how to teach professionals to teach themselves R”).

Background: I taught myself R in 2014 from public web resources, and since then have steered several cohorts of data analysts at my organization through various R curricula, adapting based on their feedback.

This is geared toward people teaching themselves R outside of graduate school (I perceive graduate students to have more built-in applications and more time for learning, though I don’t speak from experience).  I say “students” below but I am referring to professionals.  This advice assumes little or no programming experience in other languages, e.g., people making the shift from Excel to R (I maintain that Excel is one of R’s chief competitors).  If you already work in, say, Stata, you may face fewer frustrations (and might consider DataCamp’s modules geared specifically to folks in your situation).

I’ve tried combinations of Coursera’s Data Science Specialization, DataCamp’s R courses, and the “R for Data Science” textbook.  Here’s what I’ve learned about learning and teaching R and what I recommend.

I see three big things that will help you learn R:

  1. A problem you really want to solve
  2. A self-study resource
  3. A coach/community to help you

1. A problem you want to solve

The Reward vs. The Pain

R can do many valuable things, once you’re past the basics. It is also frustrating to learn.  A few learners are interested in R itself (they may eventually join the Twitter community), but most people have problems to solve.  For them, R is not a destination, but a means to an end.  Why do you want to learn R?  Maybe there is a problem that you can’t easily solve with your current tool.  Solving it is the reward.

Those students will stick with R so long as the value-to-frustration ratio is sufficiently high.  Their path to learning R must maximize the reward and minimize the pain.

(I’m one of those interested-in-R-itself people, and realized this divide the hard way when the learning resources that worked for me failed for people with a more practical motivation).

Maximize The Reward

Start with the juiciest topics

Hadley Wickham explains his unconventional (and good, in my opinion) sequencing of topics in R for Data Science:

“Starting with data ingest and tidying is sub-optimal because 80% of the time it’s routine and boring, and the other 20% of the time it’s weird and frustrating. That’s a bad place to start learning a new subject! Instead, we’ll start with visualisation and transformation of data that’s already been imported and tidied. That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth it.”

Maximize the reward by starting with the highest-yield, most-compelling stuff.  Begin by teaching students to count and chart, or to do the things that got them interested in R – not by lecturing on R’s object classes!
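
As a rough sketch of what “count and chart” can look like in a first session: the example below assumes the dplyr and ggplot2 packages are installed, and it uses ggplot2’s built-in mpg data set as a stand-in for a student’s own data.  The column and object names are illustrative only.

```r
library(dplyr)
library(ggplot2)

# Count: how many cars fall into each vehicle class?
# (mpg is a small example data set that ships with ggplot2.)
class_counts <- mpg %>%
  count(class, sort = TRUE)

class_counts

# Chart: the same counts as a horizontal bar chart, largest bar on top.
p <- ggplot(class_counts, aes(x = reorder(class, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Vehicle class", y = "Number of cars")

p
```

A few lines get a beginner from a data frame to a table and a picture, which is the kind of early payoff described above.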

Have a headache

Math teacher Dan Meyer asks, “if math is the aspirin, what’s the headache?” The same goes for learning R (and maybe anything): students will learn more when they have a problem they want to solve with R.  This is a powerful reward!  If a student already has a headache, good.  Go solve that problem and you’ll learn R in the process.  (Ideally your first problem will not be too complicated).

I first tried R when I wanted to calculate the median of a variable, for each combination of three other variables.  I wanted hundreds of medians and I could not do it in Excel.  The reward of solving my headache made it worthwhile to take the plunge.
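
For illustration, here is a minimal sketch of that kind of grouped-median calculation with dplyr.  My original data isn’t shown here, so this uses ggplot2’s built-in diamonds data set as a stand-in: cut, color, and clarity play the role of the three grouping variables, and price is the variable being summarized.

```r
library(dplyr)

# diamonds ships with ggplot2; it is only a stand-in for real work data.
medians <- ggplot2::diamonds %>%
  group_by(cut, color, clarity) %>%                # every combination of three variables
  summarise(
    median_price = median(price),                  # one median per combination
    n = n(),                                       # how many rows fed each median
    .groups = "drop"
  )

medians
```

The result has one row per combination that actually occurs in the data, which is the “hundreds of medians” situation that is painful to build by hand in a spreadsheet.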

If you don’t have a headache, you should get one.

Minimize The Pain

It is so annoying to be a beginning R user.  Especially if you’re used to software like Excel or Tableau, where things just work.  There are a thousand stupid hurdles to trip new useRs:  installing packages, working directory problems, data ingest, obscure error messages, bad documentation…

As an advanced user, sometimes I forget about this.  Then a student shares a cryptic error message that seems to bear no relation to the problem they want to solve and I shudder.

To this end, I teach the tidyverse first.  Compared to base R, you’ll get usable results faster, there’s a flatter learning curve, and there are many demos online to draw from.  The apply functions have a place, sure, but throwing them at beginners is a good way to turn people off.  (Go read the documentation for ?tapply).  The [professional] students I work with do not care about lists or vector types right now; they have problems to solve.
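
To make the comparison concrete, here is a small side-by-side sketch (assuming dplyr is installed, and using the built-in mtcars data purely as an example): the same grouped median computed with tapply() and with dplyr.

```r
library(dplyr)

# Base R: tapply() returns a matrix-like array, with the group values
# stored as row and column names rather than as data.
base_result <- tapply(mtcars$mpg, list(mtcars$cyl, mtcars$am), median)
base_result

# tidyverse: one row per group, with the grouping columns kept as ordinary
# columns that can be filtered, joined, or plotted in the next step.
tidy_result <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(median_mpg = median(mpg), .groups = "drop")
tidy_result
```

Both approaches give the same numbers.  The difference for a beginner is that the dplyr version reads as a sequence of steps and returns a data frame they already know how to work with.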

2. A self-study resource

The best way to learn R is to do your work in R.  Having your own problem to solve provides not only motivation, but practice, which is the key to learning anything.

If you have a skill gap between the R you know and the R you’ll need to solve your own headache, you may need to bridge the gap with some artificial practice.  The resources below are ways to build enough scaffolding through artificial problem solving that you can start doing your own meaningful work in R.

Here’s how I’ve found some popular resources to stack up, in the order I’ve tried them with students.

Coursera’s JHU Data Science Specialization

I started this sequence in 2014 as a new R user.  Since then I’ve sent others through it.  I enjoyed it, but most others have not.  A true beginner who is not extremely self-motivated will need support through the rough patches in these courses.

Summary of feedback from beginners: “confusing and frustrating, I don’t see how this is relevant to my work.”

Pros:

  • Practice problems and homework.  If you don’t have your own headaches, the homework and quizzes will provide them.
  • Accountability.  The weekly deadlines kept me successfully prioritizing professional development.  Without them it would have fallen by the wayside.
  • Swirl.  Swirl is a simple, interactive way to learn basic aspects of R, like a pared-down version of the DataCamp interface (below).  Swirl modules are nicely embedded into the early courses here, with accountability!

Cons:

  • Poor sequencing of topics/curriculum.  Last I checked, these courses still started off with R’s object classes, the apply functions, loops, etc. instead of higher-yield approaches like the tidyverse.  Some courses have wide, shallow scopes, leading to time spent dealing with OAuth tokens and parsing JSON – but just once, enough to get frustrated but not learn.
    • They teach different R graphics approaches: ggplot2, base R, and lattice are all crammed into a few weeks.  Pick one – preferably ggplot2 – and teach it deeply.
  • Unconvincing headaches.  The potential for fun, motivating practice is so great with these topics, which made some of the lackluster Coursera assignments particularly disappointing.
    • An early assignment in R Programming  – this is the first challenging course and a gateway to the rest – asks students to write their first function: it will invert a matrix and cache the result.  This task is arcane, unmotivating, and difficult for the wrong reasons; its lack of similarity to the real-world problems my students want to solve with R was immensely frustrating.  (The hospital data wrangling functions were better).
    • The course project for linear regression has students create an explanatory model using a low-quality, tiny data set (the 32 row mtcars data.frame).  This was limited and inauthentic.
    • The machine learning course project had a row ID variable present in both the training and evaluation set, essentially giving away the answers when trying to predict what group a record came from.
  • Poor alignment between assignments and lectures.  Homework and quiz problems were often unrelated to the week’s lectures.  This was justified by, “you should be a hacker and go figure it out.”  That’s reasonable for some level of disconnect, but when there’s a total mismatch it just adds pain for R learners and renders the lectures irrelevant.

Additionally, the lecture quality is inconsistent and the peer community is plagued by plagiarism (I estimate 50% of assignments are plagiarized).

Overall, the Coursera JHU specialization is moderate reward, high pain.  It suited me, but failed to meet the needs of my beginner colleagues.

DataCamp

I’ve worked through parts of different courses to explore them, and have sent others through them.

Summary of feedback from beginners: “approachable, and I felt like I was getting it, but I wasn’t prepared to apply what we covered.”

Pros:

  • Good videos.  The combination of code and narration is easy to follow and the explanations are detailed.
  • Bite-sized pieces with interactive practice.  The in-browser programming interface is nice, as is the ability to get feedback and hints.

Cons:

  • Poor sequencing of topics/curriculum. Look at the intro to R and intermediate R courses.  Starting beginners off with matrices and lists?  Loops, control flow, lapply?  Woof.  To get around this, I tried to cherry-pick chapters of the different courses, which proved unsatisfactory.  I ended up sending people to the introduction course, then courses in dplyr, data cleaning, ggplot2, and R Markdown.
  • Limited practice opportunities.  Students told me they’d play along with the videos successfully, but then go to work in R weeks later and find they remembered little and had few examples to look back on.  Independent practice is missing from DataCamp.  (Coursera’s homework, quizzes, and projects are a major differentiator in this respect).
    • I ended up creating private practice projects to augment the DataCamp materials.
  • Heavily reliant on videos.  Not a problem if video works for you, but maybe you prefer other modes.

Previously, DataCamp had the additional “pro” of offering a generous non-profit discount, but this has unfortunately been discontinued.

Overall, DataCamp is low reward, moderately-low pain.

R For Data Science

Released in 2017, this textbook is freely available online.  I’ve browsed the book, and recommended sections to others, but haven’t worked through the complete text.  It’s too new for me to have feedback from users.

Pros:

  • Coherent topics and sequencing.  There’s a clear, opinionated philosophy that front-loads the highest-yield approaches (e.g., charting before data ingest).
  • Uses state-of-the-art packages.  It’s based in the tidyverse.  Its main author (Hadley Wickham, who maintains the tidyverse) and its scheme for staying updated (an open-source project where improvements are accepted as pull requests) mean that it will stay current with the rapidly-developing world of R tools.
  • It’s free and open.  No logging in or paying.

Cons:

  • Practice problems seem thin and lack supports.  Unlike the curricula above, there are no hints, forums, or accountability.  This makes sense for a textbook, but it will be critical that students obtain their own additional, deeper practice.
  • No credential.  Some learners find it rewarding to earn points, grades, and certificates.
  • Narrow focus on the essentials.  This is intentional, and stated in the book.  Data science encompasses a vast array of topics, and the book is better for staying in its lane.  But if you want to learn more about, say, inferential statistics, you’ll need to go elsewhere.

Overall, R for Data Science is moderate reward, low pain.

3. A coach or community to help you

The Secret Sauce: Coaching

Looking at the [professional] students who have gotten the most out of the trainings above, they have something in common: they had someone to guide them.  In some cases this was more intensive, with weekly pair-coding sessions and code reviews, but at a minimum most students who successfully climbed the learning curve had someone they were comfortable emailing or IM’ing with questions.  Sometimes the response would be “look at this StackOverflow post,” other times it would be hopping on a screenshare to debug.

Coaching helps both sides of the reward/pain ratio: you get guided to the most fruitful approaches and achieve your results faster, while capping the time you’re stuck on any one obstacle.  Our data analysts have increased exposure to coaching through regular code reviews (for feedback and improvement, not just an accuracy check) and through a chat channel dedicated to R, where users can share ideas and crowd-source answers.

Unfortunately, coaching doesn’t scale.  It’s time-intensive and many people teaching themselves R won’t have access to a more advanced user – though ask around your company or program.  But there are proxies for coaching:

  • The Coursera course forums and TAs
  • R friends you make at your local R meetup
  • The online R community, including #rstats on Twitter
  • StackOverflow (read existing questions and post new ones)

In closing: How I suggest learning R

  1. Get your own headache to solve – why do you want to learn R?
  2. Start trying to solve it, using tidyverse packages.
  3. Lean on your coach, peers, or online community when you need help.
  4. If you find you need to build your skill base before you can solve your own headaches:
    1. Work through R for Data Science with a friend.  Then:
    2. If you like videos and bite-sized practice, tell a more advanced R user about the headache you want to solve and ask them what chapters of DataCamp courses they think would be relevant.  You should pick and choose based on what’s useful in your personal work.
    3. If you want more in-depth practice problems, use the skills you learned from R for Data Science to work through the Coursera courses.
    4. Seek out more practice.  Follow along with worked problems explained by bloggers, work through the dplyr package vignettes, try an easy Kaggle competition.

This is an exciting time for teaching yourself R.  None of the curricular options I review here existed 5 years ago, nor did the suite of R packages I recommend starting with.  While they’re imperfect, I’m glad these options are available to the community, with low barriers to entry.  I appreciate the hard work that their creators put in and appreciate even more their continued improvement.

I hope this blog post becomes obsolete in the next 5 years, as the options for self-teaching R expand and mature along with the R ecosystem and community.
