
Reflections on five years of the janitor R package

One thing led to another. In early 2016, I was participating in discussions on the #rstats Twitter hashtag, a community for users of the R programming language. There, Andrew Martin and I met and realized we were both R users working in K-12 education. That chance interaction led to me attending a meeting of education data users that April in NYC.

Going through security at LaGuardia for my return flight, I chatted with Chris Haid about data science and R. Chris affirmed that I’d earned the right to call myself a “data scientist.” He also suggested that writing an R package wasn’t anything especially difficult.

My plane home that night was hours late. Fired up and with unexpected free time on my hands, I took a few little helper functions I’d written for data cleaning in R and made my initial commits in assembling them into my first software package, janitor, following Hilary Parker’s how-to guide.

That October, the janitor package was accepted to CRAN, the official public repository of R packages. I celebrated and set a goal of someday attaining 10,000 downloads.

Yesterday janitor logged its one millionth download, wildly exceeding my expectations. I thought I’d take this occasion to crunch some usage numbers and write some reflections. This post is sort of a baby book for the project, almost five years in.

By The Numbers

This chart shows daily downloads since the package’s first CRAN release. The upper line (red) is weekdays, the lower line (green) is weekends. Each vertical line represents a new version published on CRAN.

From the very beginning I was excited to have users, but this chart makes that exciting early usage seem minuscule. janitor’s most substantive updates were published in March 2018, April 2019, and April 2020, with it feeling more done each time, but most user adoption has occurred more recently than that. I guess I didn’t have to worry so much about breaking changes.

Another way to look at the growth is year-over-year downloads:

Year | Downloads | Ratio vs. Prior Year
2016-17 | 13,284 |
2017-18 | 47,304 | 3.56x
2018-19 | 161,411 | 3.41x
2019-20 | 397,390 | 2.46x
2020-21 (~5 months) | 383,595 |
Download counts are from the RStudio mirror, which does not represent all R user activity. That said, it’s the only available count and the standard measure of usage.
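
For anyone who wants to check the arithmetic, here is a minimal base R sketch that recomputes the ratio column from the download counts in the table above. The variable names are mine, chosen for illustration; the counts are copied from the table.

```r
# Yearly download counts for the four complete years, copied from the table
downloads <- c(
  "2016-17" = 13284,
  "2017-18" = 47304,
  "2018-19" = 161411,
  "2019-20" = 397390
)

# Ratio of each year to the year before it (the first year has no prior year)
ratio <- downloads[-1] / downloads[-length(downloads)]
round(ratio, 2)
# 3.56, 3.41, 2.46

# Adding the partial 2020-21 count gives the running total
total <- sum(downloads) + 383595
total
# 1002984
```

The total lands just past one million, which matches the milestone this post is celebrating.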

This project has really taken off. janitor took almost four and a half years on CRAN to hit its millionth download — but also might log a million downloads in 2021 alone.

This growth can’t go on forever, but I don’t think janitor’s adoption is yet approaching a saturation point among R users. And there was likely an overall increase in the total number of R users during this period, too. For comparison, the popular dplyr package, on which janitor is built, saw a download count in 2019-20 that was 1.52x its 2018-19 total (this is apples-to-apples with the 2.46x figure in the table above). I see the pool of potential janitor users as a subset of dplyr users; this pair of numbers tells me both that janitor is increasing its adoption share within that group and that the size of the overall pool of potential users is growing.

In terms of relative popularity, I gathered download counts for the 17,235 packages on CRAN for the first two months of this year, January 01 2021 through March 01 2021. janitor’s 159,945 downloads ranked it 286th, putting it in the 98th percentile. (Here’s my code for this blog post; the package downloads count part comes from this post by Josiah Parry).
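
The percentile figure follows directly from the rank and the package count quoted above. A quick sketch in base R, with illustrative variable names:

```r
n_packages   <- 17235  # packages on CRAN in the comparison
janitor_rank <- 286    # janitor's rank by downloads in that window

# Share of packages that rank below janitor
share_below <- (n_packages - janitor_rank) / n_packages
floor(100 * share_below)
# 98
```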

Two other metrics have tracked neatly with downloads: on Github, the project sits just short of 1,000 stars and 100 forks.

Highlights

Some of my favorite parts of this journey have been:

Making a sticker. I admired the tidyverse hex stickers and celebrated janitor’s 2016 CRAN release by commissioning a janitor sticker. It felt realer to have a tangible symbol of its existence. Here are the first designs:

And here is the final product in the wild:

I still have stickers left – let me know if you want one!

Attending rOpenSci’s unconference 2017. I learned a ton, wrote some new code, met delightful people, swapped stickers, and got good career advice (a PhD was indeed not the right move for me). And I began the big rewrite for janitor 1.0 on the plane ride back from Los Angeles. (I may be concerned about flying but will say I feel most productive on planes. And trains: I wrote hours of janitor code on the train back from Toronto in January 2018. More examination later of when I found time for janitor).

Figuring out how to store the underlying data in a tabyl so that all the adorn_ functions can retrieve it. I am proud of this technical breakthrough and am a little disappointed no one has ever asked me how it works. *I* think it feels like magic to call adorn_ns().

Learning software development concepts. I’ve learned some fundamentals of software development, including version control (git), unit testing, continuous integration (CI), and semantic versioning. The git skills in particular have been a help on other projects.

Helping others. Most of all, it warms my heart to hear from a new R user – say, someone completing a PhD in biology – how the janitor functions are saving them time. If there have been 100,000 janitor users and it has saved them an average of one hour apiece, that would be roughly eleven years of user frustration saved!

Sticking with it. This project has progressed one small improvement at a time. Five years of small improvements have added up! This may be the most total time I’ve ever worked on a project or effort.

Collaborating with people virtually. janitor has received code contributions from 24 different people around the world, plus intangible contributions from dozens of others in the form of feedback, requests, and bug reports. It’s evidence of the human tendencies of creativity and working toward the common good. On that note:

Thank Yous

This couldn’t have happened without help from so many. In somewhat chronological order, thank you to:

  • Alex Spurrier, who got me started with both R and the tidyverse (back then it was the “Hadleyverse”) in 2014 and Adam Maier, who helped me climb the R learning curve.
  • Early encouragers and collaborators like Chris Haid, Andrew Martin, Ryan Knight, and Jake Russ.
  • Bill Denney, who has improved janitor more than anyone else (more on that below).
  • Everyone who shared love on social media, via email, or at unconf17, especially early on. Knowing that people were using janitor kept me invested.
  • Users who filed a bug report or feature request or submitted a pull request. Or who contributed ideas or opinions in Github discussions. Malte Grosser and Jonathan Zadra have exemplified this kind of contribution.
  • TNTP, where I worked until last month. I’m grateful to both my colleagues, who were core users of janitor and who created a nurturing environment, and my managers, who indulged my work on the project and were excited for me, even though open-source software development was not obviously germane to our mission.
  • The tidyverse developer team at RStudio. janitor is built on their excellent packages and would not be possible without their work. Their team has been encouraging and kind, humoring janitor’s aspirations to be a friend of the tidyverse.
  • rOpenSci for fostering a helpful, friendly community. I’m a scientist at heart, if not a credentialed one, and I enjoy being part of the open science movement.
  • Most of all, thanks to family members who cared for my kids and freed me up for periods of deep work. They’ve been supportive from the beginning: janitor’s first CRAN release made it into my in-laws’ 2016 holiday letter.

Especially huge thanks to Bill Denney. Bill and I met over Github one day when he suggested a feature in 2018. That’s not so unusual. What is unusual is that he also submitted a well-programmed implementation of it, and was receptive to feedback, and nailed all the details. As we worked more together, he suggested more features, responded to new issues from others, resolved thankless problems related to timezones and encoding, and wrote a ton of sharp code.

We are complementary partners. I have a vision for the user’s experience, but my code is not always concise or fast, and I can be flighty. Bill is a superlative programmer and responsive project participant. Especially as my time spent on janitor grew more scarce, Bill’s knack for writing clean, thorough code was invaluable in making some nice features happen. In particular, the current behavior of clean_names (like how it handles non-ASCII characters, especially important for those working in a language other than English) and the function compare_df_cols (which was a work in progress for three years) would not have been possible without Bill. When we finished version 2.0.0 last April, I felt some relief. COVID-19 uncertainty was rampant and I found it reassuring that if things went south for me, at least this project had come to rest in a good place.

Working with Bill on janitor has, to me, been the internet at its best. Our collaboration arose organically, built through contributions big and small, and almost all of our communication has happened on Github. It feels like being penpals, but for making software. The world of free, open-source software (FOSS) has its issues but its best parts shine as examples of humanity’s goodness and potential.

The Personal Give and Take

What I put in and got out of this project is complicated. Looking at my Github contribution graph, the days where I’ve made the most progress on janitor were all holidays (Passover, Christmas) or vacations (at the beach in North Carolina or a cabin in Ontario) or long weekends. That is, days when I had no work obligations and where my children were playing with their grandparents.

Was I doing uncompensated labor? Or pursuing a hobby for pleasure in my free time? Definitely more of the latter, but it wasn’t always clear. I was working until midnight on vacation, but by choice, chasing the intellectual pursuit and satisfaction of problem-solving.

There were times when, out of a sense of obligation, I reluctantly snuck away from my family to finish a release. For instance, this Christmas I spent hours closing some lingering minor issues for an upcoming release. This project has mostly been fun intellectual challenges, but it has its share of tedious tasks, like documentation updates or navigating a CRAN release. I think that’s inevitable, though; those details are what make janitor a tool the world can use rather than a half-baked personal lark. And since weekends and holidays are when I have time to myself, that’s when I grind it out.

As to personal gains from this work: yes, I hoped this skill development and demonstration would pay off career-wise, but the primary motivator was intrinsic. I’m not sure how much janitor opened doors professionally, but the skills I picked up from working on it changed how I do my work. That growth kept me from feeling like I was stagnating professionally. And it led to other projects like the R packages surveymonkey and tntpr.

Looking back, the free time spent on janitor was mostly pleasurable. Going forward, though, there are other things I’d rather do in my limited free time. That works out, as janitor is pretty stable now and doesn’t require as much from me. I don’t see the need for another major release … but I’ve said that before and been wrong, so I also wouldn’t be surprised to be prepping for another CRAN submission over the holidays this coming winter.

Why It Worked

I think the janitor package found success because I was firmly grounded in the beginner experience. When you’re new to R, it can be really hard to do simple things. That was even more the case in 2014, when intro courses pointed me toward indecipherable functions like tapply(). When I started janitor, I’d recently been a new R user myself and coached data analysts at my workplace as they switched from Excel, SPSS, and Stata to R. I could see the common pain points: why spend a half-hour renaming variables in R when you could edit the raw data in Excel? Why is it so complicated to crosstab two variables?

janitor’s first and still-best function, clean_names(), started off very simple (it’s now quite complex under the hood): make the text lowercase and replace spaces and special characters with underscores. Veteran R users would find this task easy, and many had their own versions of this function. What was missing was packaging it up for beginners for whom this was a stumbling block.
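
To make that original idea concrete, here is a rough base R sketch of the simple behavior described above: lowercase the text, then replace spaces and special characters with underscores. It is not janitor's actual code, which does considerably more; the function name simple_clean_names() and the example column names are made up for illustration.

```r
# A toy version of the "very simple" idea, NOT the real clean_names()
simple_clean_names <- function(x) {
  x <- tolower(x)                  # make the text lowercase
  x <- gsub("[^a-z0-9]+", "_", x)  # runs of spaces/special characters become one underscore
  x <- gsub("^_+|_+$", "", x)      # drop underscores left at the start or end
  x
}

messy <- c("First Name", "% Complete", "Total ($)")
simple_clean_names(messy)
# "first_name" "complete" "total"
```

On a data frame, the same helper could be applied with names(df) <- simple_clean_names(names(df)). The real function handles many cases this sketch ignores, such as duplicate names and non-ASCII characters.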

Coaching others, and constantly cleaning dirty data for years as a consultant, kept these functions evolving. The total overhaul of tabyl() for version 1.0 grew out of a conversation with a colleague who wanted to mix row-wise percentages with column-wise totals. Debugging round_half_up() with another colleague was the impetus behind the release of version 2.1. Requests and bugs from users, often beginners, were another key voice shaping the package’s growth.
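
For readers wondering why a function like round_half_up() is needed at all: base R's round() follows the IEC 60559 convention of rounding exact halves to the nearest even digit, which surprises people coming from Excel. Below is a minimal sketch of the half-up idea in base R. It is my own illustration under the name round_half_up_sketch(), not the package's implementation, and it makes no attempt to handle numbers that binary floating point cannot represent exactly.

```r
# Illustrative only: round exact halves away from zero
round_half_up_sketch <- function(x, digits = 0) {
  scale <- 10^digits
  sign(x) * floor(abs(x) * scale + 0.5) / scale
}

halves <- c(0.5, 1.5, 2.5)

round(halves)                 # base R rounds halves to even
# 0 2 2

round_half_up_sketch(halves)  # halves always round up in magnitude
# 1 2 3
```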

janitor in the wild

I’ll close with a couple of my favorite examples of janitor being used. This is the part that feels the most magical: the tools I created are helping people do good work!

Sophie Beiers, a data journalist at the ACLU, gave a great talk at rstudio::global 2021, Trial and Error in Data Viz at the ACLU. She describes several projects, then walks through the process of an ACLU study that made the case for releasing incarcerated people to keep them safe from COVID-19. Her team studied data from the early months of the pandemic and found that communities who freed more people saw no increase in crime.

This project required processing “countless .csvs and Excel files” from dozens of locales. And, Sophie said, janitor was instrumental in that effort. Hooray! This was exactly janitor’s original use case, cleaning and exploring many dirty data files, and it’s gratifying to see the tool being used to expedite such righteous efforts.

From Sophie’s talk: a glimpse of a particularly gnarly tabyl

I also must include this amazing artwork by data science illustrator Allison Horst, depicting clean_names() in action:

Best documentation ever

To janitor’s last five years and its next five

I managed to keep doing something for long enough that I ended up with something good to show for it. janitor went from being a tiny, rapidly-evolving project that took up a lot of my time to a mature project that has stabilized but still keeps improving incrementally.

At five years in, it’s as widely used as ever, even as I’m learning a new job and using R less. I plan to keep up with maintenance, out of a sense of duty and because I’m still a user myself. (Just yesterday I wondered, could we write a function to find and extract all tables scattered within a spreadsheet?) But I’m glad janitor’s in this established stage where it works and is well-documented and I can mostly sit back and blog about it.

One reply on “Reflections on five years of the janitor R package”

Dude. Thank you so much for this package. clean_names and tabyl are completely instrumental and saved me dozens of hours. The package is included in our introductory R courses in my company so you should definitely feel proud of what you’ve achieved here.

Thank you very much.
