
Running Our Own Fork of Apache Superset in Production

Here’s an update on my journey deploying Apache Superset and using Docker. It felt good to write about what I’ve learned at work, and it’s been a while since I last did that. I don’t think I even posted here that my job title is now Senior Data Engineer! Fair warning, though: people who don’t work on computer infrastructure might not find this one interesting.

In 2023, we deployed Apache Superset at the City of Ann Arbor as our Business Intelligence (BI) / data visualization platform, choosing it over Microsoft Power BI and Metabase. That decision has been a resounding success. Superset is a rock-solid product that keeps getting better, and we’ve saved over $150k and counting in license costs versus proprietary software.

I’ve re-read Dan McKinley’s 2015 talk “Choose Boring Technology” a couple of times while working in my current job (that link goes to a slideshow version of the talk turned into a website, which is the format I’ve experienced it in). McKinley talks about having only three innovation tokens to spend on new tech at a given time; the rest has to be boring. Then, when you’ve mastered the new tech, you get a token back to spend on something else.

Deploying Superset took all my tokens: Docker, DevOps, Linux sysadmin. I said we would only deploy official Docker images released by the Superset project; no way were we getting into the business of creating our own. This was complicated enough already.

I learned a ton in the intervening years. I’m still learning a ton. It’s great! As I’ve gotten those tokens back by becoming competent at those technologies, I’ve been able to do more with our Superset deployment.

At first, that looked like building our own Superset Docker image, tweaking the environment but not touching the code. The project forced our hand here: starting with 4.1.0, the official image no longer included basic drivers needed to run Superset out of the box, most notably the one for connecting to its PostgreSQL backend database. I’m still not entirely convinced this was the right choice for the project, but I see the other side’s argument that everyone really ought to be building their own image.
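
For a flavor of what that involves, here’s a minimal sketch of such an image (our real one layers in more drivers and configuration, and psycopg2-binary here stands in for whatever drivers a given deployment needs):

    # Build on the official image published by the Superset project.
    FROM apache/superset:4.1.0

    # Switch to root to install packages, then drop back to the
    # unprivileged user the official image runs as.
    USER root

    # 4.1.0 and later no longer bundle the driver Superset needs to
    # reach its own PostgreSQL metadata database, so add it here.
    RUN pip install --no-cache-dir psycopg2-binary

    USER superset

Keeping our customization to a thin layer on top of the official image means a version bump is usually as simple as changing the FROM line.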


LLMs are good coders, useless writers

My writer friends say Large Language Models (LLMs) like ChatGPT and Bard are overhyped and useless. My software developer friends say they’re a valuable tool, so much so that some pay out of pocket for ChatGPT Plus. They’re both correct: the writing LLMs spew is pointless at best and pernicious at worst, yet coding with them has become an exciting part of my job as a data analyst.

Here I share a few concrete examples where they’ve shined for me at work and ruminate on why they’re good at coding but of limited use in writing. Compared to the general public, computer programmers are much more convinced of the potential of so-called Generative AI models. Perhaps these examples will help explain that difference.

Example 1: Finding a typo in my code

I was getting a generic error message from running a command, one whose Google results were not helpful. My prompt to Bard:

Bard told me I had a “significant issue”:

Yep! So trivial, but I wasn’t seeing it. It also suggested a styling change and, conveniently, gave me back the fixed code so that I could copy-paste it instead of correcting my typos. Here the LLM was able to work with my unique situation when StackOverflow and web searches were not helping. I like that the LLM can audit my code.
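
To give a flavor of the kind of slip I mean (a made-up Python example, not my actual code), picture a one-character typo that’s invisible when you’re staring at your own code:

    # Hypothetical: totaling payments by department.
    payments = [("parks", 1200.00), ("water", 840.50), ("parks", 310.25)]

    totals = {}
    for dept, amount in payments:
        # The buggy version read: totals.get(dept, 0) + amout
        # Python reports only "NameError: name 'amout' is not defined";
        # an LLM shown the whole loop flags the typo immediately.
        totals[dept] = totals.get(dept, 0) + amount

    print(totals)  # {'parks': 1510.25, 'water': 840.5}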

Example 2: Writing a SQL query

Today I started writing a query to check an assumption about my data. I could see that in translating my thoughts directly to code, I was getting long-winded, already on my third CTE (common table expression). There had to be a simpler way. I described my problem to Bard and it delivered.

My prompt:

Bard replied:
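
As a hypothetical illustration of that kind of collapse (made-up table and columns, not our actual query), compare stacked CTEs computing a running total with a single window function:

    -- Long-winded: stacking CTEs to get a running total of invoices.
    WITH daily AS (
        SELECT invoice_date, SUM(amount) AS day_total
        FROM invoices
        GROUP BY invoice_date
    ),
    numbered AS (
        SELECT invoice_date, day_total,
               ROW_NUMBER() OVER (ORDER BY invoice_date) AS rn
        FROM daily
    )
    SELECT n.invoice_date,
           (SELECT SUM(m.day_total) FROM numbered m WHERE m.rn <= n.rn) AS running_total
    FROM numbered n;

    -- Simpler: one window function over the grouped rows does it all.
    SELECT invoice_date,
           SUM(SUM(amount)) OVER (ORDER BY invoice_date) AS running_total
    FROM invoices
    GROUP BY invoice_date;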


Same Developer, New Stack

I’ve been fortunate to work with and on open-source software this year. That has been the case for most of a decade: I began using R in 2014. I hit a few milestones this summer that got me thinking about my OSS journey.

I became a committer on the Apache Superset project. I’ve written previously about deploying Superset at work as the City of Ann Arbor’s data visualization platform. The codebase (Python and JavaScript) was totally new to me, but I’ve been active in the community and have helped update the documentation.

Those contributions were sufficient to get me voted in as a committer on the project. It’s a nice recognition and vote of confidence but more importantly gives me tools to have a greater impact. And I’m taking baby steps toward learning Superset’s backend. Yesterday I made my first contribution to the codebase, fixing a small bug just in time for the next major release.

Superset has great momentum and a pleasant and involved (and growing!) community. It’s a great piece of software to use daily and I look forward to being a part of the project for the foreseeable future.

I used pyjanitor for the first time today. I had known of pyjanitor’s existence for years, but only from afar. It started off as a Python port of my janitor R package, then grew to encompass other functionality. My janitor is written for beginners, and that came full circle today as I, a true Python beginner, used pyjanitor to wrangle some data. That was satisfying, though I’m such a Python rookie that I struggled to import the dang package.
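
For the curious, a minimal sketch of that kind of wrangling (made-up data; clean_names() is the function carried over most directly from the R package):

    import pandas as pd
    import janitor  # pyjanitor: importing it registers extra DataFrame methods

    # Made-up example with the messy headers clean_names() exists to fix.
    df = pd.DataFrame({
        "First Name": ["Ada", "Grace"],
        "Hire Date": ["2021-06-01", "2021-07-15"],
    })

    # Headers become snake_case: ['first_name', 'hire_date']
    print(df.clean_names().columns.tolist())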


Making the Switch to Apache Superset

This is the story of how the City of Ann Arbor adopted Apache Superset as its business intelligence (BI) platform. Superset has been a superior product for both creators and consumers of our data dashboards and saves us 94% in costs compared to our prior solution.

Background

As the City of Ann Arbor’s data analyst, I spend a lot of time building charts and dashboards in our business intelligence / data visualization platform. When I started the job in 2021, we were halfway through a contract and I used that existing software as I completed my initial data reporting projects.

After using it for a year, I was feeling its pain points. Building dashboards was a cumbersome and finicky process and my customers wanted more flexible and aesthetically-pleasing results. I began searching for something better.

Being a government entity makes software procurement tricky – we can’t just shop and buy. Our prior BI platform was obtained via a long Request for Proposals (RFP) process. This time I wanted to try out products to make sure they would perform as expected. Will it work with our data warehouse? Can we embed charts in our public-facing webpages?

The desire to try before buying led me to consider open-source options as well as products that we already had access to through existing contracts (i.e., Microsoft Power BI).