Here’s an update on my journey deploying Apache Superset and using Docker. It felt good to write about what I’ve learned at work, and it’s been a while: I don’t think I posted here that my job title is now Senior Data Engineer! But people who don’t work on computer infrastructure might not find this one interesting.
In 2023, we deployed Apache Superset at the City of Ann Arbor as our Business Intelligence (BI) / data visualization platform, choosing it over Microsoft Power BI or Metabase. That decision has been a resounding success. Superset is a rock-solid product that keeps getting better … and* we’ve saved over $150k and counting in license costs vs. proprietary software.
I’ve re-read the 2015 talk “Choose Boring Technology” a couple of times while working in my current job (that link goes to a slideshow version of the talk, turned into a website – that’s the format I’ve experienced it in). The author talks about having only three innovation tokens to spend on new tech at a given time. The rest has to be boring. Then when you’ve mastered the new tech, you get a token back to spend on something else.
Deploying Superset took all my tokens: Docker, DevOps, Linux sysadmin. I said we would only deploy official Docker images released by the Superset project. No way were we in the business of creating our own; this was complicated enough already.
I learned a ton in the intervening years. I’m still learning a ton. It’s great! As I’ve gotten those tokens back by becoming competent at those technologies, I’ve been able to do more with our Superset deployment.
At first, that looked like building our own Superset Docker image: tweaking the environment but not touching the code. The project forced our hand on this because, starting in 4.1.0, it no longer included the basic drivers needed to use Superset out of the box, most notably the one for connecting to the PostgreSQL backend database. I’m still not entirely convinced this was the right choice for the project, but I see the other side’s argument that everyone really ought to be building their own image.
So I stepped into my role as a member of the Superset Project Management Committee (PMC) and wrote the docs for how to do that. If I was going to have to figure this out, I might as well help others make the jump too so they could stay with Superset (and if anything was wrong in my approach, others would correct it!).
That wasn’t too complicated, as we were simply extending an official release of Superset. You can see the Dockerfile example is pretty short: it installs some drivers and Python packages that I need in my Superset deployment for single sign-on authentication, connecting to Microsoft SQL Server, and so on.
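A minimal sketch of what that kind of extension Dockerfile can look like — the release tag and the exact package list here are illustrative assumptions, not our actual file; pick whatever drivers your deployment needs:

```dockerfile
# Hypothetical sketch: extend an official Superset release image
# rather than building one from scratch.
FROM apache/superset:4.1.1

# The official image runs as a non-root user; switch to root to install packages.
USER root

# psycopg2-binary: PostgreSQL metadata-database driver (no longer bundled as of 4.1.0)
# pymssql:        Microsoft SQL Server connectivity
# Authlib:        a common choice for OAuth/OIDC single sign-on
RUN pip install --no-cache-dir psycopg2-binary pymssql Authlib

# Drop back to the unprivileged user the base image expects.
USER superset
```

Because the `FROM` line pins an exact release, rebuilding the image for a new Superset version is just a one-line change plus a rebuild.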
We’ve been going like that through multiple releases and I’ve gotten my innovation token back. This week I went to the next level: deploying our own tiny fork of Superset.
I say “tiny” because it only contains code present in the upstream Superset repository. Specifically, I checked out the repository at the commit for the official 6.0.0 release, then cherry-picked one standalone fix for a bug that’s bothering us where HTML tags don’t get parsed and show up in a table.
That fix will be present in the next official version, 6.0.1, but since we are already building our own Docker image, it seemed easy enough to cherry-pick this fix and take a little step down the path of forking. Then I can drop the fork when 6.0.1 comes out and deploy it as-is.
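In concrete terms, the workflow is roughly the following — a sketch, not our exact commands, and the commit SHA is a placeholder for whatever upstream merged:

```shell
# Sketch of the "tiny fork" workflow: a branch off the release tag
# plus one cherry-picked upstream fix.
git clone https://github.com/apache/superset.git
cd superset

# Start a branch from the exact commit of the official 6.0.0 release tag
git checkout -b city-6.0.0 6.0.0

# Bring over just the one upstream bugfix commit
git cherry-pick <sha-of-the-html-parsing-fix>

# Build our image from the patched tree (image name is made up)
docker build -t superset-city:6.0.0-patched .
```

When 6.0.1 ships with the fix included, the branch and the cherry-pick can simply be thrown away.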
We’re still not writing custom code that isn’t in the main branch of Superset and deploying that. I can easily see how trying to maintain custom code through multiple upstream releases can get perilous and anyway, we don’t have a pressing need for any customization that can’t be sent back to the main Superset repository.
On that note, all of the improvements I’ve made to the Superset codebase have been sent to the main project as pull requests (PRs). To get them into our environment, I’ve waited for each one to merge, then for the next official version to be released, before the fix or enhancement appeared in our Superset instance.
Now maybe I can start cherrying my own PRs into our custom image? For instance, this tiny fix I made in 2025 where I added more options to Superset’s boxplot chart at the request of the head scientist at Ann Arbor’s water treatment plant. I could have added that to our built image immediately without waiting for an official release. I’ll have to restrict myself to atomic PRs whose code I understand so as not to bite off unintended consequences, but I should be able to cherry most bug fixes this way.
Running our own Docker image of Superset feels practical at this point, not risky. I’m avoiding the situation of trying to maintain custom code indefinitely in a fork and I don’t think it will come to that. I don’t see a business need at the city such that we would build a custom feature that isn’t appropriate for contribution to the main Superset project.
But who knows. I said we wouldn’t build our own image and now we’re not only building an image, we’re building it off a (modest) fork. I’ve learned I can’t project more than a few months out what kind of technology choices will make sense for us. Tech changes fast, but mostly it’s about me earning my innovation tokens back.
This kind of flexibility is part of why I love using open-source software. I can rearrange things as needed to meet my organization’s needs, looking at the code myself and evaluating which pieces I want right now. With proprietary software, I’d be stuck hoping that the bug fixes and features in the new version outweigh the new bugs I’d have to live with through the next release cycle.
The last thing I’ll say is that I love Docker containers as a technology: a great impact-to-complexity ratio. I became familiar with Docker as a side effect of deploying Superset, and now it’s essential to data engineering at the city.
All of my data ETL jobs run in containers with set versions of Python (and R) packages, making it possible to gradually migrate each task to newer packages and tools without the breakage that might occur if I had to upgrade a single environment on a host VM.
I highly recommend trying out Docker, particularly Docker Compose, if you have a chance to deploy software that way. This applies to work as well as to personal side projects, particularly if you are leaning on AI coding assistants. With the speed at which those tools make alterations and deploy new services, Docker is nice for containing and sandboxing things. E.g., instead of installing a one-off Postgres DB on your host, run it as a container.
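For instance, a throwaway Postgres for a side project might look like this hypothetical `docker-compose.yml` — service name, credentials, and volume name are all made up:

```yaml
# Hypothetical docker-compose.yml: a sandboxed Postgres instead of a host install
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: dev
      POSTGRES_PASSWORD: dev-only-password   # never use a throwaway password in production
      POSTGRES_DB: scratch
    ports:
      - "5432:5432"        # expose to the host for local tools
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:   # named volume keeps data across container restarts
```

`docker compose up -d` brings it up, and `docker compose down -v` removes it, data and all, leaving your host untouched.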
UPDATE: it didn’t take long for this decision to pay off again. It’s eleven days later and today I cherried and deployed a bigger bugfix, where hitting “Refresh Dashboard” on a dashboard with nested tabs crashes Superset. This has brought down our whole Superset instance a few times and today was the last straw. I cherry-picked #37018 and now Superset is no longer crashing!
Footnote
*I used the ellipsis dots “…” instead of an em-dash in that sentence to signal that this is an entirely human-written post, no AI involved. Editing out my em-dashes makes me sad but they’re poisoned. Even as a long-time em-dash lover, I am skeptical when I see someone else using them, so I’m ceding the ground to LLMs. At least for now.