Taking Care of Tile Farm

Tile Farm, our map rendering server, has undergone a quiet overhaul during the past three months to deal with wild swings in demand. In March, we released the worldwide Watercolor maps to bring the number of visual designs at maps.stamen.com to three. With great press and kind words comes unprecedented demand, so I’ve been working with Tethr’s Aaron Huslage to change the way we use geographic data to handle the load. We’ve learned a few things along the way that will be important features of Stamen’s future mapping work.

The general theme of our optimization work has been to speed up response times and shrink problems. Tiles were being rendered reliably most of the time, but the overall experience was sluggish and unresponsive. Sometimes, the watercolor tiles would fail altogether. System load was weirdly high, and we didn’t have a reliable way to understand what we were seeing.

Repair #1: Postgres and Imposm

Looking into the system, Aaron noticed that several common map feature queries were taking an unusually long time for Postgres to process, even after basic database tuning to use memory and cache more effectively. Our first significant move was to shrink the cost of those queries, which typically involved large features at low zoom levels.

We found queries for lakes, forests, riverbanks and similar features that introduced massive overhead due to the number and complexity of their shapes. After doing a bit of research into simplifying and filtering geometries, I remembered Oliver Tonnhofer’s Imposm, mentioned by AJ Ashton in his 2011 State Of The Map presentation on TileMill. Imposm is an OpenStreetMap data importer and an alternative to the older Osm2pgsql. In addition to meeting our need for simpler, fewer shapes through its GeneralizedTable feature, it also answered a bunch of questions we didn’t know we had yet.

The first unexpected bonus from using Imposm was that we could create custom tables holding exactly the data needed for a particular rendering task. Instead of the resource-sucking Postgres views onto generic point/line/polygon tables offered by High Road, Imposm allows us to create a larger number of rendering-specific tables, each containing only what we need.

The second unexpected bonus is that Imposm is written completely in Python, so it’s trivial to add new post-processing steps as data is imported into the database. For example, we’re starting to get rid of long, confusing regular expressions for abbreviating street names and replacing them with procedural code that can be more easily customized and shared.
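To give a flavor of what procedural abbreviation can look like, here is a minimal sketch rather than our actual import code. The abbreviation tables, the rule about leading directions, and the function name abbreviate_street_name are all assumptions made up for illustration, and the hook that would call it during an Imposm import is left out.

```python
# Hypothetical abbreviation tables; the real rules are not shown in this post.
SUFFIXES = {
    "Street": "St.",
    "Avenue": "Ave.",
    "Boulevard": "Blvd.",
    "Road": "Rd.",
}
DIRECTIONS = {"North": "N.", "South": "S.", "East": "E.", "West": "W."}


def abbreviate_street_name(name):
    """Shorten a street name word by word instead of with one big regex."""
    words = name.split()

    # Single-word names such as "Broadway" are left alone.
    if len(words) < 2:
        return name

    # Only abbreviate a leading direction when a proper name follows it,
    # so "North Street" keeps its only distinguishing word.
    if len(words) > 2 and words[0] in DIRECTIONS:
        words[0] = DIRECTIONS[words[0]]

    # Abbreviate the trailing street type, if we recognize it.
    if words[-1] in SUFFIXES:
        words[-1] = SUFFIXES[words[-1]]

    return " ".join(words)


if __name__ == "__main__":
    for example in ("North Main Street", "North Street", "Broadway"):
        print(example, "->", abbreviate_street_name(example))
```

Each rule is an ordinary if statement, so adding a special case means adding a few readable lines instead of growing a regular expression.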

The most pleasant surprise from Imposm is that it’s actually much faster to run than Osm2pgsql, thanks to a concurrent design that divides the import into parallel processes.

Repair #2: More, smaller Gunicorns

Tile Farm uses a WSGI server called Gunicorn to host instances of TileStache. A herd of Gunicorn processes is hidden behind the webserver Nginx, which protects them and delivers requests. This is a common arrangement, and in our first design for Tile Farm we had a single Gunicorn configuration with all of our TileStache settings. With a rush of visitors to the site upon launch and the unique design of the Watercolor layer, we had to perform a number of fast, targeted interventions to improve performance and keep up with sudden demand. Watercolor in particular presented some debugging challenges, and we found that it was important to test changes to it in isolation from the other styles.

After working with a separate Gunicorn configuration for just Watercolor on a different port, it was clear that it would be easier to extend the same setup to the remaining Toner and Terrain styles, each of which is actually built up from a number of sub-styles composited together after rendering in Mapnik. The TileStache configuration is now split, with one for Toner, one for Terrain, one for Watercolor and a few others. Instead of interfering with the entire service when performing maintenance, we can now modify individual styles and lower the considerable cost of restarts. Unlike Apache processes, Gunicorns can be hard to kill, so it’s important to be able to target them more narrowly. The primary drawback to this approach is complexity: each Gunicorn/TileStache setup runs on a separate port, governed by a separate startup script with its own logfile and other reporting. Another drawback is hundreds of persistent connections from Mapnik to PostGIS left open at all times, which Aaron assures me is weird and wrong. Still, the ability to isolate problems and even permit limited crashiness in construction areas has been liberating.
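The one-setup-per-style arrangement can be sketched roughly as follows. This is an illustration and not our real startup scripts: the port numbers, worker count, file paths and helper names are all hypothetical, and only the three named styles are listed. The WSGITileServer application string is the form TileStache documents for running under Gunicorn.

```python
# Styles named in this post; the real setup has "a few others" as well.
STYLES = ["toner", "terrain", "watercolor"]

# Hypothetical first port; each style gets the next port in sequence.
BASE_PORT = 8001


def gunicorn_command(style, port, workers=4):
    """Build the argument list for one style's Gunicorn/TileStache server.

    All paths below are made-up placeholders.
    """
    return [
        "gunicorn",
        "--bind", "127.0.0.1:%d" % port,
        "--workers", str(workers),
        "--pid", "/var/run/tilefarm/%s.pid" % style,
        "--log-file", "/var/log/tilefarm/%s.log" % style,
        "TileStache:WSGITileServer('/etc/tilefarm/%s.cfg')" % style,
    ]


def all_commands(styles=STYLES, base_port=BASE_PORT):
    """Map each style to its own command, port, pidfile and logfile."""
    return {
        style: gunicorn_command(style, base_port + offset)
        for offset, style in enumerate(styles)
    }


if __name__ == "__main__":
    for style, command in all_commands().items():
        print(style, ":", " ".join(command))
```

With one pidfile per style, restarting Watercolor touches only Watercolor, and Nginx can route each style's URL prefix to its own port.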

Repair #3: The Gunicorn Slayer

I mentioned above that Watercolor has its own special snowflake problems. Processes rendering these maps would eventually become unresponsive, not replying to requests and not giving up their slots to be replaced by their parent Gunicorn server. We looked at a number of possible causes and fixed each: watercolor texture bitmaps are now loaded globally instead of per request, and we no longer use metatiles since the CPU cost of Watercolor is linear with bitmap size. Debugging this was confusing, then frustrating, then finally a waste of time. We settled on a more drastic solution and built the Gunicorn Slayer.

The Gunicorn Slayer is inspired by the Netflix Chaos Monkey, “a wild monkey with a weapon in your data center to randomly shoot down instances and chew through cables.” Admittedly less chaotic, ours borrows a strategy from Logan’s Run and retires any Gunicorn process older than a few hours.
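The retirement rule itself is simple enough to sketch in a few lines. This is a minimal illustration and not the Slayer’s actual source: the three-hour cutoff, the function names, and the idea of passing in a dictionary of process start times are all assumptions, and the step that discovers worker PIDs and their start times (from ps, for example) is left out.

```python
import os
import signal

# Assumption: "a few hours" is taken to mean three hours here.
MAX_AGE_SECONDS = 3 * 60 * 60


def find_expired(start_times, now, max_age=MAX_AGE_SECONDS):
    """Return the PIDs whose processes are older than max_age seconds.

    start_times maps PID -> start time in seconds since the epoch.
    """
    return sorted(
        pid for pid, started in start_times.items()
        if now - started > max_age
    )


def slay(pids, sig=signal.SIGTERM, kill=os.kill):
    """Send sig to each PID and return the ones that were signalled.

    The kill argument exists so the function can be exercised without
    signalling real processes.
    """
    slain = []
    for pid in pids:
        try:
            kill(pid, sig)
            slain.append(pid)
        except ProcessLookupError:
            # The process exited on its own before we got to it.
            pass
    return slain


if __name__ == "__main__":
    # Made-up PIDs and timestamps, with a stand-in for os.kill.
    workers = {101: 0, 102: 9000, 103: 10000}
    expired = find_expired(workers, now=10900, max_age=10800)
    print(slay(expired, kill=lambda pid, sig: None))
```

Run from cron every few minutes, something along these lines lets the parent Gunicorn replace each retired worker with a fresh one. A worker that is truly wedged may ignore SIGTERM, in which case signal.SIGKILL is the blunter option.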

This last fix still feels somewhat dirty since we were unable to determine the root cause of the problem. If we were to revisit it all again, an excellent post from Pinterest’s engineering blog gives some great hints on repurposing process titles and POSIX signals to gain some visibility into a running process.

Then And Now

Maintaining reliability for a project like maps.stamen.com is always going to be a long game of catch-up. Our goal has been to make the project slightly easier to ignore every day, by increasing reliability and finding and fixing trouble spots. The fix with the single largest impact has been a move from generic to bespoke OSM data through Imposm, followed closely by moving to a more-and-smaller approach with the rendering servers.