Taking Care of Tile Farm

Tile Farm, our map rendering server, has undergone a quiet overhaul during the past three months to deal with wild swings in demand. In March, we released the worldwide Watercolor maps to bring the number of visual designs at maps.stamen.com to three. With great press and kind words comes unprecedented demand, so I’ve been working with Tethr’s Aaron Huslage to change the way we use geographic data to handle the load. We’ve learned a few things along the way that will be important features of Stamen’s future mapping work.

The general theme of our optimization work has been to speed up response times and shrink problems. Tiles were being rendered reliably most of the time, but the overall experience was sluggish and unresponsive. Sometimes, the watercolor tiles would fail altogether. System load was weirdly high, and we didn’t have a reliable way to understand what we were seeing.

Repair #1: Postgres and Imposm

Looking into the system, Aaron noticed that several common map feature queries were taking an unusually long time for Postgres to process, even after basic database tuning to use memory and cache more effectively. Our first significant move was to shrink the cost of those queries, which typically fetched large features at low zoom levels.

We found queries for lakes, forests, riverbanks and similar features that introduced massive overhead due to the number and complexity of their shapes. After doing a bit of research into simplifying and filtering geometries, I remembered Oliver Tonnhofer’s Imposm, mentioned by AJ Ashton in his 2011 State Of The Map presentation on TileMill. Imposm is an OpenStreetMap data importer and an alternative to the older Osm2pgsql. In addition to meeting our need for simpler, fewer shapes through its GeneralizedTable feature, it also answered a bunch of questions we didn’t know we had yet.
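To give a feel for why simpler shapes are cheaper, here is a minimal sketch of line simplification with a distance tolerance. It is not Imposm’s implementation: the Douglas-Peucker algorithm is swapped in as a familiar stand-in, and the function names and sample coordinates are made up for illustration.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment, clamped to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def simplify(points, tolerance):
    """Douglas-Peucker: drop vertices that stray less than tolerance from the line."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, worst = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > worst:
            index, worst = i, d
    if worst <= tolerance:
        # Every interior vertex is within tolerance, so keep only the ends.
        return [first, last]
    # Keep the farthest vertex and simplify the two halves around it.
    left = simplify(points[:index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right

# Hypothetical riverbank with small wiggles that vanish at low zoom.
riverbank = [(0, 0), (1, 0.05), (2, 0), (3, 0.05), (4, 0)]
simplified = simplify(riverbank, 0.1)
```

A larger tolerance removes more vertices, which is the trade being made at low zoom levels, where the lost detail would be smaller than a pixel anyway.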

The first unexpected bonus from using Imposm was that we could create special, custom tables holding exactly the data we’d need for any particular layer. Instead of the resource-sucking Postgres views onto generic point/line/polygon tables offered by High Road, Imposm allows us to create a larger number of rendering-specific tables, each with no more than the data it needs.

The second unexpected bonus is that Imposm is written completely in Python, so adding new post-processing steps to the data is trivial as it’s imported into the database. For example, we’re starting to get rid of long, confusing regular expressions for abbreviating street names and replacing them with procedural code that can be more easily customized and shared.
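As a rough idea of what procedural abbreviation can look like, here is a minimal sketch. The suffix table and the function name are hypothetical, and it only handles a trailing street type; it is not the code used in our import.

```python
# Hypothetical abbreviation table; a real import would carry a much longer one.
SUFFIXES = {
    'Street': 'St.',
    'Avenue': 'Ave.',
    'Boulevard': 'Blvd.',
    'Road': 'Rd.',
}

def abbreviate_street(name):
    """Shorten a trailing street type, leaving one-word names alone."""
    words = name.split()
    if len(words) > 1 and words[-1] in SUFFIXES:
        words[-1] = SUFFIXES[words[-1]]
    return ' '.join(words)
```

Compared with a long regular expression, a lookup table like this is easy to extend for another language or city, and special cases become ordinary `if` statements.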

The nicest surprise from Imposm is that it’s actually much faster to run than Osm2pgsql, thanks to a concurrent design that divides the import into parallel processes.

Repair #2: More, smaller Gunicorns

Tile Farm uses a WSGI server called Gunicorn to host instances of TileStache. A herd of Gunicorn processes is hidden behind the webserver Nginx, which protects them and delivers requests. This is a common arrangement, and in our first design for Tile Farm we had a single Gunicorn configuration with all of our TileStache settings. With a rush of visitors to the site upon launch and the unique design of the Watercolor layer, we had to perform a number of fast, targeted interventions to improve performance and keep up with sudden demand. Watercolor in particular presented some debugging challenges, and we found that it was important to test changes to it in isolation from the other styles.

After working with a separate Gunicorn configuration for just Watercolor on a different port, it was clearly going to be easier to extend the same setup to the remaining Toner and Terrain styles, each of which is actually built up from a number of sub-styles composited together after rendering in Mapnik. The TileStache configuration is now split, with one for Toner, one for Terrain, one for Watercolor and a few others. Instead of interfering with the entire service when performing maintenance, we can now modify individual styles and lower the cost of restarts. Unlike Apache processes, Gunicorns can often be hard to kill and they don’t go down easy, so it’s important to be able to target them more narrowly. The primary drawback to this approach is complexity: each Gunicorn/TileStache setup runs on a separate port, governed by a separate startup script with its own logfile and other reporting. Another drawback is hundreds of persistent connections from Mapnik to PostGIS left open at all times, which Aaron assures me is weird and wrong. Still, the ability to isolate problems and even permit limited crashiness in construction areas has been liberating.
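Gunicorn configuration files are plain Python, so the per-style split can be sketched as one small table. The ports, worker counts and file paths below are invented for illustration; only the setting names (`bind`, `workers`, `proc_name`, `pidfile`, `errorlog`) are real Gunicorn settings.

```python
# Hypothetical ports, worker counts and paths, one entry per map style.
STYLES = {
    'toner': {'port': 8001, 'workers': 4},
    'terrain': {'port': 8002, 'workers': 4},
    'watercolor': {'port': 8003, 'workers': 8},
}

def gunicorn_settings(style):
    """Build the settings one style's Gunicorn config file would assign."""
    conf = STYLES[style]
    return {
        'bind': '127.0.0.1:%d' % conf['port'],
        'workers': conf['workers'],
        'proc_name': 'tilestache-%s' % style,
        'pidfile': '/var/run/tilestache-%s.pid' % style,
        'errorlog': '/var/log/tilestache/%s.log' % style,
    }

watercolor = gunicorn_settings('watercolor')
```

With a separate port, pidfile and log per style, Nginx can route each style to its own herd, and restarting Watercolor leaves Toner and Terrain untouched.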

Repair #3: The Gunicorn Slayer

I mentioned above that Watercolor has its own special snowflake problems. Processes rendering these maps would eventually become unresponsive, not replying to requests and not giving up their slots to be replaced by their parent Gunicorn server. We looked at a number of possible causes and fixed each: watercolor texture bitmaps are now loaded globally instead of per request, and we no longer use metatiles since the CPU cost of Watercolor is linear with bitmap size. Debugging this was confusing, then frustrating, then finally a waste of time. We settled on a more drastic solution and built the Gunicorn Slayer.

The Gunicorn Slayer is inspired by the Netflix Chaos Monkey, “a wild monkey with a weapon in your data center to randomly shoot down instances and chew through cables.” Admittedly less chaotic, ours borrows a strategy from Logan’s Run and retires any Gunicorn process older than a few hours.
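The idea fits in a few lines of Python. What follows is a sketch under stated assumptions rather than the actual Slayer: it assumes a Linux `ps` that supports the `etimes` (elapsed seconds) column, assumes worker processes carry a `gunicorn: worker` title (which Gunicorn sets when the setproctitle package is installed), and picks three hours as a stand-in for “a few hours.”

```python
import os
import signal
import subprocess

MAX_AGE_SECONDS = 3 * 60 * 60  # assumed value for "a few hours"

def parse_ps(output):
    """Parse `ps -eo pid,etimes,args` output into (pid, age, command) tuples."""
    rows = []
    for line in output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            rows.append((int(parts[0]), int(parts[1]), parts[2]))
    return rows

def too_old(rows, max_age=MAX_AGE_SECONDS, match='gunicorn: worker'):
    """Return pids of matching processes older than max_age seconds."""
    return [pid for pid, age, command in rows
            if match in command and age > max_age]

def slay(max_age=MAX_AGE_SECONDS, sig=signal.SIGTERM):
    """Signal every elderly worker; the parent Gunicorn replaces them."""
    output = subprocess.check_output(['ps', '-eo', 'pid,etimes,args']).decode()
    for pid in too_old(parse_ps(output), max_age):
        # A worker that ignores SIGTERM may need signal.SIGKILL instead.
        os.kill(pid, sig)

# Made-up sample of ps output, used to show which process gets retired.
sample = (
    "  PID ELAPSED COMMAND\n"
    "  101   20000 gunicorn: master [watercolor]\n"
    "  102   15000 gunicorn: worker [watercolor]\n"
    "  103     600 gunicorn: worker [watercolor]\n"
    "  104   90000 nginx: worker process\n"
)
victims = too_old(parse_ps(sample))
```

Matching on the worker title matters: signalling the master would take down the whole style instead of a single tired process. Something like `slay()` could be run from cron every few minutes.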

This last fix still feels somewhat dirty since we were unable to determine the root cause of the problem, though if we were to revisit it all again an excellent post from Pinterest’s engineering blog gives some great hints on repurposing process titles and POSIX signals to gain some visibility into a running process.

Then And Now

Maintaining reliability for a project like maps.stamen.com is always going to be a long game of catch-up. Our goal has been to make the project slightly easier to ignore every day, by increasing reliability and finding and fixing trouble spots. The fix with the single largest impact has been a move from generic to bespoke OSM data through Imposm, followed closely by moving to a more-and-smaller approach with the rendering servers.

Demand, Supply

This morning when I got into work, I noticed a small uptick in the bug reports that we get via maps.stamen.com. Curiously, quite a few of them were centered around England. I suppose I had thought that we would have generated most of the English tiles by now, so we could expect smooth sailing there. Then, someone gave the game away:

“A blue tile in the center of London that unfortunately shows up on my desktop background when using Satellite Eyes.”

Friend-of-Stamen, Tom Taylor, announced almost 24 hours ago that he had “made a thing and wrote a blog post about it.” That thing is Satellite Eyes, which changes your Mac desktop wallpaper to a map of where you are. Tom has incorporated Watercolor, Toner and Terrain (in the U.S.) as well as the lovely Bing Aerial map as options.


Here is a screenshot of my desktop, with the Watercolor neighbourhood map of San Francisco in Halftone selected.

After many of us here at the Studio had installed the app, we began to notice that our tile farm was… well… smoking. Since Tom is English, we suspected that many other English people around England had also installed the app, and were happily playing with the preferences that let you switch between map styles and zoom levels. We took to the graphs:


1 is the normal, full usage. 5 is not.

Even though this is all Tom’s fault, it’s also a good thing! Far better to respond to actual demand than to try to optimize prematurely. So, we’ve increased capacity by spreading some of the watercolor rendering load into EC2, and are working on re-creating those “underwater” tiles you might have noticed around the map.

Thanks, Tom! Excellent work!

maps.stamen: Some Known Bugs, What’s To Do

Thanks to our handy bug reporting form, and perhaps spending a little too much time surfing around the world in watercolor at the studio, we’ve isolated a couple of bugs which we’d like to update you on.

The background is, even though the watercolor map has been online for some weeks now, viewers haven’t yet looked at every place at every zoom level. Since it would take aaaages to make all the map tiles for all the zoom levels available, we’ve been creating new tile areas based on people’s viewing activity, and then working to cache tiles for popular areas. (See Jeff’s Log Maps post on content.stamen.com for more information about this.)
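Picking popular areas out of viewing activity can be sketched as a simple tally over requested tile paths. This is only an illustration: the path layout (`/style/zoom/column/row.ext`), the function name and the sample requests are assumptions, not our actual log format or caching code.

```python
import re
from collections import Counter

# Assumed tile path layout: /<style>/<zoom>/<column>/<row>.<ext>
TILE_PATH = re.compile(r'/(\w+)/(\d+)/(\d+)/(\d+)\.\w+')

def popular_tiles(paths, count):
    """Return the most-requested tiles as (style, zoom, column, row) tuples."""
    hits = Counter()
    for path in paths:
        match = TILE_PATH.search(path)
        if match:
            style, zoom, column, row = match.groups()
            hits[(style, int(zoom), int(column), int(row))] += 1
    return [tile for tile, _ in hits.most_common(count)]

# Made-up request paths standing in for a day of logs.
requests = (
    ['/watercolor/12/2046/1362.jpg'] * 3
    + ['/watercolor/12/655/1583.jpg']
    + ['/favicon.ico']
)
top = popular_tiles(requests, 1)
```

The tiles at the top of the tally, and their neighbours, are the ones worth rendering and caching ahead of time.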

This means there are parts of the world that haven’t had their watercolor map made. What we’re finding, too, is that some of the tiles that have already been made were generated incorrectly, and will have to be re-made.

As you click around watercolor world, you might come across maps that look like this:

We’ve been calling this the “Underwater” bug. It seems like it happens in the tile-making process if the machine that constructs the tiles is running too hot. It freaks out at coastlines, and ends up literally flooding the land area with the water texture.

You may also have seen a preponderance of grey while you browse around, like:


Thanks to our superstar efficiency buddy, Aaron Huslage, we think we’ve tracked down the overall issue to machine I/O: the servers’ ability to process inputs and issue outputs. If the I/O is flooded, the software that generates tiles on the fly baulks, and gets more and more underwater. So, to try to reduce that chance of flooding, we need to reduce and simplify the inputs we’re sending through to create new tiles. Step 1 is to try to “simplify the world.”

The theory is that we’re sending a more complicated Make request than we need to. CTO Mike Migurski likens this to killing a whole chicken in order to make a McNugget™. We’re experimenting with ways to reduce the size of OpenStreetMap data ahead of time, for the whole world, because Watercolor in particular is such simplified cartography that it doesn’t need the whole chicken. If we give Cascadenik only what it needs (instead of the whole chicken), that might reduce the machine I/O. Then, we’ll see what happens next…

We’ll post an update to let you know if that worked, or not. Any advice that springs to mind, feel free to post a comment!