Hi! I’m Pavel and I interned at Dropbox over the past summer. One of my biggest projects during this internship was optimizing Python for dynamic page generation on the website. By the end of the summer, I optimized many of dropbox.com’s pages to render 5 times faster. This came with a fair share of challenges though, which I’d like to write about today:
All user-generated strings on the Dropbox website are passed through a special escaping function we’ve written that takes context into account. The problem is that having defined our own escaping function we need to ensure that it’s as fast as possible since basically all user input is passing through this function. In this article we’ll focus on one particular escaping context, HTML text.
A quick disclaimer, before you accuse us of reinventing the wheel and ignoring both the
cgi.escape functions, it doesn’t escape quite enough characters. This is especially true if you consider the possibility for confusing the browser into believing your page is UTF-7, at which point the equal sign has to be escaped. So we have to write our own.
The original function looked like this:
Applying it to some test templates gives (times in seconds; the digits (obviously) aren’t all significant):
: html1 14.1678471565
Not very fast. (I blame the intern who wrote it…) Of course, the code wasn’t optimized for speed, but for readability. But like I said, given that this is actually a non-negligible part of our render time, it could use some optimization.
The first fact of Python optimization is that function calls are slow.
html1 is awful in this regard: it calls a function per character! Worse yet, the common case is calling a function that does nothing at all! So the first step is to inline. This also lets us join the
This has a pretty good 15% or so improvement:
: html2 12.8864150047
Now, the preceding code sample maybe isn’t too pretty, but the improvement is nothing worth sneering at. There’s more we can do though. The second fact of Python optimization is that the loop overhead can also be pretty significant. This leads to our second attempt, which gets rid of the loop in favor of a generator expression. Fortunately, switching to generator expressions makes the resulting function as readable as the original.
What are the timings with this new version?
: html3 13.4748219418
Hmm. The readability improvements are nice, but the speed dropped. Maybe we can get both speed and reability with some tweaks?
More OptimizationsWe already picked out a neat 10% improvement and still have a wonderfully readable function. But we can go faster…
The problem with the generator expressions used above is that Python actually constructs a generator. Let’s look at the bytecode:
Luckily, we can avoid making the generator by simply using a list comprehension instead of a generator expression:
This brings us back to faster than
: html4 11.2531888485
Looks like the generator expression is actually slower than the list comprehension in this case, a good explanation is probably that our sample set of strings are probably too small to reap the benefits of using a generator. As expected, our disassembly now looks a lot more friendly:
But this is still slower than it should be. One low-hanging fruit still stands out: why are we doing a linear search on the whitelist?
The timings bear out the guess that sets are faster than strings:
: html5 8.6205868721
The question remains, Can we go faster?
Deeper Python Optimization
If we’re going to optimize this function, we might as well extract all the performance we possibly can. One peculiarity of modern Python is that it has two
LOAD_FAST. Due to a implementation detail loading a local variable is faster than loading a global variable. Consequently, you want to avoid global variables in tight loops. The above disassembly pointed to two such globals:
We don’t expect this to buy us much, but why not?
: html6 7.87281298637
In absolute measurement a second is not much of an improvment but as a percentage it’s still a significant of time.
String interpolation is another thing Python isn’t very fast at. Let’s see if removing that helps.
: html7 5.58323383331
That’s a huge boost from saving the string interpolation!
Every Problem Can be Solved With a Big Enough Hash Table
Any web programmer knows that performance problems can be universally solved with caching. (Heh heh, if only…) In any case, if the gain from removing string interpolation was so great, maybe we can just cache the results of that characters to escaped form: function, and not do string concatenation either. We’ll set up a cache of characters to HTML escapes:
Then we read and write our cache as necessary:
Since our web servers are long-running processes, the cache should eventually capture all of the characters people are using; Python’s dicts are then fast enough to give us a speed boost over string concatenation and =str(ord(c))=.
: html8 4.5724029541
Another big boost from avoid what seems like a trivial computation.
If we’re going down the
setdefault branch so rarely, and since that’s the only place we’re using
ord, maybe it’s not worth making those local variables?
We don’t expect much of a benefit here, but maybe a percent change or so…
: html9 4.565928936
Wait, Why a Python Loop At All?
Wait, why are we using a Python loop at all? Python loops are slow… Instead, maybe we can use the C code in Python’s
re module? We can use regexes to find anything that isn’t in our whitelist, and do the replacement. That way, the “do-nothing” operation on whitelisted characters becomes much cheaper.
Unfortunately, there’s no way to get the entire matched range from a Python
match object. So we’re forced to call the
group method– and function calls are slow!
What are our results?
: htmlA 4.54020690918
Hmm, this isn’t that great, actually. We did save a percentage point, but this is a lot less readable (at least, in my opinion) than
html9, and the gains aren’t worth it.
Until, that is, we try it on whitelisted-only text:
: html9 3.84376811981
: htmlA 0.796116113663
html9 got a bit faster, by virtue of avoiding hitting the cache at all, but the regex-based solution *rocked*, since it could do all of the skipping of characters in C, not Python.
In fact, the regex-based solution is slower for punctuation, since it has to do multiple function calls instead of just one function call and dictionary lookup. And for short strings, the overhead to initiating the regex search is worse. Let’s try a test on punctuation-heavy, short text snippets:
: html9 1.59476995468
: htmlA 3.44844794273
Measure Twice, Cut Once
So we have one function that’s faster for English alphanumeric text and one that’s faster for everything else. But we can’t just shaft our non-US users, and we don’t want to settle for a function that’s five times slower for the common case! So we have a few options.
The first is simply to expand our whitelist — most file names have a dot in them, and spaces, dashes, and similar are popular. At signs appear in emails. And so on. Of course, one has to be careful not to permit any XSS vector while doing so; but a conservative expansion by adding
-|!,. _ to our whitelist should be safe. Of course, this helps both versions, so it’s not really a fix.
Another fix presents itself, though: can we figure out which version of
html to use quickly?
.6, you ask? Well, the exact constant isn’t too important (the function *works* whichever way the test goes, it’s just an optimization), but some testing showed it to be the approximate break-even point of the two methods.
With hope and trepidation, we run the tester…
: htmlB 5.16241598129
: html9 3.97228693962
: htmlA 3.95208191872
As we hoped, the result is fast on whitelisted-only characters…
: htmlB 1.24477005005
: html9 3.41327309608
: htmlA 0.696345090866
And is passable on punctuation:
: htmlB 5.97420597076
: html9 3.61161899567
: htmlA 8.88924694061
But the overhead is pretty saddening…
We can improve things a bit by testing just a few characters for punctuation/unicode.
We use =6= instead of =.6 * min(10, len(s))= because if the string is shorter than 10 characters, either alternative is going to be fast.
This leads to a marked improvement. For alphanumeric strings:
: htmlC 0.836707115173
: html9 3.34154415131
: htmlA 0.701889276505
For unicode-heavy strings:
: htmlC 3.70150613785
: html9 2.82831597328
: htmlA 6.77485609055
This is now really looking like an option for production use. But checking is still quite a bit of overhead still– in the case where the user has just about the break-even balance of English and international characters, we’re looking at a 20% premium. Can we do better?
The User is Always Right
Well, so checking which to use has very high overhead. What to do? Well, we’ve got to think about when each case will come up: how might we predict that lots of non-English-alphanumerics are going to come up? What users are likely to store files with international names, or use an international name? (Hopefully, people who use lots of punctuation in both are few…)
Well luckily, Dropbox added translations a bit back, so we already have all of the infrastructure in place to detect what locale a user is from. So a quick optimization is to switch to the
html9 version (the list-comprehension-based one) for international users, and use the fast
htmlA version (regexes) for US-based users, and also users from countries with mostly-Latin alphabets. The code here is mostly tied to Dropbox internals, so I’m not going to show it, but I’m sure you get the point.
This final optimization removes the check overhead while giving us most of the benefit. Success!
What We Didn’t Do
Now there are some ways to optimize the above even further. The obvious approach is to rewrite the entire thing in C. This lets us squeeze the last bits of performance out of the above, but it would also make things much harder to maintain.
Similarly, using PyPy or a similar JIT would probably help on the inner loop of the pure-Python version, likely making it as fast as the regex-based approach.
The first, most basic conclusion is that the basic facts of Python optimization inline functions, use implicit loops, move computation into C if possible are perfectly valid. Another fact: Python dictionaries are *fast*. The
set and the
CACHE_HTML_ESCAPES dict both rely on the superb hash table implementation for their performance gains.
Other “common wisdom”, like using locals instead of globals, yields relatively little gain.
Optimizing inner loops in a high-level loops requires lots of trial and error. It was absolutely amazing that moving to string concatenation from string interpolation gave such a huge boost to performance.
Finally, measurement is key. It was measurement that told us that HTML escaping was slow to begin with; and without measuring performance, we would never have guessed that string interpolation was so slow.
Thanks for reading!