Building Carousel, Part I: How we made our networked mobile app feel fast and local

Posted by Stephen Poletto on April 14, 2014

When we began the journey of building a mobile app for Dropbox a few years ago, we started simple — our Android and iOS apps allowed our users to view their files on the go, and cache them for offline access. As smartphones became more popular, we realized we could provide another great service on mobile: automatic backup of all the photos and videos taken on these devices, so they’d be safe forever in Dropbox.

Last Wednesday, on April 9, we took another giant leap forward with the introduction of Carousel. Carousel is a single home for all your photos and videos, independent of whether they’re local to the device you’re using, or already backed up to Dropbox.

While Carousel seems pretty simple on the surface, there were a number of technical challenges we faced in building it. We needed to ship both an Android app and an iOS app on day one, which required us to think critically about how to share code between the two platforms. In order to support collections of over 100,000 photos, we needed to prioritize performance and find a way to beat the garbage collector on Android. In the coming weeks and months, we want to share the story of how we went about building Carousel and provide some insight into the hard engineering problems we solved along the way. Today, we’re going to focus on how we built Carousel to feel fast, responsive, and local, even though the data on which users operate is ultimately backed by the Dropbox servers.

Make it Faster!

As we thought about what we wanted in the next iteration of a mobile photos product, we kept coming back to this guiding principle:

A Dropbox-powered gallery app can and should be just as fast as a local gallery app and should never force the user to wait to complete an action. Users should be able to view, curate, and share their photos regardless of state; they should never have to wait or worry about which photos are local and which are not.

As long as our app was slower than a local gallery app, we knew it would never become the central place where our users go to view and interact with their photos. With this guiding principle in mind, we took a critical look at the Dropbox app’s photos tab, and identified two key problems that made the app feel way too slow:

1. The photos tab makes blocking HTTPS requests in order to sync user actions to the server. For instance, when the user tries to share a photo, this is what they see:

The same is true when the user tries to delete a photo from the photos tab. In the event of no connectivity, these requests outright fail and require the user to try again later.

2. There’s no way to view and interact with photos that are local only to the device (i.e. not yet uploaded to Dropbox).

These two problems, when combined, made the app especially difficult to use in the context of sharing photos with others. In order to share using the Dropbox app, users first had to wait for their photos to back up, then wait on a blocking network request to complete the share. The app also couldn’t be used as a replacement for a traditional gallery, since photos captured offline couldn’t be viewed at all.

Client Architecture

To solve these problems, we first need to build a unified data model, in which local photos and remote photos are treated as equivalent objects, with all the same capabilities and properties. Second, considering that humans can perceive application response delays at around the 100 ms mark, we simply can’t afford to make user-visible blocking network calls. Instead, we need to build an eventually consistent system, where the user can perform some action, immediately see the effect of that action locally, then eventually see the effect of that action globally on other devices. In the academic world, this is known as optimistic replication.

To build a merged view of both local and server content, we first need the client to stay up to date with changes that are happening remotely on the server. To achieve that, we use HTTP long polling to get notified of changes, and use a variant of our delta API to pull those changes down. Delta returns the changes that have occurred to a user’s Dropbox since the last time the client called up to the server. That is, it provides the additions, deletions and modifications to photo metadata that have occurred since the prior cursor. When we fetch these changes, we write the most up-to-date server metadata into a server_photos table in SQLite. The server_photos table is purely a cache of the “truth,” which lives on the server.
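To make this concrete, here is a rough sketch, in Python with an assumed schema, of what applying a page of delta results to the server_photos cache might look like. The table name comes from the description above; the column names, the entry format, and apply_delta_page itself are illustrative assumptions, not the actual client code.

import sqlite3

def apply_delta_page(db: sqlite3.Connection, entries, cursor: str) -> None:
    """Apply one page of delta results: upserts for changed photos, deletes for removals."""
    with db:  # one transaction per page keeps the cache consistent with the stored cursor
        for server_id, metadata in entries:
            if metadata is None:
                # A missing payload means the photo was removed from the user's Dropbox.
                db.execute("DELETE FROM server_photos WHERE server_id = ?", (server_id,))
            else:
                db.execute(
                    "INSERT OR REPLACE INTO server_photos (server_id, hash, hidden_on_server) "
                    "VALUES (?, ?, ?)",
                    (server_id, metadata["hash"], metadata["hidden"]),
                )
        # Persist the cursor so the next call to delta resumes where this one left off.
        db.execute(
            "INSERT OR REPLACE INTO sync_state (key, value) VALUES ('delta_cursor', ?)",
            (cursor,))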

Meanwhile, our client-side camera roll scanner computes a fast hash of each photo to determine which photos have not yet been backed up to Dropbox. We turn a photo that needs to be uploaded into a photo_upload_operation, and likewise serialize it into SQLite.

Finally, before we can render the view, we have a third input source in the form of client-side user actions. Whenever the user hides or deletes a photo in Carousel, we want the action to take effect instantly. We can then asynchronously write that change back to the server. To do so, we construct a HideOperation, or DeleteOperation, which also gets persisted to SQLite.

Every user action in Carousel thus becomes an operation, which will eventually be synced to the server. These operations are placed into in-memory operation queues and persisted to SQLite so they survive across app launches. For each queue, there’s a dedicated operation sync thread, which waits until an operation is ready to execute, then makes the HTTPS call necessary to submit the change to the server. Whenever we need to render a view to the user, we consult these pending operations to make sure we’re reflecting the user’s latest actions. It’s only safe to remove these operations once we’re certain we’ve seen their effect come down in the delta. We thus end up with an architecture that looks like this:
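As a rough illustration of the queue mechanics, here is a minimal Python sketch of a persisted operation queue with a dedicated sync thread. The table layout, the JSON payload format, and the submit_to_server callback are assumptions, and for brevity this sketch drops an operation as soon as the server accepts it rather than waiting to see its effect come down in the delta.

import json
import queue
import sqlite3
import threading

class OperationQueue:
    def __init__(self, db_path: str, submit_to_server):
        self._db = sqlite3.connect(db_path, check_same_thread=False)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS pending_ops (id INTEGER PRIMARY KEY, payload TEXT)")
        self._queue = queue.Queue()
        self._submit = submit_to_server
        # Re-enqueue operations that survived a previous app launch.
        for op_id, payload in self._db.execute("SELECT id, payload FROM pending_ops ORDER BY id"):
            self._queue.put((op_id, json.loads(payload)))
        threading.Thread(target=self._sync_loop, daemon=True).start()

    def enqueue(self, op: dict) -> None:
        with self._db:
            cur = self._db.execute("INSERT INTO pending_ops (payload) VALUES (?)",
                                   (json.dumps(op),))
        self._queue.put((cur.lastrowid, op))

    def _sync_loop(self) -> None:
        while True:
            op_id, op = self._queue.get()   # blocks until an operation is ready
            self._submit(op)                # HTTPS call; the real app retries on failure
            # Simplification: remove the row once the server accepts the call.
            with self._db:
                self._db.execute("DELETE FROM pending_ops WHERE id = ?", (op_id,))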

Let’s walk through an example of rendering the primary grid view to the user.

class DbxPhotoClient {
    list<DbxPhotoItem> list_photos();
};

Inside the implementation of list_photos, we:

1. Read all server photo metadata out of the server_photos table.
2. Add in all the local photos pending upload.
3. Remove photos which have been hidden or deleted.

For example, suppose our server_photos table contains the following data:

Server ID | Hash   | Hidden On Server
A         | hash_a | No

Our photo_upload_operations table contains the following data:

Camera Roll ID | Hash
B              | hash_b

And our photo_modification_operations table contains the following data:

Operation Type | Photo ID(s)
Hide           | [Server ID = A]

Our call to list_photos() will produce as final output the result of unioning local and server content, then applying the pending hide:

ID | Hash   | Is Hidden
A  | hash_a | Yes
B  | hash_b | No
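Here is a rough Python sketch of that merge, written against the example tables above. The SQL schema and row shapes are assumptions, and the real app keys photos by LUID (introduced below) rather than by raw server and camera roll IDs.

def list_photos(db):
    photos = {}  # keyed by ID so later steps can modify earlier entries

    # 1. Start from the cached server metadata.
    for server_id, photo_hash, hidden in db.execute(
            "SELECT server_id, hash, hidden_on_server FROM server_photos"):
        photos[server_id] = {"hash": photo_hash, "is_hidden": bool(hidden)}

    # 2. Union in local photos that are still pending upload.
    for camera_roll_id, photo_hash in db.execute(
            "SELECT camera_roll_id, hash FROM photo_upload_operations"):
        photos.setdefault(camera_roll_id, {"hash": photo_hash, "is_hidden": False})

    # 3. Re-play pending hide/delete operations on top of the cached state.
    for op_type, photo_id in db.execute(
            "SELECT op_type, photo_id FROM photo_modification_operations"):
        if op_type == "hide" and photo_id in photos:
            photos[photo_id]["is_hidden"] = True
        elif op_type == "delete":
            photos.pop(photo_id, None)

    # With the example data: {'A': {'hash': 'hash_a', 'is_hidden': True},
    #                         'B': {'hash': 'hash_b', 'is_hidden': False}}
    return photos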

Note that in practice, forcing the UI to call list_photos() to perform a read from SQLite every time there’s a change to the photo model would be prohibitively expensive. Instead, we keep the photo model loaded in memory, and modify it as changes come in (either via user actions in the app, or remote changes on the server). This is not all that different than the delta API we use to sync down changes from the server. To keep things fast, we essentially introduce another level of delta between disk and memory. In the next blog post, we’ll take a look at exactly how this works, and how it enabled us to build an app that can handle over 100,000 photos.

The key idea in the example we walked through above is that applying a client-side photo addition and hide on top of cached server state should provide the same result as eventually uploading the photo and applying the hide on the server. Whenever we render data in Carousel, we first consult the cached server state, then “re-play” pending operations on top of it. In the case of hide & delete, we then rely on last-write-wins semantics on the server to resolve any conflicts.

This works really well for photos that are already in Dropbox, since the photos already have server IDs. Each pending operation can store the server ID(s) on which it should be applied. But what happens when we want to allow modifications to photos that haven’t finished uploading yet? As an additional constraint, keep in mind that due to the multi-platform nature of Dropbox, the photo might be uploaded from a source other than the Carousel client. Even when that happens, we still need to resolve any pending actions that were taken on that photo.

Identifying a Photo

There are a few different ways to ensure an action taken on a local-only photo gets synced to the server properly. We wanted something simple and relatively stateless to keep the client-side logic easy to reason about. To achieve this, we introduced the concept of a device-specific ID, henceforth referred to as a LUID (locally unique ID), as the canonical way to refer to each photo. A LUID is a stable identifier, meaning it can be used to refer to a photo both before and after it has been uploaded. A LUID is simply an autoincrement integer, and it works like this:

When we scan the device for new photos and find a photo that needs to be uploaded to Dropbox, we create a LUID for that local photo. We then add an entry in the local_photo_luids table, which maps the LUID to its native camera roll ID.

When a new server photo S comes down in delta, we check if S.hash matches any local photo hashes. If not, we create a new LUID, and add an entry to the server_photo_luids table, which maps the LUID to its server ID.

In the event the hash does match some local photo L, it means L has finished uploading and we now have its server metadata available. We assign S.photo_luid = L.photo_luid. At the same time, we also mark the relevant photo_upload_operation as completed. To prevent conflicts (for instance if the same photo gets added to the user’s Dropbox multiple times), the first server photo with the same hash is the one that will “complete” the upload operation and claim the LUID.
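A minimal sketch of this resolution step, assuming a hypothetical SQLite layout (a luids table for minting IDs, plus completed and photo_luid columns on photo_upload_operations), might look like this:

def luid_for_server_photo(db, server_id, photo_hash):
    # If a pending upload has the same content hash, that upload just completed:
    # the server photo claims the local photo's existing LUID.
    row = db.execute(
        "SELECT photo_luid FROM photo_upload_operations "
        "WHERE hash = ? AND completed = 0 ORDER BY rowid LIMIT 1",
        (photo_hash,)).fetchone()
    if row is not None:
        luid = row[0]
        db.execute("UPDATE photo_upload_operations SET completed = 1 WHERE photo_luid = ?",
                   (luid,))
    else:
        # Otherwise mint a fresh LUID (an autoincrement integer).
        luid = db.execute("INSERT INTO luids DEFAULT VALUES").lastrowid
    # Either way, record the server ID so later operations can resolve LUID -> server ID.
    db.execute("INSERT OR REPLACE INTO server_photo_luids (photo_luid, server_id) VALUES (?, ?)",
               (luid, server_id))
    return luid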

You’ll notice that, by using this logic, we always have a stable way to refer to a particular photo without worrying about whether it is on the server or not. This reduces a lot of complexity in the app, since the UI can simply treat LUIDs as the basis of equality between photo objects. When a local photo finishes uploading, we don’t need to track down each reference to it and “upgrade” the reference to use the new server ID. The LUID abstracts that away.

Faster Sharing

With LUIDs in hand, let’s take a look at what happens when a user shares in Carousel.

Suppose the user selects a batch of photos, some of which are still local only to the device, and some of which are already in Dropbox. Even if one of these photos finishes uploading while the user is still selecting photos, their selection will be preserved, since the selection set is based on LUIDs.

After the user selects the recipients with whom they’d like to share, we can construct the corresponding share operation.

DbxShareOp op(photo_luids, recipients);
op.persist(); // Save the operation to SQLite

When we render the resulting conversation view, we read the cached server state for the conversation uniquely identified by the recipients. We then re-play this pending share on top of it, just like all the operations we’ve seen before. We could spend a whole blog post going into more depth here, but for now we’ll defer that discussion.

If any of the LUIDs within the share operation are still local only (i.e. they do not have entries in the server_photo_luids table), then we know the share is not yet ready to be submitted to the server. The share operation queue can therefore sleep, and wait until the local-only photos in question are uploaded. We consider this a dependency on the share operation, which must be resolved before the operation is ready for remote execution. As part of constructing the share operation, we also mark the relevant photo_upload_operations as “blocking a share”, so that they become re-prioritized to the front of the upload queue.

When the dependent photo uploads complete, the share operation is ready to execute on the server. We look up the server IDs (via the server_photo_luids lookup table) and send a request to the server to perform the share.
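Sketched in Python under the same assumed schema, the readiness check for a pending share might look something like this (submit_share_to_server is a hypothetical helper, and share_op is assumed to carry its LUIDs and recipients):

def try_execute_share(db, share_op):
    server_ids = []
    for luid in share_op.photo_luids:
        row = db.execute("SELECT server_id FROM server_photo_luids WHERE photo_luid = ?",
                         (luid,)).fetchone()
        if row is None:
            # Still local-only: the share has an unresolved dependency, so the share
            # queue sleeps until the blocking upload finishes and wakes it up.
            return False
        server_ids.append(row[0])
    # Every photo now has a server ID, so the share is ready for remote execution.
    submit_share_to_server(server_ids, share_op.recipients)  # hypothetical HTTPS call
    return True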

The best part is that all of this happens asynchronously, so the user is free to continue using the app, or go about their day. No spinners, no waiting.

Lessons Learned

The big lesson we learned from building the Dropbox app photos tab was: don’t block the user! Instead of requiring changes to be propagated to the server synchronously, we built Carousel from day one as an eventually consistent system. With mobile networks still slow and unreliable, we knew this would be the only way to deliver a Dropbox-backed gallery that felt fast and local.

The asynchronous, delta-based design of our mobile library empowered us to build an app that was much faster than the Dropbox photos tab. This design enabled us to hide the latency between client and server from the user. In the next installment of this series, we’ll go into more depth on the latency between disk and memory, and how optimizing that was also critical to making the app feel fast.

In the meantime, go download Carousel, and let us know your thoughts!

 

Introducing Pyston: an upcoming, JIT-based Python implementation

Posted by Kevin Modzelewski on April 3, 2014

Hello everyone, I’m very excited to announce Pyston, a new open-source implementation of Python, currently under development at Dropbox.  The goal of the project is to produce a high-performance Python implementation that can push Python into domains dominated by traditional systems languages like C++.

Here at Dropbox, we love Python and try to use it for as much as we can.  As we scale and the problems we tackle grow, though, we’re starting to find that hitting our performance targets can sometimes become prohibitively difficult when staying on Python.  Sometimes, it can be less work to do a rewrite in another language.  I personally love Python, and it pains me every time we decide to rewrite something, so I wanted to do something about it.  After some abandoned experiments with static compilation, we looked around and saw how successfully JIT techniques are being applied in the JavaScript space: Chrome’s V8 engine, in particular, has greatly pushed the status quo of JavaScript performance.  Our hope is that by using similar techniques, we can achieve similar performance improvements for Python.

Pyston is still in the earliest stages and is not ready for use, but we’re hopeful that by announcing it early in its lifecycle and open-sourcing the code, we can collaborate with the Python and JIT communities throughout its development.  There’s only room for so much detail in this blog post, but we wanted to talk about why we think we need a new Python implementation, and go into a little bit of how Pyston works.

Why a new implementation

There are already a number of Python implementations using JIT techniques, often in sophisticated ways.  PyPy has achieved impressive performance with its tracing JIT; Jython and IronPython are both built on top of mature VMs with extensive JIT support.  So why do we think it’s worth starting a new implementation?

In short, it’s because we think the most promising techniques are incompatible with existing implementations.  For instance, the JavaScript world has switched from tracing JITs to method-at-a-time JITs, due to the compelling performance benefits.  Whether or not the same performance advantage holds for Python is an open question, but since the two approaches are fundamentally incompatible, the only way to start answering the question is to build a new method-at-a-time JIT.

Another point of differentiation is the planned use of a conservative garbage collector to support extension modules efficiently.  Again, we won’t know until later whether this is a better approach or not, but it’s a decision that’s integral enough to a JIT that it is difficult to test in an existing implementation.

The downside of starting from scratch is, unsurprisingly, that creating a new language implementation is an enormous task.  Luckily, tools are starting to come out that can help with this process; in particular, Pyston is built on top of LLVM, which lets us achieve top-tier code generation quality without having to deal with the details ourselves. Nonetheless, a new Python implementation is a huge undertaking, and Pyston will not be ready for use soon.

How it works

At a high level, Pyston takes parsed Python code and transforms it to the LLVM intermediate representation (IR).  The IR is then run through the LLVM optimizer and passed off to the LLVM JIT engine, resulting in executable machine code.  LLVM contains a large number of optimization passes and mechanisms for easily adding more, which can lead to very fast code.

The problem, though, is that LLVM can’t reason about Python code, because all the low-level behavior is hidden behind the type dispatching you have to do in any dynamic language. To handle this, Pyston employs type speculation: it is typically impossible to prove that a variable will have a specific type, but Pyston can often predict with some confidence what the type of an object will be. Once a prediction is made, Pyston verifies it at runtime, branching between a fast path where the prediction holds and a slow path where it doesn’t.
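As a purely conceptual illustration (written in Python, not Pyston’s actual generated code), a speculated integer addition boils down to a guard plus two paths:

def add(a, b):
    # Speculation: profiling suggested both operands are ints.
    if type(a) is int and type(b) is int:
        return a + b              # fast path: no dynamic dispatch needed
    return generic_add(a, b)      # slow path: full dynamic-type handling

def generic_add(a, b):
    # Stand-in for generic binary-operation logic (__add__/__radd__ lookup and so on).
    return a.__add__(b)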

Pyston also includes other modern techniques, such as hidden classes for fast attribute lookups and inline caches for fast method calls.  You can find more technical details on the GitHub page, along with a separate blog post that goes into greater depth.

Current state

Pyston is still in its infancy and right now only supports a minimal subset of the Python language.  It’s not quite fair to state benchmark numbers, since 1) Pyston doesn’t support a large enough set of benchmarks to be representative, and 2) Pyston doesn’t support all runtime features (including ones that might introduce slowdowns), so it’s not a true apples-to-apples comparison.  With those caveats, Pyston generally is able to beat CPython’s performance, but still lags behind PyPy.

The code has been released on GitHub under the Apache 2.0 license, along with a growing amount of technical documentation.  There’s a lot of work to be done, and we’re looking to grow the team: if this kind of thing interests you, please apply!

Stay tuned for more updates as the project progresses.  If you’d like to subscribe to our announcement mailing list, you can do so here.

Video Processing at Dropbox

Posted by Pierpaolo Baccichet on February 18, 2014

Every day millions of people upload videos to Dropbox. Besides wanting their memories safe forever, they also want to be able to watch them at any time and on any device. The playout experience should feel instant, despite the fact that the content is actually stored remotely. Low latency playback of content poses interesting technical challenges because of the three main factors below.

1. Codec diversity

Most end users are familiar with extensions like .mp4, .avi, .flv, but not everybody is familiar with the fact that the file extension does not necessarily match the internal encoding of the content. People assume that an .mp4 file will certainly play on a Mac laptop, but that’s not always a safe assumption because the content might be encoded with some Microsoft/Google/Adobe/RealMedia specific codec (e.g. VC1/VP8). The video codec landscape has been very fragmented for at least 30 years now, and despite the efforts of MPEG to create open standards, the situation is still quite messy. The good news is that modern phones tend to produce mostly coherent content using H.264/AVC as codec and MPEG-4 container format, which indeed corresponds to the majority of the content we see in Dropbox.

2. Limited end-user network bandwidth

Users access their Dropbox content either via their home/office connection or via a mobile connection. Leaving mobile aside, even home connections are not as fast/reliable as most network providers advertise (see the ISP speed report from Netflix for some fun numbers), so bandwidth adaptation is a must to guarantee a fluid video playout.

3. Client capabilities

Different client devices impose different constraints, mostly due to the underlying hardware chipsets, both in terms of memory bandwidth and CPU power. For instance, the iPhone 3GS only supports baseline profile H.264/AVC.

The solution to these problems is to transcode (decode and re-encode) the source video to a target resolution/bit rate and codec that is suitable for a given client. At the beginning of the development of this feature, we entertained the idea of simply pre-transcoding all the videos in Dropbox for all possible target devices. Soon enough we realized that this simple approach would be too expensive at our scale, so we decided to build a system that allows us to trigger a transcoding process only upon user request and cache the results for subsequent fetches. This on-demand approach:

  1. adapts to heterogeneous devices and network conditions,
  2. is relatively cheap (everything is relative at our scale),
  3. guarantees low latency startup time.

HTTP Live Streaming is your friend

We managed to achieve the first two goals above by using HTTP Live Streaming (HLS). The basic idea of HLS is to structure the data in several playlists describing quality layers and data segments that are transmitted over HTTP. The standard was born as an Apple-specific solution but is now an open RFC and is also (partially) supported on Android devices. The protocol effectively works in three steps that the player has to follow sequentially. The very first URL a player hits returns the main HLS playlist file, which in our case looks something like this:

#EXTM3U
#EXT-X-PLAYLIST-TYPE:VOD
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=150000
https://streaming.dropbox.com/stream/<access_token_layer_1>
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=500000
https://streaming.dropbox.com/stream/<access_token_layer_2>
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1500000
https://streaming.dropbox.com/stream/<access_token_layer_3>
#EXT-X-ENDLIST

Each #EXT-X-STREAM-INF tag provides a URL to a playlist for content at a different target bit rate. In the example above, we have 3 layers at increasing qualities, so the player can pick the best one for a given connection speed. The second step consists of fetching the layer playlists, potentially parallelizing the operation for all layers to save roundtrip time. Each layer playlist looks something like:

#EXT-X-VERSION:3
#EXT-X-PLAYLIST-TYPE:VOD
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
https://streaming.dropbox.com/stream/<access_token_layer_1_segment_1>
#EXTINF:10.0,
https://streaming.dropbox.com/stream/<access_token_layer_1_segment_2>
[...]
#EXTINF:10.0,
https://streaming.dropbox.com/stream/<access_token_layer_1_segment_N>
#EXT-X-ENDLIST

Unlike the first response, each URL in the layer playlist now points to a segment of actual data. Apple has very good technical notes providing generic recommendations on how to encode and segment content for HLS streaming.

System overview

Figure 1 – Diagram of the transcoding system

To begin streaming, a client application first issues a request to our web servers to obtain a temporary token (in the form of a URL) for the main HLS playlist. Since video playout typically happens on dedicated players that are not necessarily part of a client application, the token includes a one time password and expiration information that enables our servers to authenticate the external player before returning the content to it. The handler of this first request verifies if the content is already cached and, if that’s not the case, kicks off a transcoding job with different parameters based on client capabilities and network connection. Since H.264/AVC video transcoding is an extremely intensive operation and each transcoder machine can only perform a limited number of transcodes in parallel, it’s important to pick the best worker at every request. The URL we return to the client also embeds information that allows us to route the request back to the transcoding worker, which is important to be able to serve the content while it’s being transcoded and before it’s cached in our backend.

Our worker clusters are implemented on Amazon AWS and consist of the following components:

  • live transcoding servers are beefy cc2.8xlarge instances that can run several transcoding processes in parallel while still serving users’ requests. We have a hard limit on the number of concurrent transcodes on a given box and expose an application-layer health check that allows us to temporarily take the machine out of service if the limit is exceeded. From our experience, each cc2.8xlarge machine can perform up to a dozen transcoding jobs in parallel before it starts falling behind.
  • memcache is used for distributed coordination of the transcoding jobs. We use memcache to a) track progress of a job and b) report machine load. We use the load information to implement a good load balancing scheduler. This is crucial to prevent machines from getting overloaded.
  • front end load balancer runs on cc1.xlarge instances and is powered by Nginx and HAProxy. We use Nginx for SSL termination and HAProxy to quickly take machines out of service when they are overloaded and fail the health check.
  • persistent cache of transcoded material happens in a separate storage system. As always, storage is cheap compared to CPU so we store the results of the transcoding process for a certain amount of time to serve them back in subsequent requests. We maintain references to cached data in our internal databases so we can implement different retention policies based on users’ usage patterns.

Preparing, encoding and segmenting

We use ffmpeg for the actual transcoding because it supports most formats. Our pipeline implements the following three steps.

1) prepare the stream for cooking

Since we want to stream data as we transcode it, we need to rearrange the input stream in a way that is suitable for piping it into ffmpeg. Many people refer to this process as “fast-starting” the video, and there are a few tools available on the internet that can help you get started. Ultimately, we wrote our own solution in Python to allow us to debug issues and profile performance. In practice, fast-starting an mp4 consists of extracting the “moov atom,” which contains most of the video’s metadata, moving it to the beginning of the file, and then adjusting the internal offsets to the data accordingly. This allows ffmpeg to immediately find the information about resolution, duration and location of data atoms and start the transcoding as the data is fed into it.
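As a hedged illustration of the idea (not our production tool), checking whether an mp4 is already fast-started can be done by walking the top-level atoms and seeing whether moov appears before mdat; rewriting the offsets inside moov is considerably more involved and is omitted here.

import struct

def is_fast_started(path: str) -> bool:
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                return False                      # hit EOF without seeing moov
            size, atom_type = struct.unpack(">I4s", header)
            if size == 1:                         # 64-bit "largesize" atom
                size = struct.unpack(">Q", f.read(8))[0]
                size -= 8                         # account for the extra size field
            if atom_type == b"moov":
                return True                       # metadata comes first: streamable
            if atom_type == b"mdat":
                return False                      # media data first: needs fast-start
            if size < 8:
                return False                      # "extends to EOF" or malformed; bail in this sketch
            f.seek(size - 8, 1)                   # skip to the next top-level atom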

2) re-encode the stream

The command line for ffmpeg looks something like the following:

ffmpeg -i pipe:0 -dn -vcodec libx264 -vsync 1 -pix_fmt yuv420p -ac 2 
  -profile:v baseline -level 30 -x264opts bitrate=<rate>:vbv-maxrate=<rate> 
  -rc-lookahead 0 -r <fps> -g <fps> -refs 1 -acodec libfaac -async 1 
  -ar 44100 -ab 64k -f mpegts -s <target_resolution> -muxdelay 0 pipe:1

We use H.264/AVC baseline profile level 3.0 to guarantee compatibility with all devices, including the iPhone 3GS (we are planning to improve on that in the near future). Some of the parameters are the result of us trading off a bit of quality to minimize the startup time for live transcoding. Specifically, we found that reducing the value of muxdelay, having only one reference frame, and disabling scenecut detection all contributed to reducing the latency introduced by ffmpeg. The output container format is MPEG transport stream, as required by HLS.

3) Segment the transcoded output

The output of ffmpeg is segmented with a C++ tool we developed internally on top of libavcodec. Apple provides a segmenter tool, but we decided not to use it because it runs only on Mac (we are Linux friends, like most readers here) and does not natively support pipelining. Also, recent versions of ffmpeg (we use 2.0) come with a segmenter tool, but we found it introduces significant latency into our pipeline. In summary, we ended up writing our own tool because it allows us to 1) optimize end-to-end latency, 2) guarantee the presence and the positioning of IDR (Instantaneous Decoder Refresh) frames in every segment and 3) customize the length of the segments we generate.

The last point is particularly important because the length of a video segment is directly proportional to its transmission time from the server to the client on a bandwidth-constrained channel. On one hand, we want very short segments to lower the transmission time of each of them; on the other hand, we’d like to minimize the number of in-flight requests and the overhead per request due to the roundtrip latency between the client and the server. Since we are optimizing for startup latency, we begin with smaller segments and then ramp up to longer ones to diminish request overhead, up to the target segment length of 5 seconds per segment. Specifically, the lengths of the very first segments look something like 2s, 2s, 3s, 3s, 4s, 4s, 5s, …. We picked these values because a) the standard poses some restrictions on how fast we can increase the length of consecutive segments (this is to avoid possible underflows) and b) Android does not allow for fractional segment lengths (those were introduced in version 3 of the HLS standard).
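A tiny sketch of that ramp, where the schedule simply reproduces the example sequence above:

def segment_durations():
    ramp = [2, 2, 3, 3, 4, 4]
    for d in ramp:
        yield d
    while True:          # every remaining segment uses the target length
        yield 5

# First ten segment lengths: 2, 2, 3, 3, 4, 4, 5, 5, 5, 5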

Lowering startup time by pre-transcoding

During the development and tuning process of the pipeline we saw the startup time drop dramatically from ~15–20 seconds to ~5 seconds, as you can see from the rocky graph below.

Figure 2 – Transcoder startup time

Still, that is not sufficient to provide the “feel instant” experience we wanted for our users, so we revisited the idea of pre-transcoding some of the material and decided to process only the first few seconds of every video. The pre-transcoding cluster is independent of the live transcoding one, so that it does not affect the performance of live traffic, and is hooked up to a pipeline that is triggered on every file upload. The first few seconds of every video are processed and stored in our cache, and the remaining part is generated on demand whenever the user requests it. This approach allows us to transmit the first segments to the client very quickly while the transcoder starts up and seeks to the desired offset. We retain references to all processed material so we can easily implement different retention policies as needed.

Recap

The combination of pre-transcoding, shorter segments at the beginning, and lowered buffering time in the video processing pipeline allowed us to reach our goal of 2-3 seconds startup time for a client on a good connection, providing the desired instant experience. We learned a few things when building the system at scale:

  • pre-transcoding everything would be nice and would make things much easier to implement; however, it’s too expensive at our scale.
  • fast-starting is required to handle video files generated by mobile devices if you want the transcoder to progress while you feed data into it.
  • HLS is a great solution to enable streaming over heterogeneous networks/devices and allows for flexible and creative solutions in the way you structure your output.
  • load balancing is not to be underestimated. It’s a tricky problem and can easily trash your system if done wrong.
  • experimenting with ffmpeg parameters lets you explore the tradeoff between quality and latency that is appropriate for your application.

Improving Dropbox Performance: Retrieving Thumbnails

Posted by Ziga Mahkovec on January 27, 2014

Dropbox brings your photos, videos, documents, and other files to any platform: mobile, web, desktop, or API. Over time, through automatic camera uploads on iOS and Android, you might save thousands of photos, and this presents a performance challenge: photo thumbnails need to be accessible on all devices, instantly.



We pre-generate thumbnails at various resolutions for the different devices at upload time, to reduce the cost of scaling photos at rendering time. But when users are quickly scrolling through many photos, we need to request a large number of thumbnails. Since most platforms have limitations on the number of concurrent requests, the requests might get queued and cause slow render times. We present a solution that allows us to reduce the number of HTTP requests and improve performance on all platforms, without major changes to our serving infrastructure.

Request queuing

Let’s look at this problem in more detail on the web, specifically the Photos tab at www.dropbox.com/photos. Here’s what the Network view in Chrome’s Developer Tools looks like if we were to load every photo thumbnail on the page individually:



You can see that a limited set of images is loaded in parallel, blocking the next set of thumbnails from being loaded. If the latency of fetching each image is high—e.g. for users far away from our datacenters—loading the images can drastically increase the page load time. This waterfall effect is common for web pages loading lots of subresources, since most browsers have a limit of 6 concurrent connections per host name.

A common workaround for web pages is to use domain sharding, spreading resources over multiple domains (in this case photos1.dropbox.com, photos2.dropbox.com, etc.) and thus increasing the number of concurrent requests. However, domain sharding has its downsides—each new domain requires a DNS resolution, a new TCP connection, and SSL handshake—and is also not practical when loading thousands of images and requiring many domains. We saw similar issues on our mobile apps: both iOS and Android have per-host or global limits on the number of concurrent connections.

To solve the problem, we need to reduce the number of HTTP requests. This way we avoid problems with request queueing, make full use of the available connections, and speed up photo rendering.

Measuring performance

Before embarking on any performance improvement, we need to make sure we have all of the instrumentation and measurements in place. This allows us to quantify any improvements, run A/B experiments to evaluate different approaches, and make sure we’re not introducing performance regressions in the future.

For our web application, we use the Navigation Timing API to report back performance metrics. The API allows us to collect detailed metrics using JavaScript, for example DNS resolution time, SSL handshake time, page render time, and page load time:



Similarly, we log detailed timing data from the desktop and mobile clients.

All metrics are reported back to our frontends, stored in log files and imported into Apache Hive for analysis. We log every request with metadata (e.g. the originating country of the request), which allows us to break down the metrics. Hive’s percentile() function is useful to look at the page load time distribution – it’s important to track tail latency in addition to mean. More importantly, the data is fed into dashboards that the development teams use to track how we’re doing over time.

We instrumented our clients to measure how long it takes to load thumbnails. This included both page-level metrics (e.g. page render time) and more targeted metrics measured on the client (e.g. time from sending thumbnail requests to rendering all the thumbnails in the current viewport).

Batching requests

With the instrumentation in place, we set off on improving the thumbnail loading times. The first solution we had in mind was SPDY. SPDY improves on HTTP by allowing multiple multiplexed requests over a single connection. This solves the issue with request queueing and saves on round-trips (a single TCP connection and SSL handshake needs to be established for all the requests). However, we hit a few roadblocks on the way:

  • We use nginx on our frontends. At the time, there was no stable nginx version with SPDY support.
  • We use Amazon ELB for load balancing, and ELB doesn’t support SPDY.
  • For our mobile apps, we didn’t have any SPDY support in the networking stack. While there are open-source SPDY implementations, this would require more work and introduce potentially risky changes to our apps.

Instead of SPDY, we resorted to plain old HTTPS. We used a scheme where clients would send HTTP requests with multiple image urls (batch requests):

GET https://photos.dropbox.com/thumbnails_batch?paths=
        /path/to/thumb0.jpg,/path/to/thumb1.jpg,[...],/path/to/thumbN.jpg

The server sends back a batch response:

HTTP/1.1 200 OK
Cache-Control: public
Content-Encoding: gzip
Content-Type: text/plain
Transfer-Encoding: chunked

1:data:image/jpeg;base64,4AAQ4BQY5FBAYmI4B[...]
0:data:image/jpeg;base64,I8FWC3EAj+4K846AF[...]
3:data:image/jpeg;base64,houN3VmI4BA3+BQA3[...]
2:data:image/jpeg;base64,MH3Gw15u56bHP67jF[...]
[...]

The response is:

  • Batched: we return all the images in a single plain-text response. Each image is on its own line, as a base-64-encoded data URI. Data URIs are required to make batching work with the web code rendering the photos page, since we can no longer just point an <img> src attribute at the response. JavaScript code sends the batch request with AJAX, splits the response, and injects the data URIs directly into <img> src attributes. Base-64 encoding makes it easier to manipulate the response with JavaScript (e.g. splitting the lines). For mobile apps, we need to base64-decode the images before rendering them (a parsing sketch follows this list).
  • Progressive with chunked transfer encoding: on the backend, we fire off thumbnail requests in parallel to read the image data from our storage system. We stream the images back the moment they’re retrieved on the backend, without waiting for the entire response to be ready; this avoids head-of-line blocking, but also means we potentially send the images back out of order. We need to use chunked transfer encoding, since we don’t know the content length of the response ahead of time. We also need to prefix each line with the image index based on the order of request urls, to make sure the client can reorder the responses.
    On the client side, we can start interpreting the response the moment the first line is received. For web code we use progressive XMLHttpRequest; similarly for mobile apps, we simply read the response as it’s streamed down.
  • Compressed: we compress the response with gzip. Base-64 encoding generally introduces 33% overhead, but that overhead goes away after gzip compression: the response ends up no larger than sending the raw image data.
  • Cacheable: we mark the response as cacheable. When clients issue the same request in the future, we can avoid network traffic and serve the response out of cache. This does require us to make sure the batches are consistent however – any change in the request url would bypass the cache and require us to re-issue the network request.
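As promised above, here is a hedged sketch of how a client might parse such a response: each line is "<index>:<data URI>", where the index maps the (possibly out-of-order) image back to its position in the request. The helper below is illustrative, not our production code, and the fetch itself is omitted.

import base64

def parse_thumbnail_batch(response_text: str, num_requested: int):
    thumbnails = [None] * num_requested
    for line in response_text.splitlines():
        if not line:
            continue
        index, data_uri = line.split(":", 1)
        # data URI format: data:image/jpeg;base64,<payload>
        b64_payload = data_uri.split(",", 1)[1]
        thumbnails[int(index)] = base64.b64decode(b64_payload)
    return thumbnails  # raw JPEG bytes, ordered to match the requested paths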

Results

Since the scheme is relatively simple and uses plain HTTPS instead of SPDY, we were able to deploy it on all platforms, and we saw significant performance improvements: a 40% page load time improvement on the web.

However, we don’t see this as a long-term strategy – we’re planning on adding SPDY support to all of our clients and taking care of pipelining at the protocol level. This will simplify the code, give us similar performance improvements, and provide better cacheability (see the note about consistent batches above).

The Dropbox performance team is a small team of engineers focused on instrumentation, metrics and improving performance across Dropbox’s many platforms. If you obsess over making things faster and get excited when graphs point down and to the right, join us!

Outage post-mortem

Posted by Akhil Gupta on January 12, 2014

On Friday evening our service went down during scheduled maintenance. The service was back up and running about three hours later, with core service fully restored by 4:40 PM PT on Sunday.

For the past couple of days, we’ve been working around the clock to restore full access as soon as possible. Though we’ve shared some brief updates along the way, we owe you a detailed explanation of what happened and what we’ve learned.


What happened?

We use thousands of databases to run Dropbox. Each database has one master and two replica machines for redundancy. In addition, we perform full and incremental data backups and store them in a separate environment.

On Friday at 5:30 PM PT, we had a planned maintenance scheduled to upgrade the OS on some of our machines. During this process, the upgrade script checks to make sure there is no active data on the machine before installing the new OS.

A subtle bug in the script caused the command to reinstall a small number of active machines. Unfortunately, some master-replica pairs were impacted which resulted in the site going down.

Your files were never at risk during the outage. These databases do not contain file data. We use them to provide some of our features (for example, photo album sharing, camera uploads, and some API features).

To restore service as fast as possible, we performed the recovery from our backups. We were able to restore most functionality within 3 hours, but the large size of some of our databases slowed recovery, and it took until 4:40 PM PT today for core service to fully return.


What did we learn?

Distributed state verification

Over the past few years our infrastructure has grown rapidly to support hundreds of millions of users. We routinely upgrade and repurpose our machines. When doing so, we run scripts that remotely verify the production state of each machine. In this case, a bug in the script caused the upgrade to run on a handful of machines serving production traffic.

We’ve since added an additional layer of checks that require machines to locally verify their state before executing incoming commands. This enables machines that self-identify as running critical processes to refuse potentially destructive operations.
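As an illustration of the idea (not our actual tooling), a local pre-execution check might look something like the sketch below; the specific checks and paths are assumptions.

import os
import subprocess

def safe_to_reinstall() -> bool:
    # Refuse if a database process is still running on this host.
    mysqld_running = subprocess.call(["pgrep", "-x", "mysqld"],
                                     stdout=subprocess.DEVNULL) == 0
    # Refuse if the data directory still holds live data.
    data_dir = "/var/lib/mysql"
    has_data = os.path.isdir(data_dir) and any(os.scandir(data_dir))
    return not mysqld_running and not has_data

def run_reinstall():
    if not safe_to_reinstall():
        raise RuntimeError("Refusing to reinstall: this machine self-identifies as active")
    # ... proceed with the OS upgrade ...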

Faster disaster recovery

When running infrastructure at large scale, the standard practice of running multiple replicas provides redundancy. However, should those replicas fail, the only option is to restore from backup. The standard tool used to recover MySQL data from backups is slow when dealing with large data sets.

To speed up our recovery, we developed a tool that parallelizes the replay of binary logs. This enables much faster recovery from large MySQL backups. We plan to open source this tool so others can benefit from what we’ve learned.

We know you rely on Dropbox to get things done, and we’re very sorry for the disruption. We wanted to share these technical details to shed some light on what we’re doing in response. Thanks for your patience and support.

Akhil
Head of Infrastructure

Dropbox Status Update

Posted by Dropbox Team on January 11, 2014

UPDATE 1/12 at 7:23pm PT: Dropbox should now be up and running for all of you, but we’re working through a few last issues with the Dropbox photos tab.  More info on our main blog and the latest post here.

UPDATE 1/12 at 1:59pm PT: Hi everyone, we wanted to give an update on where things stand.

As of this morning at 4:10am PT, nearly all users (over 99%) can access their files on dropbox.com. The Photos tab is still turned off, but you can access your photos via the Files tab on dropbox.com or the desktop client. We’re continuing to make a lot of progress restoring full service to all users, and are doing so in careful steps.

About 5% of our users are still experiencing problems syncing from the desktop client, and about 20% of users are having issues accessing Dropbox through our mobile apps. Within a few hours, we’ll be rolling out a change that will further improve things for those users. We’ll give an update after that.

Your files have been safe this entire time. Thanks again for your patience.

UPDATE 1/12 at 8:48am PT: We’re still seeing service issues for a small number of users. We’ve been working through the night to restore full service as soon as possible and we’ll continue until this is complete.

UPDATE 1/11 at 11:16pm PT: We’re continuing to make progress on reducing the number of users experiencing service issues. We’ll keep providing updates here.

UPDATE 1/11 at 6:35pm PT: Dropbox is still experiencing lingering issues from last night’s outage. We’re working hard to get everything back up, and want to give you an update.

No files were lost in the outage, but some users continue to run into problems using various parts of dropbox.com and our mobile apps. We’re rapidly reducing the number of users experiencing these problems, and are making good progress.

We’re also working through some issues specific to photos. In the meantime, we’ve temporarily disabled photo sharing and turned off the Photos tab on dropbox.com for all users. Your photos are safely backed up and accessible from the desktop client and the Files tab on dropbox.com.

We know how much you all rely on Dropbox, and we’re sorry for the trouble. Thanks for your patience — we’ll keep you up to date.

UPDATE 1/11 at 10:24am PT: We’re still experiencing service issues related to the outage last night. We apologize and are working to get the service fully restored as soon as possible.

UPDATE 1/10 at 8:36pm PT: Dropbox site is back up! Claims of leaked user information are a hoax. The outage was caused during internal maintenance. Thanks for your patience!

1/10 at 6:40pm PT: We are aware that the Dropbox site is currently down. This was caused during routine internal maintenance, and was not caused by external factors. We are working to fix this as soon as possible. We apologize for the inconvenience.

Scaling MongoDB at Mailbox

Posted by Cuong Do on September 12, 2013

Mailbox has grown unbelievably quickly. During that growth, one performance issue that impacted us was MongoDB’s database-level write lock. The amount of time Mailbox’s backends were waiting for the write lock was resulting in user-perceived latency.

While MongoDB allows you to add shards to a MongoDB cluster easily, we wanted to spare ourselves potential long-term pain by moving one of the most frequently updated MongoDB collections, which stores email-related data, to its own cluster. We theorized that this would, at a minimum, cut the amount of write lock contention in half. While we could have chosen to scale by adding more shards, we wanted to be able to independently optimize and administer the different types of data separately.

I started by poring through the MongoDB documentation. I quickly found the cloneCollection command. However, to quote the MongoDB 2.2 documentation: “cloneCollection cannot clone a collection through a mongos: you must connect directly to the mongod instance.” In other words, you can’t use this command with a sharded collection. You can’t use renameCollection on sharded collections either, closing off other possibilities. There were other possible solutions, but they all would’ve impacted performance for Mailbox users or would have simply failed to work at Mailbox’s scale.

So, I wrote a quick Python script to copy the data, and another to compare the original versus the copy to ensure data integrity. Along the way, I encountered many surprises. For example, a single Python process using gevent and pymongo can copy a large MongoDB collection in half the time that mongodump (written in C++) takes, even when the MongoDB client and server are on the same machine.

Our experiences have culminated in Hydra, our newly open-sourced set of tools we’ve developed for MongoDB collection migration.

Creating the initial snapshot of a MongoDB collection

To copy all documents in a collection, I started with an intentionally naive implementation that didn’t have much more code than this:

for email_data in source_email_data.find():
    destination_email_data.insert(email_data)

Issue #1: Slowness

It was obvious that such a naive approach wouldn’t perform well for larger amounts of data, so I quickly experimented with different means of achieving faster copy performance. I implemented various micro-optimizations, like adjusting how many documents the MongoDB driver fetched at once. However, those yielded only marginal performance improvements. My goal was to finish the data migration in about a day, and I was still far from that goal.

An early experiment I did was to measure the “speed of light” for MongoDB API operations – the speed of a simple C++ implementation using the MongoDB C++ SDK. Being rusty at C++ and wanting my mostly Python-proficient colleagues to easily be able to use/adapt the code for other uses, I didn’t pursue the C++ implementation too far but found that for simple cases, a naive C++ implementation was typically 5–10 times as fast as a naive Python implementation for the same task.

So, I returned to Python, which is the default language of choice for Dropbox. Moreover, when performing a series of remote network requests, such as queries to mongod, the client often spends much of its time waiting for the server to respond; there didn’t seem to be very many CPU-intensive parts for copy_collection.py (my MongoDB collection copying tool). This was corroborated by the very low CPU usage of the initial copy_collection.py.

I then experimented with adding concurrent MongoDB requests to copy_collection.py. Initial experiments with worker threads resulted in disappointment. Next, I tried using worker processes communicating through a Python Queue object. The performance still wasn’t much better, because the overhead of the IPCs was overwhelming any potential concurrency benefits. Using Pipes and other IPC mechanisms didn’t help much either.

Next, I decided to see how much performance I could squeeze out of a single Python process using asynchronous MongoDB queries. One of the more popular libraries for this is gevent, so I decided to give it a try. gevent patches standard Python modules, such as socket, to execute asynchronously. The beauty of gevent is that you can write asynchronous code that reads simply, like synchronous code.

Traditionally, asynchronous code to copy documents between two collections might have looked like this:

import asynclib

def copy_documents(source_collection, destination_collection, _ids, callback):
    """
    Given a list of _id's (MongoDB's unique identifier field for each document),
    copies the corresponding documents from the source collection to the destination
    collection
    """

    def _copy_documents_callback(...):
        if error_detected():
            callback(error)

    # copy documents, passing a callback function that will handle errors and
    # other notifications
    for _id in _ids:
        copy_document(source_collection, destination_collection, _id,
                      _copy_documents_callback)

    # more error handling omitted for brevity
    callback(None)

def copy_document(source_collection, destination_collection, _id, callback):
    """
    Copies document corresponding to the given _id from the source to the
    destination.
    """
    def _insert_doc(doc):
        """
        callback that takes the document read from the source collection
        and inserts it into destination collection
        """
        if error_detected():
            callback(error)
        destination_collection.insert(doc, callback) # another MongoDB operation

    # find the specified document asynchronously, passing a callback to receive
    # the retrieved data
    source_collection.find_one({'_id': _id}, callback=_insert_doc)

With gevent, the code uses no callbacks and reads sequentially:

import gevent
gevent.monkey.patch_all()

def copy_documents(source_collection, destination_collection, _ids):
    """
    Given a list of _id's (MongoDB's unique identifier field for each document),
    copies the corresponding documents from the source collection to the destination
    collection
    """

    # copies each document using a separate greenlet; optimizations are certainly
    # possible but omitted in this example
    for _id in _ids:
        gevent.spawn(copy_document, source_collection, destination_collection, _id)

def copy_document(source_collection, destination_collection, _id):
    """
    Copies document corresponding to the given _id from the source to the
    destination.
    """
    # both of the following function calls block without gevent; with gevent they
    # simply cede control to another greenlet while waiting for Mongo to respond
    source_doc = source_collection.find_one({'_id': _id})
    destination_collection.insert(source_doc) # another MongoDB operation

This simple code will copy documents from a source MongoDB collection to a destination, based on their _id fields, which are the unique identifiers for each MongoDB document. copy_documents delegates the work of copying documents to greenlets (which are like threads but are cooperatively scheduled) that run copy_document(). When a greenlet performs a blocking operation, such as any request to MongoDB, it yields control to any other greenlet that is ready to execute. Since greenlets all execute in the same thread and process, you generally don’t need any kind of inter-greenlet locking.

With gevent, I was able to achieve much faster performance than either the thread worker pool or process worker pool approaches. Here’s a summary of the performance of each approach:

Approach                     | Performance (higher is better)
single process, no gevent    | 520 documents/sec
thread worker pool           | 652 documents/sec
process worker pool          | 670 documents/sec
single process, with gevent  | 2,381 documents/sec

Combining gevent with worker processes – one for each shard – yielded a linear increase in performance. The key to using worker processes efficiently was to eliminate as much IPC as possible.
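A minimal sketch of that layout, with copy_shard standing in for the real per-shard copy logic and the shard URIs assumed to come from configuration:

import multiprocessing

def copy_shard(shard_uri: str, destination_uri: str) -> None:
    # Each worker patches its own process and runs its own greenlets;
    # no IPC is needed beyond the initial arguments.
    import gevent.monkey
    gevent.monkey.patch_all()
    # ... connect to this shard's mongod, then copy documents with greenlets ...

def copy_all_shards(shard_uris, destination_uri):
    workers = [multiprocessing.Process(target=copy_shard, args=(uri, destination_uri))
               for uri in shard_uris]
    for w in workers:
        w.start()
    for w in workers:
        w.join()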

Somewhat surprisingly, using gevent in just a single process could produce a full copy of a collection in just under half the time taken by the mongodump tool, which is written in C++ but queries synchronously and is single-process/single-threaded.

Issue #2: Replicating updates after the snapshot

Because MongoDB is not transactional, when you try to read a large MongoDB collection while updates are being performed on it, you will receive a result set that reflects MongoDB’s state at different points in time. For example, suppose you start reading a whole collection using a MongoDB find() query. Your result set could look like this:

included: document saved before your find()
included: document saved before your find()
included: document saved before your find()
excluded: document deleted just after your find() began
included: document inserted after your find() began

Moreover, to minimize the downtime required to point the Mailbox backend to the new copy of the collection, it was necessary to figure out a way to stream changes from the source MongoDB cluster to the new MongoDB cluster with as little latency as possible.

Like most asynchronously replicating data stores, MongoDB uses a log of operations – its oplog – to record and distribute a record of the insert/update/remove operations executed on a mongod instance to other mongod replicas. Given a snapshot of the data, the oplog can be used to apply all changes performed since the snapshot was taken.

So, I decided to stream oplog entries from the source cluster and apply those changes at the destination cluster. Thanks to an informative post on Kristina Chodorow’s blog, I was quickly able to grasp the basics of the oplog format. Replicating inserts and removes was trivial, because their serialization format is straightforward. On the other hand, updates took more work.

The structure of update oplog entries was not immediately obvious, and in MongoDB 2.2.x, it uses duplicate keys that can’t be displayed by the Mongo shell, let alone most MongoDB drivers. After some thought, I devised a workaround that simply used the _id embedded in the update to trigger another copy of the document from the source. While this doesn’t have semantics identical to applying just the specified update, it guarantees that the copied data is at least as recent as the op we’ve received. Here is a diagram showing how intermediate versions of documents (in this case, v2) are not necessarily copied, but the source and destination are still eventually consistent:

[Diagram: applying update ops]
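In code, the workaround boils down to something like this (a sketch with assumed collection handles; apply_update_op is a hypothetical name):

# Sketch of the workaround: instead of decoding the update modifier, re-fetch the
# whole document by _id from the source and overwrite the destination's copy.
def apply_update_op(op, source_coll, dest_coll):
    doc_id = op["o2"]["_id"]                 # update oplog entries carry the target _id in "o2"
    doc = source_coll.find_one({"_id": doc_id})
    if doc is not None:
        dest_coll.replace_one({"_id": doc_id}, doc, upsert=True)
    # If the document has since been deleted at the source, the corresponding
    # remove op in the oplog will delete it at the destination when replayed.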

I also ran into a performance issue replaying ops on the destination cluster. Though I had a separate process to replay ops for each shard, applying ops serially (my initial approach for prototyping and ensuring correctness) was far too slow to keep up with the onslaught of Mailbox queries.

Applying ops concurrently seemed to be the way to go, but the question was how to preserve correctness. Specifically, two operations affecting the same _id cannot execute out of order. A simple workaround I devised was to maintain, in a Python set, the set of _ids being modified by in-progress operations. When copy_collection.py encounters another update to an _id that is currently being updated, we block the later update and any other ops that come after it from being applied. We start applying new ops only when the older operation on the _id has finished. Here’s a diagram to illustrate op blocking:

[Diagram: concurrent op replay]
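A stripped-down version of that blocking scheme might look like the following (a sketch with assumed names, not the actual replay code):

from gevent.pool import Pool

# Sketch: apply ops concurrently, but when an op arrives for an _id that is
# already being modified, stop dispatching and drain all in-flight ops first.
def replay_ops(ops, apply_op, concurrency=20):
    pool = Pool(concurrency)
    in_progress = set()

    def run(op, doc_id):
        try:
            apply_op(op)
        finally:
            in_progress.discard(doc_id)

    for op in ops:
        doc_id = (op.get("o2") or op["o"])["_id"]   # "o2" for updates, "o" otherwise
        if doc_id in in_progress:
            pool.join()                             # block this op (and later ones) until it's safe
            in_progress.clear()
        in_progress.add(doc_id)
        pool.spawn(run, op, doc_id)
    pool.join()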

Verifying copied data

Comparing the copied data to the original is normally a straightforward operation. Doing it efficiently also isn’t particularly challenging when you use multiple processes and gevent.

However, doing it while the source and the copy are both being updated requires some thought. At first, I tried just logging warnings whenever compare_collections.py (the tool I wrote to compare two collections) found a data inconsistency in a document that had been recently updated; I could then repeat verification for just those documents later. However, that doesn’t work for deleted documents, which leave behind no last-modified timestamp to check.

I started thinking about the term “eventual consistency,” which is often used when talking about asynchronously replicating systems such as MongoDB’s replica sets and MySQL’s master/slave replication. Given enough time (i.e. after some number of retries) and barring catastrophe, the source and the copy will eventually become consistent. So, I added retry comparisons with an increasing backoff between successive retries. There are potential issues with certain cases, such as data that oscillates between two values. However, the data being migrated didn’t have any problematic update patterns.
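The retry loop itself can be quite small (a minimal sketch with assumed collection handles, not the actual compare_collections.py):

import time

# Re-check a document several times with an increasing backoff before flagging it.
def document_matches(doc_id, source_coll, dest_coll, retries=5, delay=1.0):
    for attempt in range(retries):
        src = source_coll.find_one({"_id": doc_id})
        dst = dest_coll.find_one({"_id": doc_id})
        if src == dst:                      # also covers documents deleted on both sides
            return True
        time.sleep(delay * (2 ** attempt))  # back off and let replication catch up
    return False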

Before performing the final cutover from the original MongoDB cluster to the new MongoDB cluster, I wanted the ability to verify that the most recent ops had been applied. So, I added a command-line option to compare_collections.py to compare the documents modified by the most recent N ops. Running this for a sufficiently large set of ops during downtime would provide additional confidence that there weren’t undetected data inconsistencies. Running it for even hundreds of thousands of ops per shard only takes a few minutes. This also mitigates concerns regarding undetected data inconsistencies resulting from the compare/retry approach.

Handling the unexpected

Despite taking various precautions to handle errors (retries, catching possible exceptions, logging), there were still an uncomfortable number of issues arising during my final test runs leading up to the production migration. There were sporadic network issues, a specific set of documents that was consistently causing mongos to sever its connection from copy_collection.py, and occasional connection resets from mongod.

Soon, I realized that I couldn’t identify all the relevant failure scenarios, so I shifted my focus to recovering quickly from failures. I added logging of the _ids of documents for which compare_collections.py had detected inconsistencies. Then, I created another tool whose sole job was to re-copy the documents with those _ids.

Migration time!

During the production migration, copy_collection.py created an initial snapshot of hundreds of millions of emails and replayed more than a hundred million MongoDB operations. Performing the initial snapshot, building indices, and catching up on replication took about 9 hours – well within the 24-hour goal I had set. I continued to let copy_collection.py replay ops from the source cluster’s oplogs for another day while I used compare_collections.py to verify all copied data three times (for additional safety).

The actual cutover to the new MongoDB cluster happened recently. The MongoDB-related work was very short (a few minutes). During a brief maintenance window, I ran compare_collections.py to compare documents modified by the last 500,000 operations in each shard. After detecting no inconsistencies in the most recently updated data, we ran some smoke tests, pointed the Mailbox backend code to the new cluster, and brought the Mailbox service back up for the public. Our users haven’t reported any issues caused by the cutover. This was a success in my mind, as the best backend migrations are invisible to our users.

In contrast, our backend monitoring showed us the true benefits of the migration:

[Graph: write lock held, before and after the migration]

The decrease in the percentage of time the write lock was held was far better than the linear (50%) improvement we had expected based on our MongoDB profiling. Great success!

Hello, world

We’re open-sourcing Hydra, the suite of tools we developed to perform the aforementioned MongoDB collection migration. We hope this code will be useful for anyone who needs to perform a live re-partitioning of their MongoDB data.

Welcome Guido!

Uncategorized / 89 Comments Posted by Drew Houston on December 07, 2012

Today we’re excited to welcome a new member of the Dropbox family under unusual circumstances. Though he’s joining us now, his contributions to Dropbox date back to day one, all the way to the very first lines of code.

Some people only need to be introduced by their first name, and the BDFL is one of them. Dropbox is thrilled to welcome Guido, the creator of the Python programming language and a long-time friend of ours.

From the beginning, it was clear that Dropbox had to support every major operating system. Historically, doing so presented a serious challenge for developers: because each platform required different development tools and programming languages, developers had to write the same code multiple times.

We didn’t have time for that, and fortunately Python came to the rescue. Several years earlier, Python became my favorite programming language because it had a balance of simplicity, flexibility, and elegance. These qualities of Python, and the community’s work to support every major platform, let us write the code just once before running it everywhere. They have also influenced our greater design philosophy at Dropbox as we set out to build a simple product that brings your life together.

It’s been five years since our first prototype was saved as dropbox.py, and Guido and the Python community have been crucial in helping us solve interesting challenges for more than 100 million people.

So we welcome Guido to Dropbox with admiration and gratitude. Guido inspires all of us and has played a critical part in how Dropbox ties together the products, devices and services in your life. We’re delighted to have him as part of the team.

Caching in theory and practice

Uncategorized / 17 Comments Posted by Pavel Panchekha on October 16, 2012


Hello, my name is Pavel Panchekha. I was an intern at Dropbox back in ’11, and one of the things I investigated was caching algorithms. The Dropbox mobile client caches frequently-accessed files, so that viewing them doesn’t require a network call. Both our Android and iOS clients use the LRU caching algorithm, which often selects good files to cache. But while this is the usual algorithm for caching, I wondered: are there better algorithms, and if not, why is LRU the best option?

Caches

Let’s formalize the problem. We have a large set of files, and we’d like an algorithm to determine which \(k\) files to keep at any point. We’re assuming all files have the same size. In fact, Dropbox stores files in 4MB blocks, so this simplification isn’t too far off (we’ll see later how to avoid it). To determine which files to keep, every time we want to use a file, we tell the cache to fetch it for us. If the cache has a file, it will give us its copy; if not, the cache has to fetch the new file, and it might also want to remove a file from its cache to make room for the new file.

Note also that we’re stuck with an on-line algorithm: we can’t predict what files a user will want in the future.

The cache needs to be fast, along two metrics. First, the cache should ensure that as many file requests as possible are served from it (cache hits) rather than over the network (cache misses). Second, the overhead of using a cache should be small: testing membership and deciding when to replace a file should be as fast as possible. Maximizing cache hits is the goal of the first part of this post; quickly implementing the cache will be the topic of the second part.

Competitive Analysis

How do we measure the worst case number of cache misses? Unlike a normal algorithm, our runtime is driven by the user’s actions. So our worst-case performance corresponds to our worst-case user: one who maximally breaks our cache at every step.

But a pure adversarial analysis won’t work, since a user can always make our cache perform badly by just requesting lots of files – eventually, some of them won’t be cached.

The key is to compare how well our algorithm performs with how well our algorithm could possibly perform. We need some benchmark. Zero cache misses is a lower bound but is usually impossible. So instead, let’s compare our algorithm with one that can “plan ahead” perfectly: let’s compare our algorithm – which at any point only has the requests from the past – with some sort of optimal magic “future-seeing” algorithm.

More specifically, we’re going to find the ratio of cache misses from our algorithm to the number of cache misses for the optimal algorithm. And then we’re going to try to minimize this ratio across all possible sequences of file requests from the user. Generally, we’ll argue that the algorithm we’re analyzing will have at most \(A\) misses during any particular sequence of instructions, during which the optimal algorithm must have at least \(O\) misses; thus the “competitive ratio” is at most \(A / O\). This type of analysis is called competitive analysis.
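Stated symbolically – with \(\mathrm{cost}_{\mathcal{A}}(\sigma)\) counting the cache misses algorithm \(\mathcal{A}\) incurs on a request sequence \(\sigma\), and \(\mathrm{OPT}\) the optimal “future-seeing” algorithm – one common way to write the definition is:

\[
\mathcal{A} \text{ is } c\text{-competitive} \iff \mathrm{cost}_{\mathcal{A}}(\sigma) \le c \cdot \mathrm{cost}_{\mathrm{OPT}}(\sigma) \text{ for every request sequence } \sigma.
\]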

In general, our method will be to pick a sequence that a chosen algorithm performs very poorly on. We find how many cache misses, \(A\), that algorithm sees for that sequence of requests. Usually, we’ll be able to calculate \(A\) precisely. Then, we’ll try to think up the cleverest possible way to cache files for that specific sequence; the number of cache misses we see we’ll call \(O\). We’ll usually find some possible way of caching files and calculate the number of cache misses for that, so we’ll get an upper bound on \(O\). The competitive ratio is \(A / O\), and since we had an upper bound on \(O\), we get a lower bound on the competitive ratio. Furthermore, our algorithm could perform even worse on a different sequence, so \(A / O\) is definitely a lower bound. This lets us say that some algorithm is really bad, but doesn’t let us say that some algorithm is really good. We’ll also prove some upper bounds on the competitive ratio, which will let us claim that some algorithms are optimal. Together, these will give us a way to compare caching algorithms.

Caching Algorithms

Before we go ahead to analyze a bunch of caching algorithms, we need caching algorithms to analyze. So let’s quickly list a bunch of popular ones:

Most Recently Used
When we need to get rid of a file, we trash the one we most recently accessed. This algorithm incorporates information about how often a file is accessed in a perverse way – it prefers to keep around old data that is rarely used instead of data that is frequently accessed. But if you use many files, without using the same files over and over again (such as, say, when viewing a photo gallery), this algorithm works very well, since you’re kicking out files you’re unlikely to see again. In effect, browsing through a complete photo gallery can take up only one “slot” in the cache, since on each access you kick out the previous photo in that gallery.
Least Recently Used
When we need to get rid of a file, we get rid of the one we haven’t used in the longest time. This only requires keeping the access order of the files in the cache. By keeping files that we’ve recently accessed, it too tends to keep around files used more often; on the other hand, if a user’s interest changes, the entire cache can relatively quickly become tuned to the new interests. But this cache tends to work poorly if certain files are accessed every once in a while, consistently, while others are accessed very frequently for a short while and never again.
Least Frequently Used
When we need to get rid of a file, we get rid of the one that is least frequently used. This requires keeping a counter on each file, stating how many times it’s been accessed. If a file is accessed a lot for a while, then is no longer useful, it will stick around, so this algorithm probably does poorly if access patterns change. On the other hand, if usage patterns stay stable, it’ll (we hope) do well.

This is a nice batch of the more common and simple caching algorithms, so let’s look at how we perform using the competitive analysis above.

The Most Recently Used and the Least Recently Used Algorithms

We can easily construct sequences to stump the Most Recently Used algorithm. For example, consider the sequence of file accesses \(1, 2, \dots, k, k+1, k, k+1, k, \dots\). Most Recently Used will kick out \(k\) to make room for \(k+1\), then kick out \(k+1\) to make room for \(k\), and so on. It will have a cache miss on every file lookup for this sequence of files. And it’s so easy to do better: an optimal algorithm might, for example, kick out \(1\) to make room for \(k+1\), and never have a cache miss after that (since both \(k\) and \(k+1\) are in the cache after that). The optimal algorithm sees at most \(k+1\) cache misses, while Most Recently Used sees \(N\) cache misses, making it \((N / (k+1))\)-competitive. Since we can make \(N\) as large as we want, this can be arbitrarily large – we might call the Most Recently Used algorithm \(\infty\)-competitive. So, really bad.

The Least Recently Used algorithm is better. For example, on that input sequence, it does precisely what the optimal algorithm might do. But it still doesn’t do that well. Imagine if our sequence of requests is for files \(1, 2, \dots, k, k+1, 1, 2, \dots, k, k+1, \dots\). The Least Recently Used algorithm will miss every time, since for every request, the file requested was just kicked out. And the optimal algorithm can always just swap out the most-recently-requested file. So first, it would fail to find \(k+1\) in the cache and replace \(k\) with it. Then it would fail to find \(k\) and replace \(k-1\) with it. Then \(k-1\) with \(k-2\), and so on. This yields one cache miss every \(k\) requests; so if there are \(N\) requests total, the optimal algorithm would face \(k + \frac{N}{k}\) failures (the “k” for populating the cache with \(1, \dots, k\)), while the Least Recently Used algorithm would face \(N\) failures. Thus the Least Recently Used algorithm is at best \(N / (k + (N / k))\)-competitive, which for large \(N\) works out to be \(k\)-competitive.

This doesn’t show that the Least Recently Used algorithm is \(k\)-competitive; it tells us that the Least Recently Used algorithm isn’t better than \(k\)-competitive. But with a bit more effort, we can prove that the Least Recently Used algorithm is precisely \(k\)-competitive.

To do that, we’ll have to make an argument about all possible input sequences. The core of the proof is to look at what must happen for LRU to fail. If the Least Recently Used algorithm has \(k+1\) cache misses, it must be because \(k+1\) new files were requested. But if this happens, at least one of those files wasn’t in the cache that the optimal algorithm had (since it, too, can only cache \(k\) files).

To capture this property precisely, let’s divide the sequence of requests into phases – during each phase, only some \(k\) specific files are requested. LRU can miss at most once on each of those files before they are all in the cache – at most \(k\) misses per phase. Meanwhile, the optimal algorithm misses at least once per phase, since at least one of those files isn’t yet in its cache (if they all were, we would never have ended the previous phase). Combined with the lower bound above, this means LRU is precisely \(k\)-competitive.
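In symbols, for any single phase:

\[
\mathrm{misses}_{\mathrm{LRU}} \le k
\quad\text{and}\quad
\mathrm{misses}_{\mathrm{OPT}} \ge 1,
\quad\text{so}\quad
\frac{\mathrm{misses}_{\mathrm{LRU}}}{\mathrm{misses}_{\mathrm{OPT}}} \le k.
\]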

This doesn’t sound that good. In a way, the larger our cache, the less impressively LRU performs. But in fact, our argument that Least Recently Used is \(k\)-competitive is applicable to any algorithm for which we can predict what files it will cache. So while \(k\) times worse than perfect seems pretty poor, it is in fact the best we can do (unless we use randomized algorithms; I’ll discuss why not to do that in a bit).

LRU only made use of very basic timing information. A smarter algorithm, you might imagine, might actually maintain some popularity information: which files you use often, and which more rarely. Does it do any better?

Least Frequently Used

It seems that Least Frequently Used should do much better than LRU, since it incorporates actual information about how popular various files are, instead of just timing information.

But let’s do the analysis proper, just in case. To make LFU perform poorly, we’d need to make it keep switching between two files, each time kicking one out to make room for the other. This might happen if we use, say, files \(1, 2, \dots, k-1\) very frequently, and files \(k\) and \(k+1\) infrequently, but equally so. If we just request \(1, \dots, k-1\) once or twice and then alternate between \(k\) and \(k+1\), this isn’t too much of a problem, since eventually both \(k\) and \(k+1\) will be more frequently used than \(1, \dots, k-1\) and will both be cached. But if we first use \(1, \dots, k-1\) a bunch, so that neither \(k\) nor \(k+1\) are ever more popular than them, we can create a lot of cache misses. What we are setting up is a case where usage patterns change. First, we used \(1\) through \(k-1\) a lot, and then we changed to using \(k\) and \(k+1\) a lot.

An example of such a sequence requests \(1, \dots, k-1\) \(N\) times, and then alternates between \(k\) and \(k+1\) \(N\) times. Both \(k\) and \(k+1\) are less frequently used than any of \(1, \dots, k-1\), so each is kicked out to make room for the other. This leads to \(k-1\) cache misses to load \(1\) through \(k-1\) into the cache, and then \(2N\) cache misses for the requests to \(k\) and \(k+1\). On the other hand, the optimal algorithm could do so much better. For example, it could kick out \(1\) to make room for \(k+1\) when it stops being used, leading to \(k+1\) total cache misses. So the LFU algorithm had \(2N + k - 1\) misses, and the optimal algorithm had \(k + 1\). The quotient of these can be made arbitrarily large by increasing \(N\), so LFU can be arbitrarily bad.

This result is curious. LFU semi-intelligently made use of popularity data, but fared so much worse than LRU, which just made use of basic timing data. But, the cases that make LFU perform poorly are relatively real-world. For example, suppose you have a large project that you’re working on, and then you finish said project and no longer access those files. Your cache would be storing those old files instead of the new ones you’re using. So our analysis told us something surprising: that LFU, which looked so promising, could actually be absurdly bad, in perhaps real-world situations.

In fact, if you think about it, LRU does make some use of popularity information. If a file is popular enough to be used more often than once every \(k\) times, it will always be in an LRU cache. But by forgetting any information more than \(k\) files ago, the LRU algorithm prevents really old files from taking precedence over new ones.

So, Why LRU?

You’d think it’d be possible to combine the best of LRU and LFU to make an algorithm that performs better than either. Turns out, yes and no.

When we proved LRU no better than \(k\)-competitive, we chose a sequence where the next file requested was always the file not in the cache. But we can do this for any deterministic algorithm! This means that the worst-case behavior of any deterministic algorithm is guaranteed to be no better than \(k\)-competitive.

But in a practical sense, better algorithms do exist. For reference, the ARC [1] and CAR [2] algorithms tend to outperform Least Recently Used caches. Each has the same worst-case behavior as the Least Recently Used algorithm, but they manage to trade off between frequent and recent items in a way that often leads to better performance in practice. Of course, both are more complex than the Least Recently Used algorithm.

We can get around the theoretical deficiencies of deterministic algorithms – that the user can predict which files aren’t in the cache and thus keep requesting those – by having our algorithm make partially-random choices. This will make it harder for users to hit the worst case, but it often makes the algorithm perform worse in practice. The best a randomized algorithm can do is \(O(\log k)\) (in fact, approximately the natural log of \(k\)); see Fiat et al. [3]. Randomized caching algorithms have the downside of behaving in unexpected ways for the user – “Why is that file taking so long to open, I just looked at it!”. So in practice, they’re rarely used.

Tangent: while randomized algorithms cannot be used directly in practice, they do tell us something about the expected performance of deterministic algorithms. This comes from a beautiful theorem by John von Neumann, called the Minimax Theorem. Imagine that the algorithm designer and his adversary play a game: the designer chooses a caching algorithm, the adversary a sequence of files, and then the winnings are decided based on how many cache misses the cache had. Phrased this way, algorithm design falls under the purview of game theory. We can represent a randomized algorithm as a strategy that involves choosing an algorithm at random from some set, and we can represent a randomized sequence of files as a random choice from a set of possible sequences.

Continuing the tangent, let’s consider what the Minimax Theorem tells us about this game. The Minimax Theorem tells us that there exists an equilibrium strategy, where the worst-case winnings for each player are maximized. Since they’re the worst-case winnings for each player, they’re minimum winnings, so we have a minimum maximized – hence the theorem’s name. Such an equilibrium strategy might be a randomized strategy. In fact, since randomized algorithms can deliver guaranteed \(O(\log k)\) performance, better than any deterministic algorithm, we might suppose that the maximum worst-case winnings for the adversary are at most \(O(\log k)\). Similarly, the adversary will likely want to play some manner of randomized input sequence, since otherwise there would be added structure for a cache to possibly extract.

Still on tangent, note that if the algorithm designer is committed to a randomized algorithm, there may be no reason to play a randomized input sequence. This is a consequence of the second part of the Minimax Theorem (which, sadly, is not as well-known): if one player is committed to a strategy, there is an optimal, deterministic response, which attains results at least as good as those from the equilibrium strategy. In particular, if the randomized algorithm being used is well-known, there must be a sequence of inputs that has the most expected cache misses; but this can’t take longer than with a randomized input sequence (otherwise, we would have chosen this deterministic sequence as our “randomized” one). But we can turn this around: if the input sequence is pre-chosen, there is an optimal deterministic response. This option better describes the usual human user, who will not actively try to thwart the Dropbox caching algorithm, but simply accesses files in a normal fashion. In this case, the sequence of files is random and pre-determined, so there is an optimal deterministic response. And the expected number of cache misses from such is at most \(O(\log k)\). So a good deterministic algorithm, while it has a worst-case competitiveness of \(O(k)\), may have an expected competitiveness of at most \(O(\log k)\). And, in fact, LRU is one of these good deterministic algorithms.

Another way to convince yourself that the \(k\)-competitiveness of LRU is not that bad is to compare an LRU cache not with an optimal cache of the same size, but with an optimal but smaller cache. In this case, you can prove a better result. For example, an LRU cache is at most twice as bad as an optimal cache half its size. Compared to an optimal cache of 100 files, an LRU cache for 200 files is at most twice as bad.

Overall, the caching algorithm you want to use is usually LRU, since it is theoretically very good and in practice both simple and efficient. For example, the Dropbox iOS and Android clients both use LRU caches. The Linux kernel uses a variant called segmented LRU.

On to some code.

Caching Algorithms in Practice

Our LRU implementation needs to do two things quickly. It needs to look up each cached file quickly, and it needs to know which files are most and least recent. The lookup suggests a hash table; maintaining recency suggests a doubly-linked list; then each step can be done in constant time. The hash table maps each key to its file’s node in the list, which we can then move around. Here goes.

class DoubleLinkedNode:
    def __init__(self, prev, key, item, next):
        self.prev = prev
        self.key = key
        self.item = item
        self.next = next

class LRUCache:
    """ An LRU cache of a given size caching calls to a given function """

    def __init__(self, size, if_missing):
        """
        Create an LRUCache given a size and a function to call for missing keys
        """

        self.size = size
        self.slow_lookup = if_missing
        self.hash = {}
        self.list_front = None
        self.list_end = None

Our constructor (__init__) takes a size for the cache and a function to call to compute an item that isn’t in the cache. This could, for example, request a file over the network. After that we set up a hash table and a doubly-linked list for files.

To look up an item, we either find it in the cache already (that is, in the hash table) or we insert it in. We’ll distinguish between these cases and call out to helpers.

def get(self, key):
    """ Get the value associated with a certain key from the cache """

    if key in self.hash:
        return self.from_cache(key)
    else:
        new_item = self.slow_lookup(key)

        if len(self.hash) >= self.size:
            self.kick_item()

        self.insert_item(key, new_item)
        return new_item

To look up an item that’s already in the cache, we just need to move its node in the list to the front of the list.

def from_cache(self, key):
    """ Look up a key known to be in the cache. """

    node = self.hash[key]
    assert node.key == key, "Node for LRU key has different key"

    if node.prev is None:
        # it's already in front
        pass
    else:
        # Link the nodes around it to each other
        node.prev.next = node.next
        if node.next is not None:
            node.next.prev = node.prev
        else: # Node was at the list_end
            self.list_end = node.prev

        # Link the node to the front
        node.next = self.list_front
        self.list_front.prev = node
        node.prev = None
        self.list_front = node

    return node.item

To kick an item, we need only take the node at the end of the list (the one that’s least recently used) and remove it.

def kick_item(self):
    """ Kick an item from the cache, making room for a new item """

    last = self.list_end
    if last is None: # Same error as [].pop()
        raise IndexError("Can't kick item from empty cache")

    # Unlink from list
    self.list_end = last.prev
    if last.prev is not None:
        last.prev.next = None
    else: # The cache is now empty
        self.list_front = None

    # Delete from hash table
    del self.hash[last.key]
    last.prev = last.next = None # For GC purposes

Finally, to add an item, we can just link it to the front of the list and add it to the hash table.

def insert_item(self, key, item):
    """ Insert a new item at the front of the list and into the hash table """

    node = DoubleLinkedNode(None, key, item, None)

    # Link node into place
    node.next = self.list_front
    if self.list_front is not None:
        self.list_front.prev = node
    else: # The list was empty, so this node is also the end
        self.list_end = node
    self.list_front = node

    # Add to hash table
    self.hash[key] = node

There it is, a working, \(k\)-competitive, LRU cache.
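As a quick sanity check, here is a hypothetical usage sketch (the lookup function below is just a stand-in for a real network fetch):

def slow_lookup(key):
    return "contents of %s" % key   # stand-in for fetching the file over the network

cache = LRUCache(2, slow_lookup)
cache.get("a.jpg")   # miss: calls slow_lookup and caches the result
cache.get("a.jpg")   # hit: served straight from the cache
cache.get("b.jpg")   # miss: cache now holds a.jpg and b.jpg
cache.get("c.jpg")   # miss: kicks a.jpg, the least recently used file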

Loose Ends

You’ll note that we’ve been assuming so far that all files are the same size. But in practice, this is of course untrue. How do we deal with bigger and smaller files? Well, it turns out, Dropbox naturally subdivides files into blocks (4MB each, in fact). So instead of caching particular files, we can cache particular blocks, which are close enough in size that the Least Recently Used algorithm above works. Equivalently, we can just kick out files until there is enough room for whatever file we want to load, as sketched below.
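For instance, a size-aware eviction loop might look like this (a sketch using assumed byte-tracking attributes that the LRUCache above doesn’t actually have):

def make_room(cache, incoming_size):
    # Evict least recently used entries until the new file fits in the byte budget.
    while cache.used_bytes + incoming_size > cache.capacity_bytes and cache.list_end is not None:
        cache.kick_item()   # a size-aware kick_item would also decrement cache.used_bytes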

Another problem that a real-world cache needs to solve is the issue of cache invalidation – that is, since the files we are caching can change on the server, how do we tell that our cache is out of date? A simple way is to always download an index, which tells you the file’s revision number, but not the file data itself. You can do this on a per-directory basis, so that it’s not too much data by itself. Then every time you find a file in the cache, you simply check when your copy was last modified and when the server’s copy was last modified. This lets you know whether to renew your copy. Going even further, you can cache these indices for each directory, and use the same logic to determine whether they need to be downloaded again. This is what the Android and iOS clients do.
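In sketch form (every server and cache method name here is an assumption, not the actual client API), the per-file check looks something like this:

from os.path import basename, dirname

def get_fresh(cache, server, path):
    # Ask the server for the directory's index, which maps file names to revisions.
    server_rev = server.get_index(dirname(path))[basename(path)]["rev"]
    entry = cache.lookup(path)               # hypothetical: returns None on a miss
    if entry is None or entry["rev"] != server_rev:
        entry = {"rev": server_rev, "data": server.download(path)}
        cache.store(path, entry)             # hypothetical insert that may evict old entries
    return entry["data"]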

Conclusions

Caches can be used in front of any slow part of your application — communication over a network, reads from disk, or time-intensive computation. Caching is especially important in mobile programs, where network communication is often both necessary and costly, so it’s good to know the theory and do it right. Luckily, the best solution for caching problems is usually the Least Recently Used algorithm, which is both efficient and simple to implement.

Thanks to Dan Wheeler, Tido the Great, Aston Motes, Albert Ni, Jon Ying, and Rian Hunter for proofreading.

Footnotes:

[1] N. Megiddo & D. Modha (2003), “ARC: A Self-Tuning, Low Overhead Replacement Cache”.

[2] S. Bansal & D. Modha (2004), “CAR: Clock with Adaptive Replacement”.

[3] A. Fiat, R. Karp, M. Luby, M. McGeoch, D. Sleator & N. Young (1991), “Competitive paging algorithms”.

Comtypes: How Dropbox learned to stop worrying and love the COM

Uncategorized / 18 Comments Posted by Alicia Chen on October 04, 2012

Here at Dropbox, we often use Python in uncommon ways. Today, I’ll be writing about a module that few Python users have even heard of before—comtypes. Comtypes is built on top of ctypes and allows access to low level Windows APIs that use COM. This module allows you to write COM-compatible code using only Python. For example, the Dropbox desktop client feature that allows you to upload photos from a camera uses comtypes to access Windows Autoplay. But before we talk about comtypes, we have to talk about COM.

What is COM?

The Component Object Model is a standard introduced by Microsoft back in 1993. It allows two software components to interact without either one having knowledge of how the other is implemented, even if the components are written in different languages, running in different processes, or running on different machines and different platforms. Many Windows APIs still rely on COM, and occasionally, we have to work with one of them. The camera upload feature we released this year runs on Windows XP, an OS from 10 years ago, as well as Windows 8, an OS that hasn’t been released yet. And it does all this using a standard that was created almost 20 years ago.

On Windows, COM is both a standard and a service. It provides all the systems and utilities necessary to make inter-component compatibility possible. The standard requires interfaces to be compiled into a binary format that is language agnostic. For this purpose, it includes a specification for its own interface language—the Microsoft Interface Definition Language, aka MIDL—which is compiled into a binary called a type library that is then included inside a runnable such as a .dll or .exe. COM also allows run-time querying of supported interfaces, so that two objects can agree on an interface much like strangers meeting in a foreign country—”Do you speak IHardwareEventHandler version 2? No? Well, parlez-vous version 1?” On top of that, it also provides for object reference counting, inter-process marshalling, thread handling, and much more. Without this functionality, a component implementer would have to make sure objects used by a different process are cleaned up eventually, but not while they’re still being referenced. She’d have to serialize arguments to pass between components, and figure out how to control access from multi-threaded components into objects that may or may not be thread-safe.

COM handles these things for you, but this functionality comes at a cost. It involves a fair amount of complexity, which is unfortunately necessary, and a lot of syntactic convolution, which is just plain unfortunate. Writing an object that uses a COM component, aka a COM client, is difficult. Writing a COM object that other components can use, aka a COM server, can be downright devilish.

If you think COM seems like magic, you’re right—it is definitely some sort of black magic. COM requires incantations such as

CoRegisterClassObject(
    {0x005A3A96, 0xBAC4, 0x4B0A, {0x94, 0xEA, 0xC0, 0xCE, 0x10, 0x0E, 0xA7, 0x36}},
    NULL,
    CLSCTX_INPROC_SERVER | CLSCTX_LOCAL_SERVER,
    REGCLS_MULTIPLEUSE,
    &lpdwRegister
    );

and the use of strange ritual equipment like MIDL compilers. To ensure unambiguity in class and interface identification, everything is referenced by GUIDs, which are undescriptive and unwieldy at best. And when creating a COM server, the sheer number of configuration options at every step of the way can be paralyzing. You have to answer questions such as “Are my threads running in Single Threaded Apartments or Multithreaded Apartments?” and “What does it mean to set my ThreadingModel to Both instead of Free?” Understanding these questions requires a lot of COM-specific background knowledge, and most articles about these choices are pages long, and often involve charts, diagrams, and sample code.

I came to Dropbox with enough knowledge of COM to squeak by, and still consider myself no more than an advanced novice. If one were to attempt to write pure Python code that used or, heaven forbid, implemented a COM object, one would need to generate and parse the binary type library files that specify COM interfaces, perform all the complex Windows registry rituals, track all the reference counts to COM objects, as well as correctly write the endless syntactical mumbo jumbo. Fortunately for us, the comtypes module exists to abstract (almost) all of this horribleness away from us.

Comtypes

If COM is black magic, then comtypes is the mysterious witch doctor service that you contract to perform the black magic for you. For simple tasks, everything likely works fine. Unfortunately, if you need to do anything very complex, you run the risk of being left in the dark as to what sort of invocations were performed, only to find the demon knocking on your door.

Still, comtypes makes life much easier. When it works well, you can simply feed comtypes the path to the dll or exe of the object you’re trying to use, and then write pretty straightforward code like

device_obj = CreateObject("PortableDevice.PortableDevice", IPortableDevice)
contents = device_obj.Content()
for item in contents:
    print item

This (slightly simplified) sample code allows access to the contents of a camera attached to the computer. Here, device_obj is a Python wrapper around a COM object that is actually implemented elsewhere on the system, one that represents a camera we can interface with. Underneath, comtypes will be busy CoCreateInstancing, QueryInterfacing, and wrapping ctypes objects with Python objects for your ease of use. There’s usually no need for you to worry about the million things that are going on underneath. But unfortunately, things don’t work smoothly all the time, so what kind of hackers would we be if we didn’t open it up to see how it works?

Automatic code generation: GetModule

The magic begins in the comtypes.client module. The handy helper function GetModule will take a binary file like a .tlb or .exe, extract the binary data, and automatically generate Python code that, like a header file, specifies all the interfaces, methods, and structs that you need to use a particular COM object. Anyone who’s worked with the Windows API might be familiar with the rabbit-hole of struct and type declarations. A FORMATETC struct, for example, is one that is used in drag and drop APIs. It is declared as two mystery types followed by three 32-bit ints. Further digging will reveal that one of the unknown types is an enum, but the other is another struct involving more unknown types. For it to be usable in Python, you have to break things down into known types without the benefit of importing hundreds of Windows headers. GetModule will do all of these things for you, but the generated code hides the interesting part, the wrapper classes that actually proxy to the real COM objects underneath. A rough sketch of the workflow appears below; after that, it’s time to dig a little deeper.
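Here’s roughly what that workflow looks like (a hypothetical sketch: the type library and generated module names are assumptions, not taken from the Dropbox code):

from comtypes.client import GetModule, CreateObject

# Generate (and cache) a Python module under comtypes.gen from the type library.
GetModule("portabledeviceapi.dll")                      # library name: assumption
from comtypes.gen import PortableDeviceApiLib as wpd    # generated module name: assumption

device_obj = CreateObject("PortableDevice.PortableDevice",
                          interface=wpd.IPortableDevice)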

A metaclass in the wild

This brings us to the comtypes class IUnknown, the root of all evil (no joke—check __init__.py:979). In COM, IUnknown is the grandfather of all interfaces, the interface which all other interfaces inherit from. It contains only three functions:

// QueryInterface returns a pointer to the interface you are querying for,
// or an error if the object does not implement it
int QueryInterface(InterfaceID refiid, void** ppObjectOut);

// These methods are used for reference counting
int AddRef();
int Release();

In comtypes land, IUnknown is actually a base class. For any COM interface that you intend to call into—IPortableDevice for example—you must create a class that inherits from IUnknown. All the methods in the interface are declared in the variable _methods_ as tuples, specifying function name, types and names of args and return values.

class IPortableDevice(IUnknown):
   _iid_ = GUID('{625e2df8-6392-4cf0-9ad1-3cfa5f17775c}')
   _methods_ = [
       COMMETHOD([], HRESULT, 'Open',
           ( ['in'], LPWSTR, 'pszPnpDeviceID' ),
           ( ['in'], POINTER(IPortableDeviceValues), 'pClientInfo')),
       COMMETHOD([], HRESULT, 'Content',
           ( ['out'], POINTER(POINTER(IPortableDeviceContent)), 'ppContent'))
   ]

The class IUnknown itself is actually pretty simple. The magic happens in its metaclass _cominterface_meta, which turns these tuples into bound methods. For the uninitiated, all Python classes are actually objects, and metaclasses are things that make classes. When you declare a class like IPortableDevice that inherits from IUnknown, the metaclass of IUnknown takes COMMETHODs declared above and creates two bound methods: IPortableDevice.Open, which takes in two parameters and returns nothing; and IPortableDevice.Content, which takes no parameters and returns one. These wrapper methods check that calls are made with the appropriate number and type of inputs, a necessity when communicating between untyped, flying-by-the-seat-of-your-pants Python and statically typed, compiled languages like C++. The wrappers then proxy the method calls into an actual COM object that was instantiated under the covers, wrap the return values in Python types, and return them to you, transforming returned error codes into Python exceptions along the way. It’s wonderful, except when it doesn’t work exactly as intended.

The most painful such incident brought development to a dead halt for two days. The only symptom was that the program would occasionally crash after reading a bunch of images from a camera. The bug was non-deterministic and no exception was generated. After endless hours of printing and prodding, I finally found the root cause of the problem. In COM, the implementer of an interface typically does AddRef on the object when he creates it so that it is “born” with a reference which is passed to the caller of CreateObject, while the user of the interface is responsible for calling Release when he is done with it. Additional calls to AddRef and Release are only necessary if the user makes copies of the reference to the object. So in comtypes, the __init__ method of a comtypes object does not call AddRef on the COM interface, but deletion does call Release. This in itself is only passingly strange, because it usually works.

However, the clever wrapping of COM objects sometimes results in comtypes objects being created unexpectedly, and then also deleted unexpectedly. For example, in the following code

# The following is equivalent to the C code
#   IDeviceItem* idevice_item_array = IDeviceItem[10];
#   device->GetItems(10, &idevice_item_array);
idevice_item_array = (pointer(IDeviceItem) * 10)()
device_obj.GetItems(10, idevice_item_array)
device_item = idevice_item_array[0]

you would expect device_item to be a pointer to an IDeviceItem. Normally, you would have to “dereference” the pointer to get to the item itself, as in

# In ctypes, pointer.contents refers to the target of the pointer
device_item = idevice_item_array[0].contents
device_item.do_something()

However, when you index the array, comtypes helpfully transforms idevice_item_array[0] from type pointer(IDeviceItem) to type IDeviceItem, so instead we have

# idevice_item_array[0] is already of type IDeviceItem??
idevice_item_array[0].do_something()

In the process, it unexpectedly creates an instance of IDeviceItem. More importantly, it unexpectedly destroys an instance of IDeviceItem. So, the following code

for i in range(100):
    print idevice_item_array[0]

actually crashes Python because it creates and destroys a hundred IDeviceItem objects, resulting in a hundred calls to Release on the real COM object. After the first Release call, the COM object is considered deleted. Whenever the garbage collector for that COM object is triggered, everything explodes.

The workaround? Save your reference into a Python object and keep it around. Don’t index your COM object arrays more than once.

device_item = idevice_item_array[0]
for i in xrange(100):
    print device_item

This results in exactly one call to Release, which occurs when the Python object device_item is destroyed.

All of this happened within my first few months at Dropbox, and I barely spoke Python at the time. I learned what a metaclass was before I had fully mastered list slicing syntax. Meanwhile, my counterpart on the Mac camera uploads side was not having the easiest time either. Without the benefit of a compatibility enforcer like COM, he was trying to ferret out why an OS X library was trying to execute Dropbox code as PowerPC assembly on an X86 machine (it’s complicated—the explanatory comment is about fifty lines long). It made me feel a little bit better about dealing with incorrect vtable pointers and bad reference counting.

Discovering comtypes was an integral part of the development of the photo feature, and it certainly presented enough excitement to be considered an adventure. In the end, for all the problems I encountered, having comtypes made it much easier to access COM APIs. Reference counting bugs are the price you pay when you work with low-level code, but it certainly would have been much more work to write our own Python wrappers around COM. COM may be difficult to use, and comtypes occasionally frustrating, but with a working knowledge of how to use the first and how to work around unexpected pitfalls in the second, we can plow ahead with future Windows features. In fact, not long after we released the camera feature, we found ourselves again needing to interact with a COM component. With all this knowledge under my hat, I made myself a COM client in no time. It was a glorious victory.