Gradual Epiphany

Thoughts on Dynamo's "Flawed Architecture"

In general, I think it’s a little inflammatory to make sweeping statements about the fitness of a given architecture. Every architecture has its flaws; flaws are inevitable when a design is pulled between diametrically opposed constraints. The real question is whether an architecture solves the problems it was designed for, reliably and efficiently.

Joydeep Sarma posted an entry claiming that Dynamo is a “flawed architecture”. I’m not really qualified to prove or disprove Mr. Sarma’s claim, but having implemented a Dynamo clone, I think that he may be a little confused about how things work in these systems. What follows are a few quotes from his write-up followed by my own responses.

Let’s say that one is storing key-value pairs in Dynamo - where the value encodes a ‘list’. If Dynamo returns a stale read for a key and claims the key is missing, the application will create a new empty list and store it back in Dynamo. This will cause the existing key to be wiped out. Depending on how ’stale’ the read was - the data loss (due to truncation of the list) can be catastrophic. This is clearly unacceptable. No application can accept unbounded data loss - not even in the case of a Disaster.

Dynamo implementations protect against this scenario by using vector clocks. If we define a “stale read” as one which returns the key (or absence thereof) along with an older vector clock, then any write which uses this older/non-existent vector clock will generate a conflict and the server will store two versions of the same key. The application then has the opportunity to resolve this conflict on the next read. When used in conjunction with quorums for reads and writes, this approach proves to be exceedingly robust.
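
To make that concrete, here’s a minimal sketch (in Python, with illustrative names rather than any particular implementation’s API) of how vector clocks turn the “stale read wipes the list” scenario into an explicit conflict instead of silent data loss:

# A vector clock maps node IDs to update counters. A write that carries an
# older (or empty) clock cannot silently overwrite a newer version; both
# versions are kept as siblings for the application to reconcile on read.

def descends(a: dict, b: dict) -> bool:
    """True if clock `a` has seen every update recorded in clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def put(stored_versions, new_value, new_clock):
    """Apply a write, keeping any stored version the incoming clock does not descend from."""
    siblings = [(val, clk) for val, clk in stored_versions
                if not descends(new_clock, clk)]
    siblings.append((new_value, new_clock))
    return siblings   # more than one entry means a conflict surfaced, not data lost

# A client that saw a stale/missing value writes back an empty list with an old clock:
stored = [(["a", "b"], {"n1": 2})]       # the real list, last written via node n1
result = put(stored, [], {})             # the stale client's empty list
assert len(result) == 2                  # both versions survive; nothing is wiped out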

Dynamo starts by saying it’s eventually consistent - but then in Section 4.5. it claims a quorum consensus scheme for ensuring some degree of consistency. It is hinted that by setting the number of reads (R) and number of writes (W) to be more than the total number of replicas (N) (ie. R+W>N) - one gets consistent data back on reads. This is flat out misleading. On close analysis one observes that there are no barriers to joining a quorum group (for a set of keys). Nodes may fail, miss out on many many updates and then rejoin the cluster - but are admitted back to the quorum group without any resynchronization barrier. As a result, reading from R copies is not sufficient to give up-to-date data.

One of the foundational assumptions in the Dynamo system is that you define as many replicas as necessary to achieve your desired level of reliability. As with any replication-based system, if you lose all of your replicas, there is no meaningful recovery. However, if we assume that some number of replicas is always functional, and we introduce an appropriate quorum on operations, we can identify the nodes that return stale data and repair them accordingly. In other words, it’s perfectly possible not to have a resync barrier on joining and yet still ensure consistency in the answers provided to the client.
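
As a sketch of why the quorum math holds even without a join barrier, here is some illustrative coordinator logic, again assuming per-version vector clocks; none of these names come from Dynamo itself:

# Illustrative read-quorum coordination (reusing the descends() comparison from
# the previous sketch). With R + W > N, the R read responses must overlap the W
# replicas that acknowledged the last successful write, so at least one response
# carries the newest version and superseded replicas can be flagged for repair.

def descends(a: dict, b: dict) -> bool:
    return all(a.get(node, 0) >= count for node, count in b.items())

def coordinate_read(responses):
    """responses: list of (node, value, vector_clock) gathered from R replicas."""
    # A version is current if no other response strictly dominates its clock.
    current = [(n, v, c) for n, v, c in responses
               if not any(descends(c2, c) and c2 != c for _, _, c2 in responses)]
    current_clocks = [c for _, _, c in current]
    stale_nodes = [n for n, _, c in responses if c not in current_clocks]
    return current, stale_nodes    # stale nodes get repaired with the current version(s)

# A replica that rejoined without any resync barrier simply loses the vote:
responses = [("n1", ["a", "b"], {"n1": 2}),
             ("n2", ["a", "b"], {"n1": 2}),
             ("n3", ["a"],      {"n1": 1})]   # n3 missed an update while down
current, stale = coordinate_read(responses)
assert stale == ["n3"]                        # n3 is read-repaired; the client sees fresh data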

It might be helpful to recall that there are three levels of repair: read repair, hinted handoff and replica synchronization. Two of the three happen in near real time, which minimizes the actual drift between nodes. Read repair deals with stale data on a per key/operation basis; the coordinator for a request can identify nodes responding with stale data and update them accordingly, using responses from other, less stale nodes. Hinted handoff is a bulk operation performed when a node rejoins the cluster: the keys updated while the node was down are replayed (in essence) to the rejoining node. Replica synchronization is typically done once a day and does require a traversal of all the data for a given partition. Tricks like Merkle trees, however, permit only the changed portion of the data to be exchanged, so in practice it’s not nearly as expensive as one might imagine in the abstract.
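
To give a feel for why replica synchronization is cheaper than a full scan-and-ship, here is a deliberately simplified sketch of the hash-comparison idea; real systems hash key ranges into a proper tree, but a single level of buckets is enough to show the principle:

# Simplified sketch of hash-based replica synchronization: compare per-range
# digests first, then exchange keys only for the ranges whose digests differ.
# A real Merkle tree adds further levels of hashes above these leaf ranges,
# narrowing the comparison even more.
import hashlib

def bucket_digests(store: dict, num_buckets: int = 8) -> dict:
    """Digest each key range (bucket) of a partition's key/value data."""
    digests = {}
    for b in range(num_buckets):
        items = sorted((k, v) for k, v in store.items()
                       if hash(k) % num_buckets == b)
        digests[b] = hashlib.sha256(repr(items).encode()).hexdigest()
    return digests

def buckets_to_sync(local: dict, remote: dict) -> list:
    """Only buckets whose digests differ need their keys exchanged."""
    mine, theirs = bucket_digests(local), bucket_digests(remote)
    return [b for b in mine if mine[b] != theirs[b]]

# Two replicas that differ in a single key exchange only that key's bucket:
a = {"user:1": "x", "user:2": "y", "user:3": "z"}
b = dict(a, **{"user:2": "y-updated"})
print(buckets_to_sync(a, b))    # one bucket, not the whole partition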

Lack of point in time consistency at the surviving replica (that is evident in this scenario) is very problematic for most applications. In cases where one transaction (B) populates entites that refer to entities populated in previous transactions (A), the effect of B being applied to the remote replica without A being applied leads to inconsistencies that applications are typically ill equipped to handle (and doing so would make most applications complicated).

The Dynamo paper makes it very clear that applications do require more logic to deal with these situations. Yes, it’s more work for the application, but in practice, it’s not that bad. It’s also important to point out that data dependencies are handled differently in these key/value stores than they are in a typical ACID environment. Usually apps will store the data in a denormalized form, so dependencies amongst key versions are minimal (if they exist at all). This makes it much easier to deal with conflicts as all the relevant data is on hand during the resolution phase.
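
For a sense of what that resolution logic can look like, here is a sketch along the lines of the shopping-cart merge described in the Dynamo paper; the union-merge policy and the write-back step are illustrative choices, not a prescribed API:

# Application-side conflict resolution on read. Because the value is
# denormalized (the whole cart lives under one key), reconciling siblings is a
# local set union; no cross-key joins are needed during resolution.

def resolve_cart(siblings):
    """siblings: the conflicting cart versions returned for a single key."""
    merged = set()
    for cart in siblings:
        merged.update(cart)       # union keeps every item ever added
    return sorted(merged)         # a deleted item can resurface (the known
                                  # trade-off of this particular merge policy)

siblings = [["book", "laptop"], ["book", "headphones"]]
cart = resolve_cart(siblings)     # ['book', 'headphones', 'laptop']
# The client then writes the merged cart back under the key, collapsing the
# siblings into a single version on the next successful put.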

I’ll leave it to someone else to do a more exhaustive analysis of Mr. Sarma’s arguments. It’s been my experience over the past two years that Dynamo is one of those systems you really have to see in action (or implement yourself) to appreciate the wonderful elegance and resiliency of the design. It’s certainly not a one-size-fits-all solution, but it works very well in the appropriate problem space.