Monday, May 14, 2012

The correct way of using DynamoDB BatchWriteItem with boto

In my previous post I wrote about the advantages of using the BatchWriteItem functionality in DynamoDB. As it turns out, I was overly optimistic when I wrote my initial code: I only called the batch_write_item method of the layer2 module in boto once.

The problem with this approach is that many of the batched inserts can fail, and in practice this happens quite frequently, probably because of transient errors or because the writes exceed the table's provisioned throughput and get throttled. The correct approach is to inspect the response object returned by batch_write_item -- here is an example of such an object:

{'Responses': {'mytable': {'ConsumedCapacityUnits': 5.0}},
 'UnprocessedItems': {'mytable': [
     {'PutRequest': {'Item': {'mykey': 'key1', 'myvalue': 'value1'}}},
     {'PutRequest': {'Item': {'mykey': 'key2', 'myvalue': 'value2'}}},
     {'PutRequest': {'Item': {'mykey': 'key3', 'myvalue': 'value3'}}}]}}

You need to look for the value corresponding to the 'UnprocessedItems' key. This value is a dictionary keyed by the name of the table you're inserting items in. The value corresponding to that key gives you a list of other dictionaries with keys corresponding to the operations you applied to the table ('PutRequest' in my case). Going one level deeper allows you to finally obtain the attributes (keys + values) of the items that failed, which you can then try to re-insert.
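To make the nesting concrete, here is a minimal sketch that walks a response shaped like the example above and pulls out the attributes of the items that were not processed. The helper name unprocessed_put_items is made up for this illustration and is not part of boto; the response is the sample object shown earlier.

```python
# Sample response, copied from the example object shown in the post.
sample_response = {
    'Responses': {'mytable': {'ConsumedCapacityUnits': 5.0}},
    'UnprocessedItems': {'mytable': [
        {'PutRequest': {'Item': {'mykey': 'key1', 'myvalue': 'value1'}}},
        {'PutRequest': {'Item': {'mykey': 'key2', 'myvalue': 'value2'}}},
        {'PutRequest': {'Item': {'mykey': 'key3', 'myvalue': 'value3'}}}]}}


def unprocessed_put_items(response, table_name):
    """Return the attribute dicts of the PutRequest items that were not written.

    Hypothetical helper (not a boto function). A missing 'UnprocessedItems'
    key or a missing table entry is treated as "nothing left to retry".
    """
    unprocessed = response.get('UnprocessedItems', {})
    requests = unprocessed.get(table_name, [])
    # Each entry is keyed by the operation name; only puts are handled here.
    return [req['PutRequest']['Item'] for req in requests if 'PutRequest' in req]


failed_items = unprocessed_put_items(sample_response, 'mytable')
failed_keys = [item['mykey'] for item in failed_items]
```

With the sample object, failed_keys holds the three keys that need to be re-inserted. Delete operations would show up under a different operation key and are ignored by this sketch.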

So basically you need to stay in a loop and keep calling batch_write_item until UnprocessedItems is empty. Here is a gist containing code that reads a log file in lzop format, looks for lines containing a key + white space + a value, then inserts items based on those key/value pairs into a DynamoDB table. I've been pretty happy with this approach.
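As a rough illustration of the loop described above (this is not the code from the gist), the sketch below keeps resubmitting whatever comes back under 'UnprocessedItems'. It deliberately does not call boto directly: send_batch is a hypothetical callable that you would write around batch_write_item, taking a table name and a list of item dictionaries and returning the response dictionary. The max_attempts and pause parameters are additions of mine, not something the original code is known to have.

```python
import time


def write_until_done(send_batch, table_name, items, max_attempts=None, pause=0.0):
    """Resubmit unprocessed items until none are left.

    send_batch   -- hypothetical wrapper: send_batch(table_name, items) -> response dict
    max_attempts -- optional cap on the number of calls (None means no cap)
    pause        -- seconds to sleep between attempts (0 means no sleep)

    Returns the list of items that are still unprocessed (empty on success).
    Assumes the caller passes a batch small enough for a single call.
    """
    pending = list(items)
    attempts = 0
    while pending:
        if max_attempts is not None and attempts >= max_attempts:
            break  # give up and hand the leftovers back to the caller
        response = send_batch(table_name, pending)
        attempts += 1
        requests = response.get('UnprocessedItems', {}).get(table_name, [])
        pending = [req['PutRequest']['Item'] for req in requests
                   if 'PutRequest' in req]
        if pending and pause:
            time.sleep(pause)
    return pending
```

Returning the leftover items instead of raising lets the caller decide whether to log them, queue them for later, or fail the whole job. A short pause between attempts may help when the retries are caused by throttling rather than by network problems.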

Before I finish, I'd like to reiterate the gripe I have about the static nature of determining your Read and Write Throughput when dealing with DynamoDB. I understand that it makes life easier for AWS in terms of the capacity planning they have to do on their end to scale the table across multiple instances, but it's a black art when it comes to capacity planning you need to do as a user. You almost always end up overcommitting as a DynamoDB user, and it's hard to make sense sometimes of the capacity units you're consuming, especially when doing inserts of large volumes of data.


Jonathan Q said...

Very good advice. I had to do something similar when I was writing the batch_get with boto. I just kept iterating over the results when there were "Unprocessed" keys.

However one thing we added (and I would suggest here for anyone using this in production code) is to set some sort of a limit on the number of times you keep trying. If there was an item that just couldn't be added to Dynamo (or couldn't be retrieved in a batch_get) - you'd get stuck in an infinite loop trying to fetch the item.

It's probably unlikely - but relying on Dynamo to always (eventually) return an item or always allow you to put an item is probably a bad idea.

We just set an arbitrary limit and if we reached that number of recursive calls, just admit those items failed and notify as appropriate and return.

Grig Gheorghiu said...

That's a great tip, Jonathan! Thanks so much for sharing it.
