Recently I built an application that uses AWS DynamoDB as its backend datastore. This platform from Amazon is great in many ways. It handles scalability and high availability out of the box, so there are no servers to manage and no patches or updates to install. It has a rich API to program against, and an added benefit for me is that it has an SDK for Go (Golang), the primary language I am currently using to build my Lambda services. Like other NoSQL platforms, when creating a table you only need to define the attributes used in the primary key or in indexes. Any other attribute can be added simply by including it in your put requests.
One challenge I had to overcome was supporting many different filters and sorts. As I started modeling the data, I realized I first needed to understand the application’s data access patterns. This is important for several reasons, but the most important one is that the cost of using DynamoDB is greatly affected by how you model your tables and query the data. Depending on your filtering and sorting needs, you may need multiple redundant copies of the data. DynamoDB uses two types of indexes to allow efficient querying of data, local secondary indexes (LSI) and global secondary indexes (GSI):
Local Secondary Index (LSI)
- Must be created at table creation time, not after
- Combination of the table’s partition key and range key must be unique (composite key); the index’s alternate range key does not need to be
- Attributes not projected into an LSI can still be retrieved: DynamoDB fetches the entire item from the table as part of the query, which adds latency, I/O operations, and a higher throughput cost.
- Updates are synchronous as part of the put/delete/update
- Read and write capacity units are shared with the table
Global Secondary Index (GSI)
- Can be created after table has been created
- Index updates are applied asynchronously (eventually consistent) after the put/delete/update
- Eventually consistent reads consume ½ of an RCU per 4 KB, so one RCU covers 2 x 4 KB = 8 KB
- Uses its own read and write capacity units, which are not shared with the table
- Partition key can be different from the table partition key, since a GSI is stored in partitions completely separate from the table’s
- A put/delete/update on the table also consumes WCUs on the GSI
- Combination of partition key and range key does not need to be unique
- Can only return attributes stored in the index; cannot retrieve others from the parent table
- Space consumed by an item in a global secondary index is the sum of:
  - Byte size of the base table primary key (partition and sort)
  - Byte size of the index key attribute
  - Byte size of the projected attributes (if any)
  - 100 bytes of overhead per index item
The above index information was taken from my DynamoDB Cheat Sheet.
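To make the capacity and storage bullets more concrete, here is a minimal Go sketch of the arithmetic. The function names, parameter names, and constants are my own for illustration and are not part of any AWS SDK. The key size is taken to be the base table's primary key, and I assume a read is billed for at least one 4 KB block.

```go
package main

const (
	gsiOverheadBytes = 100  // per-item overhead in a global secondary index
	rcuBlockBytes    = 4096 // one read capacity unit covers a read of up to 4 KB
)

// GSIItemSize estimates the storage used by one item in a GSI as the sum of
// the table key, the index key attribute, any projected attributes, and the
// fixed 100-byte overhead. (Hypothetical helper, for illustration only.)
func GSIItemSize(tableKeyBytes, indexKeyBytes, projectedBytes int) int {
	return tableKeyBytes + indexKeyBytes + projectedBytes + gsiOverheadBytes
}

// ReadCapacityUnits estimates the RCUs consumed by reading itemBytes of data.
// The size is rounded up to whole 4 KB blocks; a strongly consistent read
// costs one RCU per block and an eventually consistent read costs half that.
// Assumption: even a tiny read is billed for at least one block.
func ReadCapacityUnits(itemBytes int, eventuallyConsistent bool) float64 {
	blocks := (itemBytes + rcuBlockBytes - 1) / rcuBlockBytes
	if blocks < 1 {
		blocks = 1
	}
	units := float64(blocks)
	if eventuallyConsistent {
		units /= 2
	}
	return units
}
```

For example, ReadCapacityUnits(8192, true) works out to one RCU, which is the "2 x 4 KB = 8 KB" case from the list, while the same 8 KB read with strong consistency costs two.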
Once I had a thorough understanding of how indexes work in DynamoDB, I realized there was going to be a lot of redundancy in my application. That is not a bad thing, but redundant data that gets updated requires an event-driven architecture so the redundant copies can be updated under an eventual consistency model. To do this, I used DynamoDB Streams and AWS Lambda. With DynamoDB Streams, I was able to react to any change to items within certain tables. I created Lambda functions that were triggered any time an item was added, modified, or deleted; DynamoDB writes each change to the stream, and the stream passes the data to the subscribed Lambda function. I was able to do all of this without needing SQS or SNS (other serverless products from AWS). Stream records are retained for 24 hours, and delivery is retried automatically if the Lambda function does not process them successfully.
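As a rough illustration of how a stream-triggered function can keep a redundant copy in sync, here is a simplified Go sketch. It is not my actual Lambda code and does not use the AWS SDK: StreamRecord and Replica are hypothetical stand-ins, with an in-memory map playing the role of the second table. The only detail borrowed from DynamoDB itself is the set of event names (INSERT, MODIFY, REMOVE).

```go
package main

// StreamRecord is a hypothetical, simplified stand-in for one stream record.
// The real Lambda event carries more fields and typed attribute values.
type StreamRecord struct {
	EventName string            // "INSERT", "MODIFY", or "REMOVE"
	OldImage  map[string]string // item before the change (empty for INSERT)
	NewImage  map[string]string // item after the change (empty for REMOVE)
}

// Replica models a redundant copy of the data stored under a different key.
// In a real system Items would be another DynamoDB table.
type Replica struct {
	KeyOf func(item map[string]string) string // derives the replica's key
	Items map[string]map[string]string
}

// Apply replays a batch of stream records against the replica.
func (r *Replica) Apply(records []StreamRecord) {
	for _, rec := range records {
		switch rec.EventName {
		case "INSERT":
			r.Items[r.KeyOf(rec.NewImage)] = rec.NewImage
		case "MODIFY":
			// The derived key may have changed, so drop the old copy first.
			delete(r.Items, r.KeyOf(rec.OldImage))
			r.Items[r.KeyOf(rec.NewImage)] = rec.NewImage
		case "REMOVE":
			delete(r.Items, r.KeyOf(rec.OldImage))
		}
	}
}
```

The MODIFY case is the one that is easy to get wrong: if the attribute that drives the replica's key changes, the old copy has to be removed as well, which is why this sketch assumes both the old and new images are available on the record.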
The reason for the redundancy in this application is that DynamoDB is very particular about how the data must be modeled (as with other NoSQL implementations). Most of my queries were not filtering on one parent-level record, but on several, and since the partition key must be specified in a query, I was not able to query against all records, only those of one parent. This is great any time you want to look up items based on their parent ID, but terrible when you want to query for items from multiple parents. To do that you must use a table scan, which means you will be billed for all items scanned whether they are returned or not. Filters on attributes that are not part of your primary key (partition key and range key, if used) are applied only after the items have been read, so DynamoDB reads every item covered by the scan or query from the table or index and only then applies the filters.
Because of this, I had to create many tables to support the different querying needs, and each table had different partition and range keys. In order to sort items by date, for example, I ended up creating another table for the same data that used the month and year as the partition key. I then had to run concurrent queries against the table, where each query used a different month and year, and combine the results. As you can see, this required a lot more effort than just being able to query against one table.
I’m not recommending against using DynamoDB. I actually like it quite a bit and feel it has great potential. I mainly wrote this article to stress how important it is to understand your data needs, since they determine how you partition your data and store redundant copies. I now see why Streams was almost a required feature for DynamoDB; without it, the product would be very complicated to use in a system that does more than just reads. DynamoDB can also be supplemented with a good caching model to reduce the reads that are actually performed against the tables.
Coming soon, I’ll be writing about my experiences with Amazon RDS Aurora, which I am using on a current project.