Wednesday, 13 November 2024

Difference between PartitionKey and RowKey in Windows Azure Table Storage

 All Windows Azure storage abstractions (Blob, Table, Queue) are built upon the same stack (whitepaper here). While there’s much more to tell about it, the reason why it scales is because of its partitioning logic. Whenever you store something on Windows Azure storage, it is located on some partition in the system. Partitions are used for scale out in the system. Imagine that there’s only 3 physical machines that are used for storing data in Windows Azure storage:

Based on the size and load of a partition, partitions are fanned out across these machines. Whenever a partition gets a high load or grows in size, the Windows Azure storage management can kick in and move a partition to another machine:

By doing this, Windows Azure can ensure a high throughput as well as its storage guarantees. If a partition gets busy, it’s moved to a server which can support the higher load. If it gets large, it’s moved to a location where there’s enough disk space available.

Partitions are different for every storage mechanism:

In blob storage, each blob is in a separate partition. This means that every blob can get the maximal throughput guaranteed by the system.

In queues, every queue is a separate partition.

In tables, it’s different: you decide how data is co-located in the system.

PartitionKey in Table Storage

In Table Storage, you have to decide on the PartitionKey yourself. In essence, you are responsible for the throughput you’ll get on your system. If you put every entity in the same partition (by using the same partition key), you’ll be limited to the size of the storage machines for the amount of storage you can use. Plus, you’ll be constraining the maximal throughput as there’s lots of entities in the same partition.


Should you set the PartitionKey to the same value for every entity stored? No. You’ll end up with scaling issues at some point.

Should you set the PartitionKey to a unique value for every entity stored? No. You can do this and every entity stored will end up in its own partition, but you’ll find that querying your data becomes more difficult. 

RowKey in Table Storage


A RowKey in Table Storage is a very simple thing: it’s your “primary key” within a partition. PartitionKey + RowKey form the composite unique identifier for an entity. Within one PartitionKey, you can only have unique RowKeys. If you use multiple partitions, the same RowKey can be reused in every partition.


So in essence, a RowKey is just the identifier of an entity within a partition.


PartitionKey and RowKey and performance

Before building your code, it’s a good idea to think about both properties. Don’t just assign them a guid or a random string as it does matter for performance.


The fastest way of querying? Specifying both PartitionKey and RowKey. By doing this, table storage will immediately know which partition to query and can simply do an ID lookup on RowKey within that partition.


Less fast but still fast enough will be querying by specifying PartitionKey: table storage will know which partition to query.


Less fast: querying on only RowKey. Doing this will give table storage no pointer on which partition to search in, resulting in a query that possibly spans multiple partitions, possibly multiple storage nodes as well. Wihtin a partition, searching on RowKey is still pretty fast as it’s a unique index.


Slow: searching on other properties (again, spans multiple partitions and properties).



No comments:

Post a Comment