Post

DynamoDB Table Design Principles - Filtering

Abstract

We will continue exploring different technics that allow to extend Access Patterns to a Single Table.

Target DynamoDB Table Structure

diagram2.png

NoSQL DynamoDB Table Structure is generated using MultiCloud Diagrams Framework

Access Patterns to implement

In the previous post DynamoDB Table Design Principles - One-To-Many Relationships in NoSQL , when we implemented One-To-Many relationship using SingleTable Design Pattern, and we had completed first 3 Data access patterns:

IDAccess PatternAccess BYKey Condition
A1Get user profileusernamePK = USER#alex AND BEGINS_WITH(SK, ‘PROFILE’)”
A2Get orders for userBY usernamePK = USER#alex AND BEGINS_WITH(SK, ‘ORDER#’)”
A3Get single order and order itemsGSI: orderidSK = ORDER#orderid

Now we will continue Table design to follow all other requested Access Patterns:

IDAccess Pattern
A4Get orders for user by status
A5Get open orders
A5Get orders for user, shipped before 01.01.2023
A6Get orders for user, shipped Between 01.01.2023 - 01.02.2023

RDBMS-way for filtering

All this Access Pattern are analogue to WHERE operation in RDBMS, if to write in native-SQL this will look like:

1
SELECT * FROM ORDERS WHERE USERNAME = 'bob' AND ORDERDATE > GETDATE() - 14

Technic #4: Filter Expression

Both Query and Scan operations, support FilterExpressions:

1
2
3
4
5
dynamodb = boto3.client('dynamodb')
response = dynamodb.query(
  TableName='UsersAndOrdersTable',
  FilterExpression="Status = SHIPPED"
)

Problems:

  • to use the query we need always provide the PK, we can not query entire dataset.
  • more than that, because the Orders with same status are in different partitions we can not query them.

Scan FilterExpression

1
2
3
4
5
dynamodb = boto3.client('dynamodb')
response = dynamodb.scan(
  TableName='UsersAndOrdersTable',
  FilterExpression="Status = SHIPPED"
)

Scan with FilterExpression evaluation logic under the hood

Following diagram describes how FilterExpression is evaluated in DynamoDB:

stateDiagram
  [*] --> s1
  state "Read all items" as s1
  note left of s1
    LIMIT in size of 1Mb
    notes.
  end note
  state "Filter Items that do not match" as s2
  state "Return results DataSet" as s3
  note left of s3
    Pros: less network transfer
    Cons: All items are read
  end note
  s1 --> s2
  s2 --> s3
  s3 --> [*]

Detailed Problems with FilterExpression in Query and Scan

If there is a 1GB of data in the Table, and predicate that is included to FilterExpression narrows the DataSet to only 1KB of Data, since there is a HARD limit of 1Mb per single read operation - a scan will perform 1000 requests/responses network rounds to a table to read all 1GB of data and filter it to only 1KB.

This is the worst case scenario when you can experience long response time or even timeouts.

Filtered Scan will save bandwidth on a wire, will allow less aggregation on application side. But it will not save the performance and will not add effective data access.

Technic #5: Composite Sort Key

In our Table we have Order Status and Order CreatedAt as a separated Attributes in the table:

initial_table.png

Instread, we will add new attribute OrderStatusDate that combines values of these both 2 Attributes (Status & CreatedAt) - A Composite Sort Key.

table_with_composite sort_key.png

To provide A4 access pattern, we will introduce a new GSI with PartitionKey of PK and SortKey with OrderStatusDate:

img.png

This is how A4 query will look like:

IDAccess PatternAccess BYKey Condition
A4Get orders for user by statusGSI:PK = USER#alex AND BEGINS_WITH(OrderStatusDate, ‘SHIPPED#’)

More than that, Composite Sort Key allows to use both composite attributes when querying, reusing same GSI for more access patterns:

IDAccess PatternAccess BYKey Condition
A5Get orders for user, shipped before 01.01.2023GSI:PK = USER#alex AND BEFORE(OrderStatusDate, ‘SHIPPED#01.01.2023’)
A6Get orders for user, shipped Between 01.01.2023 - 01.02.2023GSI:PK = USER#alex AND BETWEEN(OrderStatusDate, ‘SHIPPED#01.01.2023’, ‘SHIPPED#01.02.2023’)

Technic #6: Sparse Index

This is more a DataWareHouse DB access pattern - query all Database and get all Orders that are with ‘PLACED’ status only.

1
SELECT * FROM ORDERS WHERE STATUS = 'PLACED'

Orders are partitions based on PK USER - so there are multiple Orders with PLACED status, but they are stored within different partitions. We need to provide a global Filter for entire Table not a single partition.

Let’s add one more attribute PlacedId - it will be a marker attribute for Items that are in PLACED status. The value can be UUID, Date, PK, etc.

sparce_index.png

Next we will introduce new GSI index view, to effectively query Order with PLACED status:

img.png

There are 4 advantages when applying sparce_index:

  • GSI index in size is limited to the number of Items that have SparceIndex key criteria
  • Entities set in index is limited to ORDER - there is no need to store relations for other types (more efficient storage)
  • once the item has changed its status its SparceIndex attribute should be dropped - after that it will not be returned in queries
  • Projection can be used to limit number of fields (attributes) that are needed when requesting (more effective storage)

Conclusions

Key takeaways whe applying Filtering to DynamoDB:

  • Avoid using FilterExpression in general (with assumption that it can be applied only on small DataSets).
  • Use Composite Sort Key - that will allow to run optimal queries for both Attributes of the Key.
  • If Composite Sort Key is sharing the same Partition Key as Table, include it as LSI on table planning time, otherwise introduce GSI.
This post is licensed under CC BY 4.0 by the author.