Configurable data retention is an important aspect of any data storage system, but there are a number of factors to consider to make it work well. Data retention is important because it helps limit the exposure and propagation of sensitive data, helps avoid the use of out-of-date data, and reduces storage costs by ensuring we only keep the data we need. At Asana, we’ve implemented configurable data retention and cleanup for our data lake in three ways: stale data cleanup, date partition cleanup, and historical snapshot cleanup.
As part of the design process, we considered pre-existing solutions, such as AWS-native retention policies, but we were unable to find something that was sufficient for our use cases. S3 retention policies are unable to handle the complexity of our retention logic, and Glue does not have built-in retention at all.
Before we get into the details of how this is implemented, it will be helpful to have some context on the systems we use and how we are using them. When we say data lake, we are talking about data that is stored in S3 and cataloged by Glue.
When we write data to S3, we always write it to a path that contains a date stamp relevant to the data we are writing, and a job timestamp unique to the current write. This looks like s3://bucket/table-name/ds=&lt;date stamp&gt;/job_ts=&lt;job timestamp&gt;/. Writing to unique paths is important for data accessibility: when we’re writing new data, we don’t want any queries currently in flight to fail. We only update the reference in Glue after the S3 write completes successfully.
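As a rough illustration of this write pattern, here’s a minimal sketch using boto3. The bucket name, file layout, and single-file upload are hypothetical simplifications, and a real partition update would preserve the full storage descriptor rather than just the location.

```python
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")


def publish_partition(database: str, table: str, ds: str, local_file: str) -> None:
    # Build a unique path: one prefix per date stamp (ds) and per write (job_ts).
    job_ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S") + "-" + uuid.uuid4().hex[:8]
    bucket = "example-data-lake"  # hypothetical bucket name
    key = f"{table}/ds={ds}/job_ts={job_ts}/part-0000.parquet"

    # 1. Write the data to the new, unique S3 path.
    s3.upload_file(local_file, bucket, key)

    # 2. Only after the write succeeds, point the Glue partition at the new location,
    #    so queries already in flight against the old location keep working.
    #    (Simplified: only the location is set here, not the full storage descriptor.)
    glue.update_partition(
        DatabaseName=database,
        TableName=table,
        PartitionValueList=[ds],
        PartitionInput={
            "Values": [ds],
            "StorageDescriptor": {"Location": f"s3://{bucket}/{table}/ds={ds}/job_ts={job_ts}/"},
        },
    )
```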
While writing to unique paths solves the issue of data availability, it creates another. If we always write data to a unique path, and then update Glue to reference the latest data, what happens when we update data? The old data is still there in S3, but is now no longer referenced by Glue. This renders it practically inaccessible, and now it is essentially just taking up space. The process we developed to clean up this stale data has three main parts, and it functions similarly to the way deletion works in a traditional database.
First, every time we update data in the data lake, we also write a stale data record (similar to a tombstone) that identifies the old data as potentially stale.
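The exact shape of these records isn’t covered here; as a hypothetical example, a stale data record might capture just enough to find and verify the old data later:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class StaleDataRecord:
    # Hypothetical fields; the real record format may differ.
    database: str              # Glue database containing the table
    table: str                 # table whose data was replaced
    s3_location: str           # old s3://bucket/table-name/ds=.../job_ts=.../ prefix
    marked_stale_at: datetime  # when the replacing write happened
```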
In order for a table to be eligible for stale data cleanup, it must be configured as such, so the second part is setting a stale data retention period. We retain the data for a period of time rather than deleting it right away so that we can recover from errors in our data pipelines: we may need to roll back an update to a table or debug something that went wrong.
The third part of this system is an asynchronous process that iterates over every table with a stale data retention period, checks for stale data records, verifies the data that was marked stale is no longer referenced anywhere in Glue, then deletes the data and the stale data record if all checks pass.
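Roughly, that cleanup job could look like the sketch below, reusing the hypothetical StaleDataRecord from above. How retention periods are configured and how stale data records are loaded and removed is omitted; this is only an outline of the checks, not our actual implementation.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable

import boto3

glue = boto3.client("glue")
s3 = boto3.resource("s3")


def is_referenced_in_glue(database: str, table: str, s3_location: str) -> bool:
    """Return True if the Glue table or any of its partitions still points at s3_location."""
    table_def = glue.get_table(DatabaseName=database, Name=table)["Table"]
    if table_def["StorageDescriptor"]["Location"].startswith(s3_location):
        return True
    paginator = glue.get_paginator("get_partitions")
    for page in paginator.paginate(DatabaseName=database, TableName=table):
        for partition in page["Partitions"]:
            if partition["StorageDescriptor"]["Location"].startswith(s3_location):
                return True
    return False


def clean_up_stale_data(records: Iterable[StaleDataRecord], retention_days: int) -> None:
    # Assumes timezone-aware timestamps on the records.
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    for record in records:
        if record.marked_stale_at > cutoff:
            continue  # still within the recovery window
        if is_referenced_in_glue(record.database, record.table, record.s3_location):
            continue  # something in Glue still points here; leave it alone
        bucket, _, prefix = record.s3_location.removeprefix("s3://").partition("/")
        s3.Bucket(bucket).objects.filter(Prefix=prefix).delete()
        # Finally, remove the stale data record itself (store-specific, omitted here).
```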
This system gives us the benefits of writing to unique paths while reducing the associated costs, both monetary and operational. We have some tables that update many times per day, and it would be wasteful to let all of the stale data sit around indefinitely and impractical to clean it up manually.
I mentioned earlier that we write to date-stamped paths, and that will be relevant here. Writing to a separate path per day is useful in two ways. For data that is partitionable by day, we can write only the data for a given day at a time, which can save time and compute resources. For smaller data sets that are not partitionable by day, we may want to save a snapshot of the entire data set each day to be able to analyze changes over time.
We might not want to actually keep this data around forever, though, for a few reasons. Sometimes we only need to keep data around for a set number of days to accomplish our goals; it is safer to keep any sensitive data around for as short a time as is practical; and the less data we store, the less it costs.
We clean up date partitions and historical snapshots in the same way. First, a user sets a retention period, either for date partitions or for historical snapshots. Then an asynchronous process runs periodically, iterates over all tables with retention periods set, and deletes anything older than the retention period, both from S3 and from Glue.
The main difference between this system and the stale data cleanup system is that this one does not check whether the data is still referenced. If the data is past its retention period, it is cleaned up, no questions asked.
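Here’s a rough sketch of what such a periodic job could look like for date partitions. The ds value format and the single-partition-key layout are assumptions, and unlike the stale data cleanup, there is deliberately no reference check before deleting.

```python
from datetime import datetime, timedelta, timezone

import boto3

glue = boto3.client("glue")
s3 = boto3.resource("s3")


def clean_up_old_partitions(database: str, table: str, retention_days: int) -> None:
    cutoff = datetime.now(timezone.utc).date() - timedelta(days=retention_days)
    paginator = glue.get_paginator("get_partitions")
    for page in paginator.paginate(DatabaseName=database, TableName=table):
        for partition in page["Partitions"]:
            # Assumes a single partition key with values like "2023-01-31".
            ds = datetime.strptime(partition["Values"][0], "%Y-%m-%d").date()
            if ds >= cutoff:
                continue
            # Past the retention period: delete from S3 and from Glue, no reference check.
            location = partition["StorageDescriptor"]["Location"]
            bucket, _, prefix = location.removeprefix("s3://").partition("/")
            s3.Bucket(bucket).objects.filter(Prefix=prefix).delete()
            glue.delete_partition(
                DatabaseName=database,
                TableName=table,
                PartitionValues=partition["Values"],
            )
```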
It’s important to be able to observe what the cleanup processes have done after the fact, so we persist a detailed log of each deletion attempt that records the database, table, S3 location, success or failure reason, volume of data deleted, and some other metadata. This has come in handy for debugging the system, recovering from errors, and providing metrics about how much data the system is saving over time.
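As an illustration, a deletion log entry might record something like the following; the fields are hypothetical, modeled on the description above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class DeletionLogEntry:
    # Hypothetical fields; the real log includes additional metadata.
    database: str
    table: str
    s3_location: str
    succeeded: bool
    failure_reason: Optional[str]  # populated only when the attempt failed
    bytes_deleted: int             # volume of data removed, useful for savings metrics
    attempted_at: datetime
```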
After implementing data retention for stale data, date partitions and historical snapshots, we found other use cases for the pattern. We now use the same pattern for cleaning up Glue table versions (Glue has a per-account limit on the number of table versions, and at our scale, we were hitting it), and cleaning up personal temp tables.
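For example, trimming old Glue table versions with the same pattern might look roughly like this sketch; the number of versions to keep is an arbitrary stand-in, not our actual setting.

```python
import boto3

glue = boto3.client("glue")


def clean_up_table_versions(database: str, table: str, versions_to_keep: int = 20) -> None:
    # Collect all version ids for the table and sort newest (highest id) first.
    paginator = glue.get_paginator("get_table_versions")
    version_ids = []
    for page in paginator.paginate(DatabaseName=database, TableName=table):
        version_ids.extend(v["VersionId"] for v in page["TableVersions"])
    version_ids.sort(key=int, reverse=True)

    # Delete everything beyond the most recent versions_to_keep, 100 ids per request.
    stale_ids = version_ids[versions_to_keep:]
    for i in range(0, len(stale_ids), 100):
        glue.batch_delete_table_version(
            DatabaseName=database,
            TableName=table,
            VersionIds=stale_ids[i : i + 100],
        )
```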
Asana Infrastructure teams strive to create a robust and efficient backbone of systems that support other engineers and data scientists so they can do their jobs without having to think about what’s happening behind the scenes. Configurable data retention is one of many problems the Core Data Infrastructure team is solving to make that a reality.