Optimizing Redshift Datawarehouse Bulk Data Extracts
Extracting large amounts of bulk data from Amazon Redshift data warehouses using the Unload command can be 10x faster if done the right way. This is our experience finding this out.

East Agile does a lot of work building and working with very large sets of data, especially in our Data Science practice. We faced some challenges extracting data from a very large Amazon Redshift data warehouse of time series signals to support data mining using machine learning.
Originally we used the UNLOAD function to unload more than 63,000 time series into a single file on S3. This was fast but there were some problems: 

  • We couldn't read the whole file because it wouldn't fit into memory.
  • We could read the file by chunks but it would be difficult to utilize multithreading, and difficult to process as well.

So we used the PARTITION BY keyword:


 UNLOAD('SELECT feature_id, ....') TO <s3-path> PARTITION BY feature_id

The PARTITION BY feature_id  parameter makes it so that it will unload a file for each feature_id. So with ~63,000 features, it will create 63,000 files on S3.

By using this approach we can read and process multiple files in parallel. However, the UNLOAD time is 4hrs. So we came up with a final solution: we UNLOAD our data into 100 files, so each file will contain around 630 features. The new UNLOAD command looks like this:


 UNLOAD('SELECT feature_id % 200 AS part_id, ....') TO <s3-path> PARTITION BY part_id

(We used % 200 because feature_id are evens number, this is due to the nature of Redshift SERIAL data type)

The final solution greatly reduced the unloading time while still being able to process multiple files in parallel. We also tested the processing time, which also seemed to improve slightly.

Tags
Tips & Tricks
5374
Share this
Leave a comment
There are no comments about this article, let us know what you think?

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.