East Agile does a lot of work building and analyzing very large data sets, especially in our Data Science practice. We faced some challenges extracting data from a very large Amazon Redshift data warehouse of time-series signals to support data mining with machine learning.
Originally, we used the UNLOAD command to export more than 63,000 time series into a single file on S3. This was fast, but there were some problems:
- We couldn't read the whole file because it wouldn't fit into memory.
- We could read the file in chunks, but that made it hard to take advantage of multithreading, and the processing itself was awkward as well.
So we used the PARTITION BY clause:
UNLOAD('SELECT feature_id, ....') TO <s3-path> PARTITION BY (feature_id)
With PARTITION BY feature_id, Redshift unloads a separate file for each feature_id value. With ~63,000 features, that means ~63,000 files on S3.
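One file per partition means each part can be read and processed independently. A minimal sketch of fanning that work out with a thread pool (the `process_partition` body, the key layout, and the worker count are hypothetical placeholders, not our actual pipeline):

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(part_key: str) -> int:
    # Hypothetical worker: in a real pipeline this would download the
    # objects under s3://<bucket>/<prefix>/feature_id=<id>/ and run the
    # processing step. Here it just returns the key length as a stand-in.
    return len(part_key)

def process_all(part_keys: list[str], max_workers: int = 8) -> list[int]:
    # S3 reads are I/O-bound, so a thread pool is enough to overlap them.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_partition, part_keys))
```

Because each partition is a self-contained file, the workers need no coordination beyond collecting their results.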
This approach let us read and process multiple files in parallel. However, the UNLOAD itself took about 4 hours. So we came up with a final solution: UNLOAD the data into 100 files, so that each file contains around 630 features. The new UNLOAD command looks like this:
UNLOAD('SELECT feature_id % 200 AS part_id, ....') TO <s3-path> PARTITION BY (part_id)
(We used % 200 because the feature_id values are all even numbers, a side effect of Redshift's SERIAL-style identity column generation, so the modulo yields only 100 distinct part_id values.)
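A quick sanity check of the arithmetic: taking even IDs modulo 200 produces exactly 100 distinct even remainders, so 63,000 features spread evenly across 100 partitions of 630 each. The ID range below is illustrative, not our real key space:

```python
from collections import Counter

# Illustrative even IDs: 2, 4, ..., 126000 (63,000 features in total).
feature_ids = range(2, 2 * 63_000 + 1, 2)

# part_id = feature_id % 200, mirroring the SELECT in the UNLOAD command.
parts = Counter(fid % 200 for fid in feature_ids)

assert len(parts) == 100              # 100 distinct partitions
assert set(parts.values()) == {630}   # 630 features in each partition
```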
The final solution greatly reduced the unload time while still letting us process multiple files in parallel. In our tests, processing time also improved slightly.