This is the third article I’ve done on the TPC-H benchmark, and part 2 on AWS RedShift. Read the first article here, and the first part on AWS Redshift here.

Previously, I covered loading data into a database in RedShift, and discovered (not unsurprisingly) that the performance of SQL Server on a laptop is pretty much the same as that on RedShift, for a 1GB dataset. No great surprises there.

This time, I’m using a 100GB data set. Having spent quite a while generating the dataset, and then getting it uploaded to S3, I’m now in a position to start the loads.

GZIP is King!

One of the great features of RedShift is that it’ll load GZIP’d datasets directly into the database, you just need to add the gzip parameter at the end. So the load statements below are the same as before, they now have gzip at the end, and I’m loading from a gzip file rather than the straight text version.

copy customer from ‘s3://oldnick-tpch/customer.tbl.gz’ CREDENTIALS ‘aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>’ delimiter ‘|’ gzip;
copy orders from ‘s3://oldnick-tpch/orders.tbl.gz’ CREDENTIALS ‘aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>’ delimiter ‘|’ gzip;
copy lineitem from ‘s3://oldnick-tpch/lineitem.tbl.gz’ CREDENTIALS ‘aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>’ delimiter ‘|’ gzip;
copy nation from ‘s3://oldnick-tpch/nation.tbl.gz’ CREDENTIALS ‘aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>’ delimiter ‘|’ gzip;
copy part from ‘s3://oldnick-tpch/part.tbl.gz’ CREDENTIALS ‘aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>’ delimiter ‘|’ gzip;
copy partsupp from ‘s3://oldnick-tpch/partsupp.tbl.gz’ CREDENTIALS ‘aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>’ delimiter ‘|’ gzip;
copy region from ‘s3://oldnick-tpch/region.tbl.gz’ CREDENTIALS ‘aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>’ delimiter ‘|’ gzip;
copy supplier from ‘s3://oldnick-tpch/supplier.tbl.gz’ CREDENTIALS ‘aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>’ delimiter ‘|’ gzip;

I quite like the AWS interface for loading the data, so while the loads above are running, I can go into the AWS RedShift management console and see the progress of the loads:

image

I can also see various performance metrics while the jobs are running. Shown below are some of the more interesting ones. It’s particularly interesting that the CPU Utilisation is pegged at 100% while the load is running. I’m guessing that this is due to the loads being GZIP’d, so there’ll be an overhead of decompression in there, aside from the overhead of the load itself.

image

image

image

image

image

image

image

The Results

Having then run the same query, it took longer to run (as expected).

100gb Dataset Time to Return (sec)
Redshift (1 node cluster) cold 3min 11 sec
Redshift (1 node cluster) warm 2min 47 sec

So, with 100 times the data, the time to execute is slightly over 100 times the time. However, based on the maxing out of the CPU and the IOPS, the spec of the Redshift environment probably needs to be a higher spec for a 100gb dataset.

Comments are welcome as I’m aware that this is a specific test, and should not be taken as a rounded evaluation of Redshift.