Fun with TPC-H (Part 2b) – AWS RedShift

Filed under: Cloud, Thoughts — 2 Comments

15/03/2013

This is the third article I’ve done on the TPC-H benchmark, and part 2 on AWS RedShift. Read the first article here, and the first part on AWS Redshift here.

Previously, I covered loading data into a database in RedShift, and discovered (not unsurprisingly) that the performance of SQL Server on a laptop is pretty much the same as that on RedShift, for a 1GB dataset. No great surprises there.

This time, I’m using a 100GB data set. Having spent quite a while generating the dataset, and then getting it uploaded to S3, I’m now in a position to start the loads.

GZIP is King!

One of the great features of RedShift is that it’ll load GZIP’d datasets directly into the database, you just need to add the gzip parameter at the end. So the load statements below are the same as before, they now have gzip at the end, and I’m loading from a gzip file rather than the straight text version.

copy customer from ‘s3://oldnick-tpch/customer.tbl.gz’ CREDENTIALS ‘aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>’ delimiter ‘|’ gzip;
copy orders from ‘s3://oldnick-tpch/orders.tbl.gz’ CREDENTIALS ‘aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>’ delimiter ‘|’ gzip;
copy lineitem from ‘s3://oldnick-tpch/lineitem.tbl.gz’ CREDENTIALS ‘aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>’ delimiter ‘|’ gzip;
copy nation from ‘s3://oldnick-tpch/nation.tbl.gz’ CREDENTIALS ‘aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>’ delimiter ‘|’ gzip;
copy part from ‘s3://oldnick-tpch/part.tbl.gz’ CREDENTIALS ‘aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>’ delimiter ‘|’ gzip;
copy partsupp from ‘s3://oldnick-tpch/partsupp.tbl.gz’ CREDENTIALS ‘aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>’ delimiter ‘|’ gzip;
copy region from ‘s3://oldnick-tpch/region.tbl.gz’ CREDENTIALS ‘aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>’ delimiter ‘|’ gzip;
copy supplier from ‘s3://oldnick-tpch/supplier.tbl.gz’ CREDENTIALS ‘aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>’ delimiter ‘|’ gzip;

I quite like the AWS interface for loading the data, so while the loads above are running, I can go into the AWS RedShift management console and see the progress of the loads:

I can also see various performance metrics while the jobs are running. Shown below are some of the more interesting ones. It’s particularly interesting that the CPU Utilisation is pegged at 100% while the load is running. I’m guessing that this is due to the loads being GZIP’d, so there’ll be an overhead of decompression in there, aside from the overhead of the load itself.

The Results

Having then run the same query, it took longer to run (as expected).

100gb Dataset	Time to Return (sec)
Redshift (1 node cluster) cold	3min 11 sec
Redshift (1 node cluster) warm	2min 47 sec

So, with 100 times the data, the time to execute is slightly over 100 times the time. However, based on the maxing out of the CPU and the IOPS, the spec of the Redshift environment probably needs to be a higher spec for a 100gb dataset.

Comments are welcome as I’m aware that this is a specific test, and should not be taken as a rounded evaluation of Redshift.

Tags: Amazon Redshift, Cloud, Thoughts, TPC-H

Comments RSS feed

2 Comments:

Ivan Glushkov

22/01/2014 at 14:20

What about running the same query on a 2-nodes cluster? I mean 2*8XL nodes.

Reply
- Nick
  
  06/02/2014 at 10:06
  
  Interesting idea however not had time to try that yet. Well have a go when time permits. 🙂
  
  Reply

Dev Blog

2 Comments:

Leave a comment Cancel reply

Disclaimer

Recommended Reading

Tag Cloud

My Tweets

Follow Blog via Email

Friends & links

Pages

Monthly archives

Dev Blog