

This is the third article I’ve done on the TPC-H benchmark, and part 2 on AWS Redshift. Read the first article here, and the first part on AWS Redshift here.

Previously, I covered loading data into a database in Redshift, and discovered (unsurprisingly) that the performance of SQL Server on a laptop is pretty much the same as that of Redshift for a 1GB dataset. No great surprises there.

This time, I’m using a 100GB data set. Having spent quite a while generating the dataset, and then getting it uploaded to S3, I’m now in a position to start the loads.

GZIP is King!

One of the great features of Redshift is that it’ll load GZIP’d datasets directly into the database; you just need to add the gzip parameter to the end of the COPY statement. So the load statements below are the same as before, except that they now end with gzip and point at the gzipped files rather than the plain text versions.

copy customer from 's3://oldnick-tpch/customer.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy orders from 's3://oldnick-tpch/orders.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy lineitem from 's3://oldnick-tpch/lineitem.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy nation from 's3://oldnick-tpch/nation.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy part from 's3://oldnick-tpch/part.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy partsupp from 's3://oldnick-tpch/partsupp.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy region from 's3://oldnick-tpch/region.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy supplier from 's3://oldnick-tpch/supplier.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;

I quite like the AWS interface for loading the data, so while the loads above are running, I can go into the AWS Redshift management console and see the progress of the loads:

[Screenshot: the Redshift console showing the progress of the loads]
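You can also keep an eye on the loads from SQL Workbench itself. I believe Redshift records per-file load information in the stl_load_commits system table; the query below is a sketch based on my reading of the docs, so treat the column names as an assumption.

-- Hedged sketch: recent COPY activity, one row per committed file
-- (assumes the stl_load_commits system table and these columns exist).
select query, trim(filename) as filename, lines_scanned, curtime
from stl_load_commits
order by curtime desc
limit 20;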

I can also see various performance metrics while the jobs are running. Shown below are some of the more interesting ones. It’s particularly interesting that the CPU Utilisation is pegged at 100% while the load is running. I’m guessing that this is due to the loads being GZIP’d, so there’ll be an overhead of decompression in there, aside from the overhead of the load itself.

[Screenshots: cluster performance metrics captured during the load, including CPU utilisation and read/write IOPS]

The Results

I then ran the same test query as before; as expected, it took longer against the 100GB dataset.

100GB Dataset                        Time to Return
Redshift (1 node cluster), cold      3 min 11 sec
Redshift (1 node cluster), warm      2 min 47 sec
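For reference, elapsed times for Redshift queries can be pulled from the svl_qlog system table, using the same query shown further down this page; for example:

select query, elapsed, substring
from svl_qlog
order by query desc
limit 5;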

So, with 100 times the data, the execution time is a little over 100 times longer. However, given that both the CPU and the IOPS were maxed out, the Redshift environment probably needs a higher specification for a 100GB dataset.

Comments are welcome as I’m aware that this is a specific test, and should not be taken as a rounded evaluation of Redshift.

So, this is the second article I’ve written around the TPC-H benchmark (part one here). Recently, Amazon announced that their ‘fast, fully managed petabyte-scale data warehouse service’, Redshift, was available for public consumption. Having finally had some time to play, I thought I’d take it for a spin.

I was able to get a single node cluster up and running pretty quickly, and installed their sample data set easily. You can read how to go about this in their Getting Started Guide.

The initial issue I had with the sample data set was, well, it was pretty small. Ok, it got the concepts over, but I wanted more. I wanted to get an idea of performance and how it compared across the different levels. I wanted more data.

So, I decided to dump my set of test data (the 1GB TPC-H set; see part 1 for creating this) into it, and this article covers how I did it.

Getting Started

I’m going to assume that you’ve made it through steps 1-4 of the Getting Started guide above (which cover prerequisites, launching the cluster, security setup and connecting to the cluster).

Shown below are the statements used to create the TPC-H tables within the Redshift environment. You’ll need to connect to the cluster using SQL Workbench, then copy and paste these into the SQL window.

CREATE TABLE customer(
C_CustKey int ,
C_Name varchar(64) ,
C_Address varchar(64) ,
C_NationKey int ,
C_Phone varchar(64) ,
C_AcctBal decimal(13, 2) ,
C_MktSegment varchar(64) ,
C_Comment varchar(120) ,
skip varchar(64)
);

CREATE TABLE lineitem(
L_OrderKey int ,
L_PartKey int ,
L_SuppKey int ,
L_LineNumber int ,
L_Quantity int ,
L_ExtendedPrice decimal(13, 2) ,
L_Discount decimal(13, 2) ,
L_Tax decimal(13, 2) ,
L_ReturnFlag varchar(64) ,
L_LineStatus varchar(64) ,
L_ShipDate datetime ,
L_CommitDate datetime ,
L_ReceiptDate datetime ,
L_ShipInstruct varchar(64) ,
L_ShipMode varchar(64) ,
L_Comment varchar(64) ,
skip varchar(64)
);
CREATE TABLE nation(
N_NationKey int ,
N_Name varchar(64) ,
N_RegionKey int ,
N_Comment varchar(160) ,
skip varchar(64)
);
CREATE TABLE orders(
O_OrderKey int ,
O_CustKey int ,
O_OrderStatus varchar(64) ,
O_TotalPrice decimal(13, 2) ,
O_OrderDate datetime ,
O_OrderPriority varchar(15) ,
O_Clerk varchar(64) ,
O_ShipPriority int ,
O_Comment varchar(80) ,
skip varchar(64)
);

CREATE TABLE part(
P_PartKey int ,
P_Name varchar(64) ,
P_Mfgr varchar(64) ,
P_Brand varchar(64) ,
P_Type varchar(64) ,
P_Size int ,
P_Container varchar(64) ,
P_RetailPrice decimal(13, 2) ,
P_Comment varchar(64) ,
skip varchar(64)
);
CREATE TABLE partsupp(
PS_PartKey int ,
PS_SuppKey int ,
PS_AvailQty int ,
PS_SupplyCost decimal(13, 2) ,
PS_Comment varchar(200) ,
skip varchar(64)
);
CREATE TABLE region(
R_RegionKey int ,
R_Name varchar(64) ,
R_Comment varchar(160) ,
skip varchar(64)
);
CREATE TABLE supplier(
S_SuppKey int ,
S_Name varchar(64) ,
S_Address varchar(64) ,
S_NationKey int ,
S_Phone varchar(18) ,
S_AcctBal decimal(13, 2) ,
S_Comment varchar(105) ,
skip varchar(64)
);

Next up, we need to get some data into it. I’ve had a copy of the TPC-H files sitting on my S3 account for a while, so I was hoping to just point Redshift at that (just like the sample code does). This was where I ran into my first issue. There may be an easier way, but I wanted to do it quickly. The problem was that I couldn’t get the S3 URL syntax to work, and this appears to be because my S3 buckets are sitting in Ireland (EU). The S3 syntax only seems to work if you are using ‘US Standard’ as your S3 storage. I could be wrong, but I’m not an S3 expert.
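(As an aside: the COPY command also appears to support a REGION clause these days, which might avoid moving the bucket at all. I haven’t tested this, so treat the statement below as a sketch that assumes the bucket stays in eu-west-1.)

copy customer from 's3://oldnick-tpch/customer.tbl' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' region 'eu-west-1';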

Anyway, having created an S3 bucket in US Standard, and transferred the files over, I used the following to copy the contents from these files into the tables created in Redshift.

copy customer from 's3://oldnick-tpch/customer.tbl' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|';
copy orders from 's3://oldnick-tpch/orders.tbl' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|';
copy lineitem from 's3://oldnick-tpch/lineitem.tbl' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|';
copy nation from 's3://oldnick-tpch/nation.tbl' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|';
copy part from 's3://oldnick-tpch/part.tbl' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|';
copy partsupp from 's3://oldnick-tpch/partsupp.tbl' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|';
copy region from 's3://oldnick-tpch/region.tbl' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|';
copy supplier from 's3://oldnick-tpch/supplier.tbl' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|';

You’ll need to replace <Your-Access-Key-ID> with your Amazon access key and <Your-Secret-Access-Key> with your secret key, though I bet you’d guessed that. Also, note that it’s possible to load from a gzipped file by adding the gzip parameter to the copy statement, though I didn’t discover this until after the load.
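For example, the lineitem load above becomes:

copy lineitem from 's3://oldnick-tpch/lineitem.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;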

After waiting a little while, though not too long, for Redshift to bring the data in from S3, you can use these queries to check the counts.

select count(*) from customer;
select count(*) from orders;
select count(*) from lineitem;
select count(*) from nation;
select count(*) from part;
select count(*) from partsupp;
select count(*) from region;
select count(*) from supplier;

Next, the Developer Guide section covering loading data into Redshift says you should run the following statements after loading. Analyze updates the database statistics, and Vacuum then reclaims storage space.

analyze;
vacuum;
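Both commands also accept a table name if you only want to hit the biggest table rather than the whole database, for example:

analyze lineitem;
vacuum lineitem;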

So, there we go: we’ve now got a Redshift cluster with the TPC-H tables loaded. Next, I thought I’d do a basic test to compare results.

My test query for this is shown below, and just does some aggregation against the lineitem table (6 million or so rows).

select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price, sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge, avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc,  count(*) as count_order
from lineitem
group by l_returnflag, l_linestatus
order by l_returnflag,l_linestatus;

So I ran this on my laptop (i7, 12GB RAM, 512GB SSD) a couple of times: once as a straight query and once with a columnstore index on the table, both cold (after a restart) and warm (second run).

SQL Server times are taken from SET STATISTICS TIME ON output.
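For anyone unfamiliar with it, that just means wrapping the query like so (a minimal sketch, using a cut-down version of the aggregation query above):

SET STATISTICS TIME ON;

-- Cut-down version of the test query; CPU time and elapsed time
-- are reported in the Messages tab.
select l_returnflag, l_linestatus, count(*) as count_order
from lineitem
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;

SET STATISTICS TIME OFF;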

Analysing Redshift was interesting. Since I’ve not done much with PostgreSQL, I had a look through the Redshift documentation to see what was going on. I found an interesting page showing how to determine whether a query is running from disk. Working through this, I saw that once I had the query id (from the first query below), I could get the query details, including memory used and times.

Getting the Query Id

select query, elapsed, substring
from svl_qlog
order by query
desc limit 5;

select *
from svl_query_summary
where query = 5931

[Screenshot: svl_query_summary output for the test query]
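The columns I found most interesting in there are the is_diskbased flag and workmem; a narrower version of the query above, assuming those column names, would be:

-- Hedged sketch: per-step memory use and whether any step spilled to disk.
select step, label, rows, workmem, is_diskbased
from svl_query_summary
where query = 5931
order by workmem desc;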

So, having seen those figures, I had a look at the cluster details.

Initially I was using 1 node, so I went up a notch, to a 2 node cluster of the more powerful nodes.

Single Node Testing
[Screenshot: single-node cluster configuration]
Multi Node Testing
[Screenshot: 2-node cluster configuration]

The Results

Configuration                            Time to Return
Laptop – SQL 2012 (cold)                 24,515 ms CPU time, 6,475 ms elapsed
Laptop – SQL 2012 (warm)                 24,016 ms CPU time, 6,060 ms elapsed
Laptop – SQL 2012 Columnstore (cold)     531 ms CPU time, 258 ms elapsed
Laptop – SQL 2012 Columnstore (warm)     389 ms CPU time, 112 ms elapsed
Redshift (1 node cluster)                1.24 sec
Redshift (2 node cluster)                1.4 sec

So, obviously, I’m not stretching the performance of the Redshift cluster.

Part 2b of this will cover similar tests, though I’ll be doing it with a 100GB TPC-H test data set.

Keep ’em peeled for the next post!

T-SQL Tuesday again, and this month it’s hosted by Amit Banerjee at TroubleshootingSQL.

One of the things that I’ve become more aware of, due to preparation for the MCM certification and working on larger data warehousing projects, is that using multiple database files will pretty much always give you a performance improvement.

From testing that I’ve done, this is even apparent on small local databases.

As part of a series of blog posts that I’m doing around the TPC-H benchmarks, I’ve been loading and reloading a dataset of approx. 8.6 million records (across 8 entities). This dataset is the default 1GB set from the TPC-H benchmark (downloadable here (approx. 276MB), or you can read my previous blog article on creating it yourself).

To get some decent figures to show how much of an improvement you can get with multiple files, I’ve created a script which does the following steps.

    1. Creates the database (2GB per database file, and 512MB for the log file)
    2. Creates the tables
    3. Bulk loads data using a set of flat files
    4. Gives a count of each of the tables

A copy of the script is available here.
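To give a flavour of step 1, the CREATE DATABASE statement looks roughly like this (a minimal sketch with made-up file paths; the real paths and file counts vary per test, and the linked script has the full details):

CREATE DATABASE [TPCH_MultiFile] ON PRIMARY
( NAME = N'tpch_1', FILENAME = N'C:\Data\tpch_1.mdf', SIZE = 2048MB, FILEGROWTH = 256MB ),
( NAME = N'tpch_2', FILENAME = N'D:\Data\tpch_2.ndf', SIZE = 2048MB, FILEGROWTH = 256MB ),
( NAME = N'tpch_3', FILENAME = N'E:\Data\tpch_3.ndf', SIZE = 2048MB, FILEGROWTH = 256MB )
LOG ON
( NAME = N'tpch_log', FILENAME = N'C:\Logs\tpch_log.ldf', SIZE = 512MB, FILEGROWTH = 128MB );
GO
-- The D:\ and E:\ paths above are placeholders; point each file at whichever
-- drive (internal, USB pen drive, eSATA) the test calls for.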

I carried out a few different tests, based on:

  • Single or Multiple files
  • Different Media
    • Running on the C drive (5400rpm SATA drive)
    • Running on USB Pen Drives
    • Running on an eSATA drive
  • Splitting over multiple media
  • Having the Transaction log stored separately

The results I found are shown below (times are in minutes:seconds:milliseconds, mm:ss:ms).

[Table: database build and data load times for each configuration]

The benchmark (baseline) is the run with a single file, on my internal drive.

DB Build is the time to create the database; note that I’m using Instant File Initialisation, and so should you (unless you have a very good reason not to!).

Data load is the time to build the tables and load them.
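As an aside on Instant File Initialisation: if you’re not sure whether it’s actually enabled, newer builds of SQL Server (2016 SP1 onwards, I believe) expose it through a DMV. Treat the column name as an assumption and check the docs for your build.

-- Hedged: instant_file_initialization_enabled was added to this DMV in
-- SQL Server 2016 SP1, as far as I'm aware.
SELECT servicename, instant_file_initialization_enabled
FROM sys.dm_server_services;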

Conclusions

Effectively, you can see the following:

    1. Regardless of media, multiple files always give a performance improvement
    2. USB pen drives are rubbish. Don’t use them for databases
    3. A fast drive, separate from the O/S and separate from the system databases, will give a significant improvement

The best performance I managed to achieve was with multiple database files on the eSATA drive.

However, given that I work primarily on a laptop, the fact that I can get a huge improvement (over a third!) by using multiple database files on the internal drive is impressive.

I’d be interested to know how much of an improvement you get on this; how much does your mileage vary?

Thanks for reading, and thanks to Amit for hosting.

This’ll be the first in a series of articles I’m planning on writing about the TPC-H Benchmark, and SQL Server.

Recently I attended an HP/Microsoft event, where they discussed the new offerings from the two companies, around the FastTrack Data Warehouse.

Having spent some time reading this document, and looking at the benchmarks in there (BCR – Benchmark Consumption Rate, MCR – Maximum CPU Core Consumption Rate, UDC – Required User Data Capacity and more importantly the QphH), I thought I’d look into them a little more.

I’m going to cover the experiences I had in getting this up and running, which was interesting, since I couldn’t find a reasonable example anywhere that ‘anybody’ could use to build a database from this dataset. Also, I wanted to run the benchmarks and see where my laptop sat compared with some of the meatier servers in the TPC-H Top Tens.

To carry these tests out, I’m running on my laptop (Sony Vaio, with an i5-430M (2 cores, 4 threads at 2.27GHz) and 8GB RAM). I’m aware that this is significantly different to the specification of some of these servers; however, there is also a significant difference in price, and it’s an interesting (to some) test…

Overview

The benchmarks are based on a standard set of benchmark data, the TPC-H data, which is available to download here.

“The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.” – TPC Website

Getting Started – DBGen

To get the dataset, you first need to download the DBGEN reference data set, which is available from the link above. The dataset is generated by an application that the download contains in C++ source form. I opened the project files using Visual Studio 2010, built them, and got a resulting dbgen.exe. This was much more straightforward than I was expecting.

There are a number of parameters for this application, but if you run it with the default settings, you get the 1GB dataset. If you set the -s parameter (the scale factor) to 10 (so -s10), you get a 10GB dataset. The published TPC-H results all use scale factors of 100 and upwards (so -s100 to get a 100GB dataset).

Getting Started – SQL Server

Next up, I need to create a database. For the initial test, I created the database on an external drive (connected through eSATA). I used the following script.

CREATE DATABASE [TPCH] ON  PRIMARY
( NAME = N'tpch', FILENAME = N'X:\TPC-H\tpch1gb.mdf' , SIZE = 1024MB , MAXSIZE = UNLIMITED, FILEGROWTH = 128MB ),
( NAME = N'tpch_2', FILENAME = N'X:\TPC-H\tpch1gb_2.ndf' , SIZE = 1024MB , MAXSIZE = UNLIMITED, FILEGROWTH = 128MB )
LOG ON ( NAME = N'tpch_log', FILENAME = N'X:\TPC-H\tpch1gb_log.ldf' , SIZE = 512MB , MAXSIZE = 2048GB , FILEGROWTH = 100MB )
GO

Database tables were created using the following script, which is all built using the schema available in the TPC-H documentation.

CREATE TABLE [dbo].[customer](
    [C_CustKey] [int] NULL,
    [C_Name] [varchar](64) NULL,
    [C_Address] [varchar](64) NULL,
    [C_NationKey] [int] NULL,
    [C_Phone] [varchar](64) NULL,
    [C_AcctBal] [decimal](13, 2) NULL,
    [C_MktSegment] [varchar](64) NULL,
    [C_Comment] [varchar](120) NULL,
    [skip] [varchar](64) NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[lineitem](
    [L_OrderKey] [int] NULL,
    [L_PartKey] [int] NULL,
    [L_SuppKey] [int] NULL,
    [L_LineNumber] [int] NULL,
    [L_Quantity] [int] NULL,
    [L_ExtendedPrice] [decimal](13, 2) NULL,
    [L_Discount] [decimal](13, 2) NULL,
    [L_Tax] [decimal](13, 2) NULL,
    [L_ReturnFlag] [varchar](64) NULL,
    [L_LineStatus] [varchar](64) NULL,
    [L_ShipDate] [datetime] NULL,
    [L_CommitDate] [datetime] NULL,
    [L_ReceiptDate] [datetime] NULL,
    [L_ShipInstruct] [varchar](64) NULL,
    [L_ShipMode] [varchar](64) NULL,
    [L_Comment] [varchar](64) NULL,
    [skip] [varchar](64) NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[nation](
    [N_NationKey] [int] NULL,
    [N_Name] [varchar](64) NULL,
    [N_RegionKey] [int] NULL,
    [N_Comment] [varchar](160) NULL,
    [skip] [varchar](64) NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[orders](
    [O_OrderKey] [int] NULL,
    [O_CustKey] [int] NULL,
    [O_OrderStatus] [varchar](64) NULL,
    [O_TotalPrice] [decimal](13, 2) NULL,
    [O_OrderDate] [datetime] NULL,
    [O_OrderPriority] [varchar](15) NULL,
    [O_Clerk] [varchar](64) NULL,
    [O_ShipPriority] [int] NULL,
    [O_Comment] [varchar](80) NULL,
    [skip] [varchar](64) NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[part](
    [P_PartKey] [int] NULL,
    [P_Name] [varchar](64) NULL,
    [P_Mfgr] [varchar](64) NULL,
    [P_Brand] [varchar](64) NULL,
    [P_Type] [varchar](64) NULL,
    [P_Size] [int] NULL,
    [P_Container] [varchar](64) NULL,
    [P_RetailPrice] [decimal](13, 2) NULL,
    [P_Comment] [varchar](64) NULL,
    [skip] [varchar](64) NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[partsupp](
    [PS_PartKey] [int] NULL,
    [PS_SuppKey] [int] NULL,
    [PS_AvailQty] [int] NULL,
    [PS_SupplyCost] [decimal](13, 2) NULL,
    [PS_Comment] [varchar](200) NULL,
    [skip] [varchar](64) NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[region](
    [R_RegionKey] [int] NULL,
    [R_Name] [varchar](64) NULL,
    [R_Comment] [varchar](160) NULL,
    [skip] [varchar](64) NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[supplier](
    [S_SuppKey] [int] NULL,
    [S_Name] [varchar](64) NULL,
    [S_Address] [varchar](64) NULL,
    [S_NationKey] [int] NULL,
    [S_Phone] [varchar](18) NULL,
    [S_AcctBal] [decimal](13, 2) NULL,
    [S_Comment] [varchar](105) NULL,
    [skip] [varchar](64) NULL
) ON [PRIMARY]
GO

Finally, we get to load the data. I had the source files in the ‘C:\TPC-H\source\’ directory, loading into the database on the external drive (to reduce I/O contention).

The data was loaded using the following:

BULK INSERT part FROM 'C:\TPC-H\source\part.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
BULK INSERT customer FROM 'C:\TPC-H\source\customer.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
BULK INSERT orders FROM 'C:\TPC-H\source\orders.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
BULK INSERT partsupp FROM 'C:\TPC-H\source\partsupp.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
BULK INSERT supplier FROM 'c:\TPC-H\source\supplier.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
BULK INSERT lineitem FROM 'C:\TPC-H\source\lineitem.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
BULK INSERT nation FROM 'C:\TPC-H\source\nation.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
BULK INSERT region FROM 'C:\TPC-H\source\region.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|')
The row counts, source file sizes, and load times for the two datasets were as follows:

Table                        1GB Dataset    10GB Dataset
Customers                    150,000        1,500,000
Line Items                   6,001,215      59,986,052
Nation                       25             25
Orders                       1,500,000      15,000,000
Part                         200,000        2,000,000
PartSupp (Part Supplier)     800,000        8,000,000
Supplier                     10,000         100,000
Region                       5              5
Resulting source file size   1.03GB         10.5GB
Time to load                 50 seconds     9 minutes 56 seconds
Resulting SQL DB size        1.13GB         11.38GB
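If you want to capture timings like these yourself, one simple approach is to bracket the load batch with a timestamp, for example (a sketch, shown for a single table):

DECLARE @start datetime2 = SYSDATETIME();

BULK INSERT lineitem FROM 'C:\TPC-H\source\lineitem.tbl' WITH (TABLOCK, DATAFILETYPE='char', CODEPAGE='raw', FIELDTERMINATOR = '|');

-- Elapsed load time for the statement above, in seconds.
SELECT DATEDIFF(second, @start, SYSDATETIME()) AS load_seconds;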

This gives us a basis from which to run the benchmarks, and do some optimisations.

Next time, I’ll look at running through some of the optimisations, and get some benchmarks out of this system.


You can also read an article on this topic by Neil Robbins, here: http://blog.semeosis.com/2011/07/03/putting-the-star-schema-benchmark-to-work-with-sql-server-2008r2/