
Recently, I’ve been learning about the statistical language R. To allow me to do some testing against it, I wanted to get a decent-sized dataset into SQL Server.

Having found a good dataset, the Airlines data from the 2009 Data Expo, I downloaded it and proceeded to try to load it into SQL Server. I failed almost straight away, since the downloaded files are UTF-8 encoded, and SQL Server cannot load them directly using bulk copy or BCP.

So, I looked around for another option. I could have used SSIS; however, there seemed to be an issue with the Data Conversion transformation, where it wasn’t saving the changes after converting the fields from UTF-8 to ASCII.

My next choice was to see if C# would help. It did. The code below allowed me to take the source file (specified in _filePath) and load it into a table in the database specified in _connString.

The code works by creating a DataTable into which the data from the CSV file is loaded. The data is then fed into SQL Server in batches of 100,000 rows (parameterised using _batchSize).

Here is the code. It performs pretty well on my machine, running at 20-25k rows a second.

using Microsoft.VisualBasic.FileIO;
using System.Data;
using System.Data.SqlClient;

    public class LoadUTF8CSVIntoSQL
    {
        protected const string _filePath = @"C:\Inbox\DataSets\Flights\extract\1987.csv";
        protected const string _connString = @"Server=localhost;Database=Datasets;Trusted_Connection=True;";
        protected const int _batchSize = 100000;

        protected static void LoadDataTable(SqlBulkCopy sqlBulkCopy, SqlConnection sqlConnection, DataTable dataTable)
        {
            try
            {
                sqlBulkCopy.WriteToServer(dataTable);

                dataTable.Rows.Clear();
                System.Console.WriteLine("{0}\t- Batch Written",System.DateTime.Now.ToString());
            }
            catch (System.Exception ex)
            {
                System.Console.WriteLine("ERROR : "+ex.Message);
            }
        }

        public static void LoadCsvDataIntoSqlServer(string sFilePath)
        {
            try
            {
                // This should be the full path
                var fileName = "";
                if (sFilePath.Length == 0)
                {
                    fileName = _filePath;
                }
                else
                {
                    fileName = sFilePath;
                }

                var createdCount = 0;

                using (var textFieldParser = new TextFieldParser(fileName))
                {
                    textFieldParser.TextFieldType = FieldType.Delimited;
                    textFieldParser.Delimiters = new[] { "," };
                    textFieldParser.HasFieldsEnclosedInQuotes = false;

                    var connectionString = _connString;

                    var dataTable = new DataTable("airlines");

                    // Specify the columns in the Data Table
                    dataTable.Columns.Add("Yr");
                    dataTable.Columns.Add("Mth");
                    dataTable.Columns.Add("DayofMonth");
                    dataTable.Columns.Add("DayOfWeek");
                    dataTable.Columns.Add("DepTime");
                    dataTable.Columns.Add("CRSDepTime");
                    dataTable.Columns.Add("ArrTime");
                    dataTable.Columns.Add("CRSArrTime");
                    dataTable.Columns.Add("UniqueCarrier");
                    dataTable.Columns.Add("FlightNum");
                    dataTable.Columns.Add("TailNum");
                    dataTable.Columns.Add("ActualElapsedTime");
                    dataTable.Columns.Add("CRSElapsedTime");
                    dataTable.Columns.Add("AirTime");
                    dataTable.Columns.Add("ArrDelay");
                    dataTable.Columns.Add("DepDelay");
                    dataTable.Columns.Add("Origin");
                    dataTable.Columns.Add("Dest");
                    dataTable.Columns.Add("Distance");
                    dataTable.Columns.Add("TaxiIn");
                    dataTable.Columns.Add("TaxiOut");
                    dataTable.Columns.Add("Cancelled");
                    dataTable.Columns.Add("CancellationCode");
                    dataTable.Columns.Add("Diverted");
                    dataTable.Columns.Add("CarrierDelay");
                    dataTable.Columns.Add("WeatherDelay");
                    dataTable.Columns.Add("NASDelay");
                    dataTable.Columns.Add("SecurityDelay");
                    dataTable.Columns.Add("LateAircraftDelay");

                    using (var sqlConnection = new SqlConnection(connectionString))
                    {
                        sqlConnection.Open();

                        // Initialise Bulk Copy Object
                        var sqlBulkCopy = new SqlBulkCopy(sqlConnection)
                        {
                            DestinationTableName = "airlines"
                        };

                        // Define column mappings between Data Table and Target Table
                        sqlBulkCopy.ColumnMappings.Add("Yr", "Yr");
                        sqlBulkCopy.ColumnMappings.Add("Mth", "Mth");
                        sqlBulkCopy.ColumnMappings.Add("DayofMonth", "DayofMonth");
                        sqlBulkCopy.ColumnMappings.Add("DayOfWeek", "DayOfWeek");
                        sqlBulkCopy.ColumnMappings.Add("DepTime", "DepTime");
                        sqlBulkCopy.ColumnMappings.Add("CRSDepTime", "CRSDepTime");
                        sqlBulkCopy.ColumnMappings.Add("ArrTime", "ArrTime");
                        sqlBulkCopy.ColumnMappings.Add("CRSArrTime", "CRSArrTime");
                        sqlBulkCopy.ColumnMappings.Add("UniqueCarrier", "UniqueCarrier");
                        sqlBulkCopy.ColumnMappings.Add("FlightNum", "FlightNum");
                        sqlBulkCopy.ColumnMappings.Add("TailNum", "TailNum");
                        sqlBulkCopy.ColumnMappings.Add("ActualElapsedTime", "ActualElapsedTime");
                        sqlBulkCopy.ColumnMappings.Add("CRSElapsedTime", "CRSElapsedTime");
                        sqlBulkCopy.ColumnMappings.Add("AirTime", "AirTime");
                        sqlBulkCopy.ColumnMappings.Add("ArrDelay", "ArrDelay");
                        sqlBulkCopy.ColumnMappings.Add("DepDelay", "DepDelay");
                        sqlBulkCopy.ColumnMappings.Add("Origin", "Origin");
                        sqlBulkCopy.ColumnMappings.Add("Dest", "Dest");
                        sqlBulkCopy.ColumnMappings.Add("Distance", "Distance");
                        sqlBulkCopy.ColumnMappings.Add("TaxiIn", "TaxiIn");
                        sqlBulkCopy.ColumnMappings.Add("TaxiOut", "TaxiOut");
                        sqlBulkCopy.ColumnMappings.Add("Cancelled", "Cancelled");
                        sqlBulkCopy.ColumnMappings.Add("CancellationCode", "CancellationCode");
                        sqlBulkCopy.ColumnMappings.Add("Diverted", "Diverted");
                        sqlBulkCopy.ColumnMappings.Add("CarrierDelay", "CarrierDelay");
                        sqlBulkCopy.ColumnMappings.Add("WeatherDelay", "WeatherDelay");
                        sqlBulkCopy.ColumnMappings.Add("NASDelay", "NASDelay");
                        sqlBulkCopy.ColumnMappings.Add("SecurityDelay", "SecurityDelay");
                        sqlBulkCopy.ColumnMappings.Add("LateAircraftDelay", "LateAircraftDelay");

                        // Loop through the CSV and load each set of 100,000 records into a DataTable
                        while (!textFieldParser.EndOfData)
                        {
                            if (createdCount == 0)
                            {
                                // Discard the header row on the first pass through the file
                                textFieldParser.ReadFields();
                            }
                            dataTable.Rows.Add(textFieldParser.ReadFields());

                            createdCount++;

                            if (createdCount % _batchSize == 0)
                            {
                                LoadDataTable(sqlBulkCopy, sqlConnection, dataTable);
                            }
                        }

                        // Send a final set to SQL Server
                        LoadDataTable(sqlBulkCopy, sqlConnection, dataTable);

                        sqlConnection.Close();
                    }
                }
            }
            catch (System.Exception ex)
            {
                System.Console.WriteLine("ERROR : "+ex.Message);
            }
        }

        static void Main(string[] args)
        {
            try
            {
                var filePath = "";

                try
                {
                    filePath = args[0];
                }
                catch (System.Exception)
                {
                    filePath = "";
                }


                System.Console.WriteLine("{0}\t- Starting load of {1}", System.DateTime.Now.ToString(), filePath);
                LoadUTF8CSVIntoSQL.LoadCsvDataIntoSqlServer(filePath);
                System.Console.WriteLine("{0}\t- Completed load of {1}", System.DateTime.Now.ToString(), filePath);
            }
            catch (System.Exception ex)
            {
                System.Console.WriteLine("ERROR : "+ex.Message);
            }
        }
   
}
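
The destination table itself isn't shown in the post. A minimal sketch of an airlines table that would accept these values might look like the following; the column types and sizes are my assumptions (everything as varchar, since the DataTable holds strings), not something taken from the original post:

-- Hypothetical destination table matching the DataTable columns above.
-- All columns are varchar for the initial load; tighten the types once the data is verified.
CREATE TABLE dbo.airlines (
    Yr varchar(20) NULL,
    Mth varchar(20) NULL,
    DayofMonth varchar(20) NULL,
    DayOfWeek varchar(20) NULL,
    DepTime varchar(20) NULL,
    CRSDepTime varchar(20) NULL,
    ArrTime varchar(20) NULL,
    CRSArrTime varchar(20) NULL,
    UniqueCarrier varchar(20) NULL,
    FlightNum varchar(20) NULL,
    TailNum varchar(20) NULL,
    ActualElapsedTime varchar(20) NULL,
    CRSElapsedTime varchar(20) NULL,
    AirTime varchar(20) NULL,
    ArrDelay varchar(20) NULL,
    DepDelay varchar(20) NULL,
    Origin varchar(20) NULL,
    Dest varchar(20) NULL,
    Distance varchar(20) NULL,
    TaxiIn varchar(20) NULL,
    TaxiOut varchar(20) NULL,
    Cancelled varchar(20) NULL,
    CancellationCode varchar(20) NULL,
    Diverted varchar(20) NULL,
    CarrierDelay varchar(20) NULL,
    WeatherDelay varchar(20) NULL,
    NASDelay varchar(20) NULL,
    SecurityDelay varchar(20) NULL,
    LateAircraftDelay varchar(20) NULL
);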

Having seen the request by Redgate to review the Tribal SQL book, I leapt at the chance. I love the idea of these books, and having read the MVP Deep Dives books (Book 1 and Book 2) previously, I wanted to see what this one offered.

As with the MVP Deep Dives books, Tribal SQL’s authors have donated their royalties to charity, and in this case they go to Computers For Africa. There are also a lot of well-known names in this book, with Dave Ballantyne, Mark Rasmussen, Bob Pusateri, Stephanie Locke and Matt Velic among them.

Tribal SQL is a good size book, weighing in at 454 pages, with 15 chapters covering various topics relevant to DBAs. Each of the chapters has been written by a previously unpublished author, and all are based on their experiences. There are chapters covering most topics, from Internals, Data Compression, Performance Tuning, Auditing and SQL Injection, through to less technical areas such as Reporting and Database Mail. Additionally, there are sections on personal skills, which include Communication Skills, Project Management and an Introduction to Agile Database Development.

While some of the sections don’t go into as much detail as you’d hope, that isn’t really the point of this book (as I understand it). It covers each of the areas well, leaves you wanting more, and most of the chapters include links to various sources for further reading.

I enjoyed reading this book, and would highly recommend it to others. The chapters I got the most out of were Internals (Mark Rasmussen), Windowing Functions (Dave Ballantyne), particularly timely since I was doing the SQL 2012 Querying exam, and SQL Injection (Kevin Feasel).

My only suggestion is that I’d like to see an eBook/Kindle version, so I can add it to the eBook library on my tablet and use it as a reference book.

You can read more about this book at the book’s website – TribalSQL.com

It’s been an interesting few months for me, things are different, and yet the same.

At the start of the year, I didn’t think I’d run a 10 mile road race (done that), and I’d expected to have completed the SQL 2008 MCM by now (not done that, and not going to).

This week, I’m taking the first of my SQL 2012 Exams, which I’m looking forward to, and because of this (along with no longer needing to focus on 2008), I’ve been able to look deeper at the 2012 / 2014 versions of SQL Server. I’m liking the new features.

However, this post isn’t about SQL (well, maybe indirectly). It’s about why I’m doing it.

The only person who can control what you learn is you. However you do it, you need to grow, gain skills, and move yourself to the next level.

Professionally, I do this by attending conferences and user groups where possible (I’m attending the SQL Relay event in Reading (UK) on 11th November), and I also have a Pluralsight subscription. This is a phenomenal resource, and has given me a huge list of training courses to work my way through.

Personally, I also use a lot of the material from Eric Thomas (motivational speaker) to get my head in the right place. YouTube also has a great wealth of motivational videos that help, and I love the one below. Pushing myself to get to a better state physically, to be able to go from not being able to run half a mile to running 10, has reduced my stress levels, and I feel healthier for it. Obviously, this is a good thing since, in our industry, we spend most of our time sitting, and this is REALLY unhealthy (read this: Sitting is the new Smoking).

Whatever it takes, it’s all about growing yourself; stagnation is not an option.

Over the past couple of years I’ve been deliberately avoiding using any form of third-party add-in for SQL Server Management Studio. I was doing this because I was pursuing the SQL Server Master Certification, and wouldn’t have had access to these tools during the exams. Since I’m no longer actively doing this, I’ve started looking around.

For several years, I’ve been using the DevExpress CodeRush product for development work in C# and Visual Studio, having made the decision between it and ReSharper. That was a difficult choice to make.

Fortunately, the decision for SQL Server is easier; I’m only aware of one such tool, which is Redgate’s SQL Prompt. I had a look at it a year or so back, but decided to uninstall it for the reasons given above.

Having started looking at it again, particularly with the new version, I’m loving the experience.

What I like:

  • Tab History – When you close Management Studio (SSMS), SQL Prompt saves the contents of all your open tabs, and reopens them when you open SSMS again. I do this at least once a day, and it’s stopped me kicking myself. You can also see the history of tabs that have been open, and this has a preview screen (shown below), which is really helpful for all those SQLQuery1.sql files.

[Screenshot: SQL Prompt tab history, with the preview pane]

  • Shortcuts – They are saving me time. You can type a shortcut, and it’ll be replaced with the SQL. My favourite so far is st100, which gets replaced with SELECT TOP 100 * FROM.
  • Improved Intellisense – It gives a more intelligent set of results to help you type your queries. This even goes down to the level of suggesting join criteria.
  • Performance – One of the best things is that it really doesn’t seem to impact the performance of Management Studio.

What I don’t like:

This was a hard one. The only thing that really stopped me using it (aside from the exam thing) was the cost.

Final Thoughts

Would I recommend it? Yes, definitely. It’s improved my SQL, reduced code errors, and allows me to concentrate on what the T-SQL is supposed to do, rather than focussing on what the syntax is.

There are a couple of features I’d like it to have: I’d quite like it to work with MDX, and I’d like it to recommend code changes, to help with implementing best practices.

Disclaimer:

I was offered a free licence of SQL Prompt, by Redgate, for this review. However, the content of the review has been written by myself, and the benefit of the licence is that it gave me time to write the review.

Parallel SQL in C#

So, I’ve been wanting to get back to playing with C# for a while, and finally have had the opportunity.

I’ve also been wanting to play with the Task library in .NET, and to see if I could get it to do something interesting. The result is below.

The code below, running in a .NET 4 project, will run two SQL SELECT statements against the AdventureWorks2012 database.

There are three tasks in here: ParallelTask1, ParallelTask2, and a timing task. Each parallel task takes a connection string and a query as inputs, and passes out a status message. One of the important points is that a task has to be self-contained; this is why the connection is instantiated within the task.

I also added in a Timing task (ParallelTiming) so I could pass out a ping message.

The whole thing is controlled by the code in the main section, which is used to start the three tasks, with their appropriate parameters.

After this, it waits for the tasks to complete, then writes out the resulting return messages.

Try it out; it’s good fun and all you need is SQL Server, AdventureWorks and something to build C# projects.

DISCLAIMER: I am by no means a C# expert… :-)

You can download the code here

Have fun!

/// Parallel_SQL demonstration code
/// From Nick Haslam
/// http://blog.nhaslam.com
/// 16/9/2013

using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Parallel_SQL
{
    class Program
    {
        /// <summary>
        /// First Parallel task
        /// </summary>
        /// <param name="sConnString">Connection string details</param>
        /// <param name="sQuery">Query to execute</param>
        /// <param name="StatusMessage">Status message to pass back</param>
        static Task<string> ParallelTask1(string sConnString, string sQuery, Action<string> StatusMessage)
        {
            return Task.Factory.StartNew(() =>
            {
                SqlConnection conn = new SqlConnection(sConnString);
                conn.Open();

                StatusMessage("Running Query");

                SqlDataReader reader = null;
                SqlCommand sqlCommand = new SqlCommand(sQuery, conn);

                reader = sqlCommand.ExecuteReader();

                while (reader.Read())
                {
                    StatusMessage(reader[0].ToString());
                }

                return "Task 1 Complete";
            });
        }

        /// <summary>
        /// Second Parallel task
        /// </summary>
        /// <param name="sConnString">Connection string details</param>
        /// <param name="sQuery">Query to execute</param>
        /// <param name="StatusMessage">Status message to pass back</param>
        static Task<string> ParallelTask2(string sConnString, string sQuery, Action<string> StatusMessage)
        {
            return Task.Factory.StartNew(() =>
            {
                SqlConnection conn = new SqlConnection(sConnString);
                conn.Open();

                StatusMessage("Running Query");

                SqlDataReader reader = null;
                SqlCommand sqlCommand = new SqlCommand(sQuery, conn);

                reader = sqlCommand.ExecuteReader();

                while (reader.Read())
                {
                    StatusMessage(reader[0].ToString());
                }

                return "Task 2 Complete";
            });
        }

        /// <summary>
        /// Timing Task
        /// </summary>
        /// <param name="iMSPause">Milliseconds between ping</param>
        /// <param name="StatusMessage">Status message to pass back</param>
        static Task<string> ParallelTiming(int iMSPause, Action<string> StatusMessage)
        {
            return Task.Factory.StartNew(() =>
            {
                // Ping ten times, pausing between each, so progress is visible
                // while the two query tasks are running
                for (int i = 0; i < 10; i++)
                {
                    System.Threading.Thread.Sleep(iMSPause);
                    StatusMessage("******************** PING ********************");
                }

                return "Timing task done";
            });
        }

        static void Main(string[] args)
        {
            string sConnString = "server=.; Trusted_Connection=yes; database=AdventureWorks2012;";

            try
            {
                var Task1Control = ParallelTask1(sConnString,
                    "SELECT top 500 TransactionID FROM Production.TransactionHistory",
                    (update) =>
                    {
                        Console.WriteLine(String.Format("{0} - {1}", DateTime.Now, update));
                    });

                var Task2Control = ParallelTask2(sConnString,
                    "SELECT top 500 SalesOrderDetailID FROM sales.SalesOrderDetail",
                    (update) =>
                    {
                        Console.WriteLine(String.Format("{0} - \t\t{1}", DateTime.Now, update));
                    });

                var TimingTaskControl = ParallelTiming(250, (update) =>
                {
                    Console.WriteLine(String.Format("{0} - \t\t\t{1}", DateTime.Now, update));
                });

                // Await completion of the tasks
                Console.WriteLine("Task 1 Status - {0}", Task1Control.Result);
                Console.WriteLine("Task 2 Status - {0}", Task2Control.Result);
                Console.WriteLine("Timing Task Status - {0}", TimingTaskControl.Result);
            }
            catch (Exception e)
            {
                Console.WriteLine(e.ToString());
            }

            Console.ReadKey();
        }
    }
}

On 12th July 2013, this blog passed 30,000 views. Thank you for contributing!!

Although, actually, it’ll be a bit more since it’s not always been on my current host (WordPress).

So, there are various stats available from WordPress on who visited from where, so here are some interesting nuggets…

Where are people viewing from?

Mostly from the US, closely followed by the UK, then India, so actually no surprises there.

What are the top 5 posts?

  • Page Corruption in a SQL Server Database – 2,906 views
  • Creating a Virtual Cluster – 2,848 views
  • SQL Server ODBC on Linux – 2,691 views
  • Loading Reference data from a SharePoint List using SSIS – 2,219 views
  • Fun with TPC-H (part 1) – 2,103 views

How do viewers get here?

The vast majority (over 17,000) made it here from Google, with other methods far behind (Bing 650, Twitter 470).

Other bits…

The busiest month was October 2012, when I had over 2,000 views. That was directly due to me hosting TSQL2sday. I’m averaging 1,500 views a month and have been for the past year, with over 1,000 distinct visitors a month since WordPress started tracking them in December.

Thank you for visiting, and I hope you continue to find this blog helpful.

On Monday 1st July 2013, the UK Police launched a hub for accessing their data. Read the announcement here: http://www.data.gov.uk/blog/the-launch-of-datapoliceuk

You can find the portal here: http://data.police.uk/

The data extracts are here: http://data.police.uk/data/

I wanted to have a look at this and see what can be done with it, but there are dozens of CSV files to load in to get the full view, so this is a quick post on what I’ve been doing this evening. :-)

Having found an interesting post on loading multiple CSV files in T-SQL, I’ve updated it to load these files in. First, though, we need to create a table to take the data.

Create the Crime_Info table

An initial table to receive the data is shown below:

CREATE TABLE [dbo].[Crime_Info]( 
    [Crime ID] [varchar](100) NULL, 
    [Month] [varchar](100) NULL, 
    [Reported by] [varchar](100) NULL, 
    [Falls within] [varchar](100) NULL, 
    [Longitude] [float] NULL, 
    [Latitude] [float] NULL, 
    [Location] [varchar](100) NULL, 
    [LSOA code] [varchar](100) NULL, 
    [LSOA name] [varchar](100) NULL, 
    [Crime type] [varchar](200) NULL, 
    [Last outcome category] [varchar](max) NULL, 
    [Context] [varchar](max) NULL 
) ON [PRIMARY] WITH (DATA_COMPRESSION = PAGE )

Note that I page-compressed the table, as I wanted to reduce the space it took up; it went down from 3GB to around 1.5GB.
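
If you want to check the space used and the effect of compression yourself, something along these lines should do it. This is a quick sketch using standard SQL Server procedures; treat the output as indicative, and note that data compression needs an edition that supports it:

-- Space currently used by the table
EXEC sp_spaceused 'dbo.Crime_Info';

-- Estimate the saving from PAGE compression before (or after) applying it
EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'Crime_Info',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';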

Load the Crime files

I’ve modified the bulk load script mentioned above, and you can download it here:

https://dl.dropboxusercontent.com/u/2765900/Load_UK_Crime_Data.sql
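
In case that link disappears, the general shape of the script is below: list the extracted CSV files with xp_dirtree, then BULK INSERT each one into Crime_Info. This is a simplified sketch rather than the original script; the folder path is made up, and it assumes each file has a single header row and no quoted commas:

DECLARE @folder varchar(260) = 'C:\Inbox\PoliceData\extract\';  -- assumed path to the extracted CSVs
DECLARE @file varchar(260),
        @sql  nvarchar(max);

-- Get the list of files in the folder (depth 1, include files)
DECLARE @files TABLE (subdirectory varchar(260), depth int, isfile bit);
INSERT INTO @files
EXEC master.sys.xp_dirtree @folder, 1, 1;

DECLARE file_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT subdirectory
    FROM @files
    WHERE isfile = 1 AND subdirectory LIKE '%.csv';

OPEN file_cursor;
FETCH NEXT FROM file_cursor INTO @file;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Build and run a BULK INSERT for each file, skipping the header row
    SET @sql = N'BULK INSERT dbo.Crime_Info FROM ''' + @folder + @file + '''
                 WITH (FIRSTROW = 2, FIELDTERMINATOR = '','', ROWTERMINATOR = ''\n'');';
    EXEC sys.sp_executesql @sql;

    FETCH NEXT FROM file_cursor INTO @file;
END

CLOSE file_cursor;
DEALLOCATE file_cursor;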

Once loaded, I had 15,668,549 rows of data, sitting in a 1.2GB database.
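
Before going anywhere near PowerPivot, a quick aggregate is a useful sanity check, and gives the kind of shape that pivots nicely. For example, counts by crime type and month (column names as per the table above):

SELECT [Crime type],
       [Month],
       COUNT(*) AS Crimes
FROM dbo.Crime_Info
GROUP BY [Crime type], [Month]
ORDER BY [Crime type], [Month];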

Then, by playing with PowerPivot you can create ‘fun’ visualisations like this.

[PowerPivot visualisation of the crime data]

Have fun, and as they say on Crimewatch, “don’t have nightmares, do sleep well”.

So, about 10 months ago, I did a blog article about my experiences moving from a Blackberry to an iPhone 4s (Read it here).

Well, I’ve moved off it now. It was a combination of reasons, including:

  1. I can’t write stuff for it. I’m starting to do more programming at the moment, and it’s a nice idea to be able to write stuff for the phone I’m using. I can’t do this on an iPhone.
  2. The screen is small. I’ve also got an iPad 2 and still happily use that. However, the screen on the iPhone is small compared with a lot of newer smartphones.
  3. Magpie syndrome. I like shiny things; while my Blackberry 9700 kept me happy for 2+ years, the Blackberry 9900 didn’t, and it turns out that the iPhone hasn’t either.
  4. Using the iPhone feels like I’m in a glass box, with all the fun things being outside. I know this sounds ridiculous, but I’m not sure how else to explain it.

So, what have I moved to? A very kind person has let me have their previous phone, a Galaxy Nexus. This is lovely to use, and easily covers the items above. In addition, it lets me:

  1. Use almost all the apps I was using on the iPhone (Evernote, RunKeeper, Pluralsight, Netflix, Flipboard and iPlayer).
  2. Discover some new apps, including Falcon Pro (a really nice Twitter client) and Ingress. Ingress is really interesting, and there doesn’t appear to be anything like it on the iPhone.

I’m not saying that I’m not using the iPhone any more. I am: I’m using it as a GPS device for running. It’s great that the RunKeeper app is available on both platforms, and the iPhone is smaller, so it fits nicely in my pocket when out and about.

I’ve also started a separate blog (separation of anxieties, or something), RunningNick.com, which will be covering my training for the Great South Run, which I’m doing in October (10 MILES!!!). If you’d like to sponsor me, there is a link on the right; funds go to Diabetes UK.

This is the third article I’ve done on the TPC-H benchmark, and part 2 on AWS RedShift. Read the first article here, and the first part on AWS Redshift here.

Previously, I covered loading data into a database in RedShift, and discovered (not surprisingly) that the performance of SQL Server on a laptop is pretty much the same as that of RedShift for a 1GB dataset. No great surprises there.

This time, I’m using a 100GB data set. Having spent quite a while generating the dataset, and then getting it uploaded to S3, I’m now in a position to start the loads.

GZIP is King!

One of the great features of RedShift is that it’ll load GZIP’d datasets directly into the database; you just need to add the gzip parameter at the end. So the load statements below are the same as before, except that they now have gzip at the end, and I’m loading from a gzip file rather than the straight text version.

copy customer from 's3://oldnick-tpch/customer.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy orders from 's3://oldnick-tpch/orders.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy lineitem from 's3://oldnick-tpch/lineitem.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy nation from 's3://oldnick-tpch/nation.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy part from 's3://oldnick-tpch/part.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy partsupp from 's3://oldnick-tpch/partsupp.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy region from 's3://oldnick-tpch/region.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy supplier from 's3://oldnick-tpch/supplier.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;

I quite like the AWS interface for loading the data, so while the loads above are running, I can go into the AWS RedShift management console and see the progress of the loads:

[Screenshot: load progress in the Redshift management console]

I can also see various performance metrics while the jobs are running. Shown below are some of the more interesting ones. It’s particularly interesting that the CPU Utilisation is pegged at 100% while the load is running. I’m guessing that this is due to the loads being GZIP’d, so there’ll be an overhead of decompression in there, aside from the overhead of the load itself.

[Redshift performance metric charts during the load, including CPU utilisation and read/write IOPS]

The Results

Having then run the same query, I found it took longer to run (as expected).

100GB Dataset – Time to Return
  • Redshift (1 node cluster), cold: 3 min 11 sec
  • Redshift (1 node cluster), warm: 2 min 47 sec

So, with 100 times the data, the time to execute is slightly over 100 times longer. However, given that the CPU and IOPS were maxed out, the Redshift environment probably needs to be a higher spec for a 100GB dataset.

Comments are welcome as I’m aware that this is a specific test, and should not be taken as a rounded evaluation of Redshift.

So, this is the second article I’ve written against the TPC-H Benchmark (Part one here). Recently, Amazon announced that their ‘fast, fully managed petabyte-scale data warehouse service’ was available for public consumption. Having finally had some time to play, I thought I’d take it for a spin.

I was able to get a single node cluster up and running pretty quickly, and installed their sample data set easily. You can read how to go about this in their Getting Started Guide.

The initial issue I had with the sample data set was, well, it was pretty small. Ok, it got the concepts over, but I wanted more. I wanted to get an idea of performance and how it compared across the different levels. I wanted more data.

So, I decided to dump my set of test data (1GB TPC-H, see part 1 for creating this) into it, and covered here is how I did it.

Getting Started

I’m going to assume that you’ve made it through steps 1-4 of the Getting Started guide above (which cover Prerequisites, Launching the Cluster, Security setup and Connecting to the cluster).

Shown below are the statements used to create the TPC-H tables, within the Redshift environment. You’ll need to create a connection to the Redshift environment, use SQL Workbench to connect to it, and copy and paste this into the SQL window.

CREATE TABLE customer(
C_CustKey int ,
C_Name varchar(64) ,
C_Address varchar(64) ,
C_NationKey int ,
C_Phone varchar(64) ,
C_AcctBal decimal(13, 2) ,
C_MktSegment varchar(64) ,
C_Comment varchar(120) ,
skip varchar(64)
);

CREATE TABLE lineitem(
L_OrderKey int ,
L_PartKey int ,
L_SuppKey int ,
L_LineNumber int ,
L_Quantity int ,
L_ExtendedPrice decimal(13, 2) ,
L_Discount decimal(13, 2) ,
L_Tax decimal(13, 2) ,
L_ReturnFlag varchar(64) ,
L_LineStatus varchar(64) ,
L_ShipDate datetime ,
L_CommitDate datetime ,
L_ReceiptDate datetime ,
L_ShipInstruct varchar(64) ,
L_ShipMode varchar(64) ,
L_Comment varchar(64) ,
skip varchar(64)
);
CREATE TABLE nation(
N_NationKey int ,
N_Name varchar(64) ,
N_RegionKey int ,
N_Comment varchar(160) ,
skip varchar(64)
);
CREATE TABLE orders(
O_OrderKey int ,
O_CustKey int ,
O_OrderStatus varchar(64) ,
O_TotalPrice decimal(13, 2) ,
O_OrderDate datetime ,
O_OrderPriority varchar(15) ,
O_Clerk varchar(64) ,
O_ShipPriority int ,
O_Comment varchar(80) ,
skip varchar(64)
);

CREATE TABLE part(
P_PartKey int ,
P_Name varchar(64) ,
P_Mfgr varchar(64) ,
P_Brand varchar(64) ,
P_Type varchar(64) ,
P_Size int ,
P_Container varchar(64) ,
P_RetailPrice decimal(13, 2) ,
P_Comment varchar(64) ,
skip varchar(64)
);
CREATE TABLE partsupp(
PS_PartKey int ,
PS_SuppKey int ,
PS_AvailQty int ,
PS_SupplyCost decimal(13, 2) ,
PS_Comment varchar(200) ,
skip varchar(64)
);
CREATE TABLE region(
R_RegionKey int ,
R_Name varchar(64) ,
R_Comment varchar(160) ,
skip varchar(64)
);
CREATE TABLE supplier(
S_SuppKey int ,
S_Name varchar(64) ,
S_Address varchar(64) ,
S_NationKey int ,
S_Phone varchar(18) ,
S_AcctBal decimal(13, 2) ,
S_Comment varchar(105) ,
skip varchar(64)
);

Next up, we need to get some data into it. I’ve had a copy of the TPC-H files sitting on my S3 account for a while, so I was hoping to just point Redshift at that (just like the sample code does). This was where I ran into my first issue. There may be an easier way, but I wanted to do it quickly. The problem was that I couldn’t get the S3 URL syntax to work, and this appears to be because my S3 Buckets are sitting in Ireland (EU). The S3 syntax looks to only work if you are using ‘US Standard’ as your S3 storage. I could be wrong, but I’m not an S3 expert.

Anyway, having created an S3 bucket in US Standard, and transferred the files over, I used the following to copy the contents from these files into the tables created in Redshift.

copy customer from 's3://oldnick-tpch/customer.tbl' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|';
copy orders from 's3://oldnick-tpch/orders.tbl' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|';
copy lineitem from 's3://oldnick-tpch/lineitem.tbl' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|';
copy nation from 's3://oldnick-tpch/nation.tbl' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|';
copy part from 's3://oldnick-tpch/part.tbl' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|';
copy partsupp from 's3://oldnick-tpch/partsupp.tbl' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|';
copy region from 's3://oldnick-tpch/region.tbl' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|';
copy supplier from 's3://oldnick-tpch/supplier.tbl' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|';

You’ll need to replace <Your-Access-Key-ID> with your Amazon access key and <Your-Secret-Access-Key> with your secret key, though I bet you’d guessed that. Also, note that it’s possible to load from a gzipped file by adding the gzip parameter to the copy statement, though I didn’t discover this till after the load.

After waiting a little while, though not too long, for Redshift to bring the data in from S3, you can use these queries to check the counts.

select count(*) from customer;
select count(*) from orders;
select count(*) from lineitem;
select count(*) from nation;
select count(*) from part;
select count(*) from partsupp;
select count(*) from region;
select count(*) from supplier;

Next, the Developer Guide section covering loading data into Redshift says you should run the following statements after loading. Analyze updates the database statistics, and Vacuum then reclaims storage space.

analyze;
vacuum;

So, there we go: we’ve now got a Redshift cluster with the TPC-H tables loaded. Next, I thought I’d do a basic test to compare results.

My test query for this is shown below, and just does some aggregation against the lineitem table (6 million or so rows).

select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price, sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge, avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc,  count(*) as count_order
from lineitem
group by l_returnflag, l_linestatus
order by l_returnflag,l_linestatus;

So I ran this on my laptop (i7, 12GB RAM, 512GB SSD) a couple of times, once as a straight query and once with a columnstore index on it, cold (after a restart) and warm (second run).

SQL times are shown based on SET STATISTICS TIME ON times.
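
For reference, the laptop side of the test looked roughly like this: a nonclustered columnstore index on lineitem (SQL Server 2012 syntax), with SET STATISTICS TIME ON around the query. The index definition is my sketch covering the columns the query touches, not anything prescribed by TPC-H:

-- Nonclustered columnstore index over the columns used by the test query
-- (in SQL Server 2012 this makes the table read-only until the index is dropped)
CREATE NONCLUSTERED COLUMNSTORE INDEX ix_lineitem_cs
ON dbo.lineitem (l_returnflag, l_linestatus, l_quantity,
                 l_extendedprice, l_discount, l_tax);

SET STATISTICS TIME ON;
-- ... run the aggregation query above ...
SET STATISTICS TIME OFF;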

Analysing Redshift was interesting. Since I’ve not done much with PostgreSQL, I had a look through the Redshift documentation to see what was going on. I found an interesting page showing how to determine if a query is running from disk. Working through this, I saw that once I had the query id (from the first query below), I could get the query details, including memory used and times.

Getting the Query Id

select query, elapsed, substring
from svl_qlog
order by query
desc limit 5;

select *
from svl_query_summary
where query = 5931
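
To focus on just the steps that spilled to disk, the same view can be filtered on its is_diskbased flag. A minor variation on the query above:

select query, step, label, workmem, is_diskbased
from svl_query_summary
where query = 5931
  and is_diskbased = 't';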

[Query summary output from svl_query_summary]

So, having seen those figures, I had a look at the cluster details.

Initially I was using 1 node, so I went up a notch, to a 2 node cluster of the more powerful nodes.

Single Node Testing
[Screenshot: single-node cluster details]
Multi Node Testing
[Screenshot: two-node cluster details]

The Results

Time to Return
  • Laptop – SQL 2012 (cold): 24,515 ms CPU time, 6,475 ms elapsed
  • Laptop – SQL 2012 (warm): 24,016 ms CPU time, 6,060 ms elapsed
  • Laptop – SQL 2012 Columnstore (cold): 531 ms CPU time, 258 ms elapsed
  • Laptop – SQL 2012 Columnstore (warm): 389 ms CPU time, 112 ms elapsed
  • Redshift (1 node cluster): 1.24 sec
  • Redshift (2 node cluster): 1.4 sec

So, obviously, I’m not stretching the performance of the Redshift cluster.

Part 2b of this will cover similar tests, though I’ll be doing it with a 100GB TPC-H test data set.

Keep ‘em peeled for the next post!
