
So it's T-SQL Tuesday once more, hosted this month by Jen McCown. Thanks Jen!
The topic is around tips, tactics and strategies for managing an Enterprise…

Well, having been in my current role for nearly a year now, a role which involves managing a number of SQL Servers, it's a great opportunity to share what helps me.

1 – SQL Prompt

This is an add-in for SSMS produced by Redgate. While it does a wide range of awesome things, one of the main things that helps me, from an enterprise perspective, is the ability to colour-code tabs. This means it's really easy to see which environment I'm running code against. For me, Green is dev, Purple is staging (and not quite live), and Red is Production (or requires caution!).

SQLPrompt_Tabs

The other really nice feature is that when you close SSMS down (which occasionally happens accidentally, or when Windows Update does its thing, because why would you deliberately close it!), SQL Prompt remembers all the tabs you had open, and reopens them (and reconnects them) on restart.

For me, this is great as it means I can concentrate on the actual work, rather than waste time trying to think of a filename for some random query that may be of use at some point…

SSMS_Registered_Servers

2 – Registered Servers

This is a really nice feature that allows me to run the same query on multiple servers. One of the main uses of this is to get a 'snapshot' of what is running on the servers, with sp_whoisactive. Having a query connected to one of these groups runs it against each member server, and adds a column to the start of the result set showing the server name.
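For reference, this is roughly the sort of snapshot query I point at a server group – a minimal sketch, assuming sp_whoisactive is already installed on each member instance (the optional parameters shown are just ones I find useful):

-- Run against a Registered Servers group: SSMS executes this on every member
-- server and prepends a Server Name column to the combined result set.
-- Assumes sp_whoisactive is installed on each instance.
EXEC dbo.sp_WhoIsActive
    @get_plans = 1,            -- include the query plan for each active request
    @find_block_leaders = 1;   -- flag the sessions at the head of any blocking chains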

3 – Extended Events

I've had the chance to get to grips more with this over the past few months, and I'm really starting to find lots of great uses for it. One of the really helpful (enterprise-related) features I'm using it for is monitoring execution times for a specific stored procedure. This is just one SP, but it's called thousands of times an hour. Using Extended Events for this allows tracking of those times with next to no overhead.
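As a rough sketch of the idea (the session name, target file path and procedure name below are placeholders, not the real ones I use), an event session on module_end, filtered to the one procedure, captures its duration with minimal overhead:

-- Minimal sketch: capture completion events (including duration) for one stored procedure.
-- Session name, target file path and procedure name are placeholders.
CREATE EVENT SESSION [ProcDuration] ON SERVER
ADD EVENT sqlserver.module_end
(
    ACTION (sqlserver.client_app_name, sqlserver.database_name)
    WHERE (object_name = N'usp_MyBusyProc')
)
ADD TARGET package0.event_file
(
    SET filename = N'C:\XE\ProcDuration.xel'
)
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS, STARTUP_STATE = ON);

ALTER EVENT SESSION [ProcDuration] ON SERVER STATE = START;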

I'm in the process of writing a separate post on this, so keep 'em peeled.

Thanks again to Jen for hosting.

Recently, I've been learning about the statistical language R. To allow me to do some testing against it, I wanted to get a decent-sized dataset into SQL Server.

Having found a good dataset, the Airlines data from the 2009 Data Expo, I downloaded it and proceeded to try to load it into SQL Server. I failed almost straight away, since the downloaded files were UTF-8 encoded, and SQL Server couldn't load them directly using bulk copy or BCP.

So, I looked around for another option. I could have used SSIS; however, there seemed to be an issue with the Data Conversion component, where it wasn't saving the changes after converting the fields from UTF-8 to ASCII.

My next choice was to have a look to see if C# would help. It did. The code below allowed me to take the source file (specified in _filePath) and load it into a database (the connection is specified in _connString); the target table is the airlines table.

The code works by creating a DataTable into which the data from the CSV file is loaded. The data is then fed into SQL Server in batches of 100,000 rows (parameterised using _batchSize).

Here is the code. It performs pretty well on my machine, loading 20-25k rows a second.

using Microsoft.VisualBasic.FileIO;
using System.Data;
using System.Data.SqlClient;

    public class LoadUTF8CSVIntoSQL
    {
        protected const string _filePath = @"C:\Inbox\DataSets\Flights\extract\1987.csv";
        protected const string _connString = @"Server=localhost;Database=Datasets;Trusted_Connection=True;";
        protected const int _batchSize = 100000;

        protected static void LoadDataTable(SqlBulkCopy sqlBulkCopy, SqlConnection sqlConnection, DataTable dataTable)
        {
            try
            {
                sqlBulkCopy.WriteToServer(dataTable);

                dataTable.Rows.Clear();
                System.Console.WriteLine("{0}\t- Batch Written",System.DateTime.Now.ToString());
            }
            catch (System.Exception ex)
            {
                System.Console.WriteLine("ERROR : "+ex.Message);
            }
        }

        public static void LoadCsvDataIntoSqlServer(string sFilePath)
        {
            try
            {
                // This should be the full path
                var fileName = "";
                if (sFilePath.Length == 0)
                {
                    fileName = _filePath;
                }
                else
                {
                    fileName = sFilePath;
                }

                var createdCount = 0;

                using (var textFieldParser = new TextFieldParser(fileName))
                {
                    textFieldParser.TextFieldType = FieldType.Delimited;
                    textFieldParser.Delimiters = new[] { "," };
                    textFieldParser.HasFieldsEnclosedInQuotes = false;

                    var connectionString = _connString;

                    var dataTable = new DataTable("airlines");

                    // Specify the columns in the Data Table
                    dataTable.Columns.Add("Yr");
                    dataTable.Columns.Add("Mth");
                    dataTable.Columns.Add("DayofMonth");
                    dataTable.Columns.Add("DayOfWeek");
                    dataTable.Columns.Add("DepTime");
                    dataTable.Columns.Add("CRSDepTime");
                    dataTable.Columns.Add("ArrTime");
                    dataTable.Columns.Add("CRSArrTime");
                    dataTable.Columns.Add("UniqueCarrier");
                    dataTable.Columns.Add("FlightNum");
                    dataTable.Columns.Add("TailNum");
                    dataTable.Columns.Add("ActualElapsedTime");
                    dataTable.Columns.Add("CRSElapsedTime");
                    dataTable.Columns.Add("AirTime");
                    dataTable.Columns.Add("ArrDelay");
                    dataTable.Columns.Add("DepDelay");
                    dataTable.Columns.Add("Origin");
                    dataTable.Columns.Add("Dest");
                    dataTable.Columns.Add("Distance");
                    dataTable.Columns.Add("TaxiIn");
                    dataTable.Columns.Add("TaxiOut");
                    dataTable.Columns.Add("Cancelled");
                    dataTable.Columns.Add("CancellationCode");
                    dataTable.Columns.Add("Diverted");
                    dataTable.Columns.Add("CarrierDelay");
                    dataTable.Columns.Add("WeatherDelay");
                    dataTable.Columns.Add("NASDelay");
                    dataTable.Columns.Add("SecurityDelay");
                    dataTable.Columns.Add("LateAircraftDelay");

                    using (var sqlConnection = new SqlConnection(connectionString))
                    {
                        sqlConnection.Open();

                        // Initialise Bulk Copy Object
                        var sqlBulkCopy = new SqlBulkCopy(sqlConnection)
                        {
                            DestinationTableName = "airlines"
                        };

                        // Define column mappings between Data Table and Target Table
                        sqlBulkCopy.ColumnMappings.Add("Yr", "Yr");
                        sqlBulkCopy.ColumnMappings.Add("Mth", "Mth");
                        sqlBulkCopy.ColumnMappings.Add("DayofMonth", "DayofMonth");
                        sqlBulkCopy.ColumnMappings.Add("DayOfWeek", "DayOfWeek");
                        sqlBulkCopy.ColumnMappings.Add("DepTime", "DepTime");
                        sqlBulkCopy.ColumnMappings.Add("CRSDepTime", "CRSDepTime");
                        sqlBulkCopy.ColumnMappings.Add("ArrTime", "ArrTime");
                        sqlBulkCopy.ColumnMappings.Add("CRSArrTime", "CRSArrTime");
                        sqlBulkCopy.ColumnMappings.Add("UniqueCarrier", "UniqueCarrier");
                        sqlBulkCopy.ColumnMappings.Add("FlightNum", "FlightNum");
                        sqlBulkCopy.ColumnMappings.Add("TailNum", "TailNum");
                        sqlBulkCopy.ColumnMappings.Add("ActualElapsedTime", "ActualElapsedTime");
                        sqlBulkCopy.ColumnMappings.Add("CRSElapsedTime", "CRSElapsedTime");
                        sqlBulkCopy.ColumnMappings.Add("AirTime", "AirTime");
                        sqlBulkCopy.ColumnMappings.Add("ArrDelay", "ArrDelay");
                        sqlBulkCopy.ColumnMappings.Add("DepDelay", "DepDelay");
                        sqlBulkCopy.ColumnMappings.Add("Origin", "Origin");
                        sqlBulkCopy.ColumnMappings.Add("Dest", "Dest");
                        sqlBulkCopy.ColumnMappings.Add("Distance", "Distance");
                        sqlBulkCopy.ColumnMappings.Add("TaxiIn", "TaxiIn");
                        sqlBulkCopy.ColumnMappings.Add("TaxiOut", "TaxiOut");
                        sqlBulkCopy.ColumnMappings.Add("Cancelled", "Cancelled");
                        sqlBulkCopy.ColumnMappings.Add("CancellationCode", "CancellationCode");
                        sqlBulkCopy.ColumnMappings.Add("Diverted", "Diverted");
                        sqlBulkCopy.ColumnMappings.Add("CarrierDelay", "CarrierDelay");
                        sqlBulkCopy.ColumnMappings.Add("WeatherDelay", "WeatherDelay");
                        sqlBulkCopy.ColumnMappings.Add("NASDelay", "NASDelay");
                        sqlBulkCopy.ColumnMappings.Add("SecurityDelay", "SecurityDelay");
                        sqlBulkCopy.ColumnMappings.Add("LateAircraftDelay", "LateAircraftDelay");

                        // Loop through the CSV, loading each batch of _batchSize (100,000) rows into the DataTable
                        while (!textFieldParser.EndOfData)
                        {
                            if (createdCount == 0)
                            {
                                // Skip the header row on the first pass
                                textFieldParser.ReadFields();
                            }
                            dataTable.Rows.Add(textFieldParser.ReadFields());

                            createdCount++;

                            if (createdCount % _batchSize == 0)
                            {
                                LoadDataTable(sqlBulkCopy, sqlConnection, dataTable);
                            }
                        }

                        // Send the final (partial) batch to SQL Server
                        LoadDataTable(sqlBulkCopy, sqlConnection, dataTable);

                        sqlConnection.Close();
                    }
                }
            }
            catch (System.Exception ex)
            {
                System.Console.WriteLine("ERROR : "+ex.Message);
            }
        }

        static void Main(string[] args)
        {
            try
            {
                var filePath = "";

                try
                {
                    filePath = args[0];
                }
                catch (System.Exception)
                {
                    filePath = "";
                }


                System.Console.WriteLine("{0}\t- Starting load of {1}", System.DateTime.Now.ToString(), filePath);
                LoadUTF8CSVIntoSQL.LoadCsvDataIntoSqlServer(filePath);
                System.Console.WriteLine("{0}\t- Completed load of {1}", System.DateTime.Now.ToString(), filePath);
            }
            catch (System.Exception ex)
            {
                System.Console.WriteLine("ERROR : "+ex.Message);
            }
        }
   
}

Having seen the request by Redgate to review the Tribal SQL book, I leapt at the chance. I love the idea of these books and, having read the MVP Deep Dives books (Book 1 and Book 2) previously, I wanted to see what this one offered.

As with the MVP Deep Dives books, Tribal SQL's authors have donated their royalties to charity, in this case to Computers For Africa. There are also a lot of well-known names in this book, with Dave Ballantyne, Mark Rasmussen, Bob Pusateri, Stephanie Locke and Matt Velic among them.

Tribal SQL is a good-sized book, weighing in at 454 pages, with 15 chapters covering various topics relevant to DBAs. Each of the chapters has been written by a previously unpublished author, and all are based on their experiences. There are chapters covering most topics, from Internals, Data Compression, Performance Tuning, Auditing and SQL Injection, through to less technical areas such as Reporting and Database Mail. Additionally, there are sections on personal skills, which include Communication Skills, Project Management and an Introduction to Agile Database Development.

While some of the sections don't go into as much detail as you'd hope, that isn't the point of this book (as I understand it). It covers each of the areas well, leaves you wanting more, and most of the chapters include links to various sources for further reading.

I enjoyed reading this book, and would highly recommend it to others. The chapters I got the most out of were Internals (Mark Rasmussen), Windowing Functions (Dave Ballantyne), which was particularly timely since I was doing the SQL 2012 Querying exam, and SQL Injection (Kevin Feasel).

My only suggestion is that I'd like to see an eBook/Kindle version, so I can keep it in the eBook library on my tablet and use it as a reference book.

You can read more about this book at the book's website, TribalSQL.com.

It’s been an interesting few months for me, things are different, and yet the same.

At the start of the year, I didn’t think I’d run a 10 mile road race (done that), and I’d expected to have completed the SQL 2008 MCM by now (not done that, and not going to).

This week, I’m taking the first of my SQL 2012 Exams, which I’m looking forward to, and because of this (along with no longer needing to focus on 2008), I’ve been able to look deeper at the 2012 / 2014 versions of SQL Server. I’m liking the new features.

However, this post isn’t about SQL (well, maybe indirectly). It’s about why I’m doing it.

The only person who can control what you learn is you. However you do it, you need to grow, gain skills, and move yourself to the next level.

Professionally, I do this by attending conferences and user groups where possible (I’m attending the SQL Relay event in Reading (UK) on 11th November), and also have a Pluralsight Subscription. This is a phenomenal resource, and has given me a huge list of training courses to work my way through.

Personally, I also use a lot of the material from Eric Thomas (the motivational speaker) to get my head in the right place. YouTube also has a great wealth of motivational videos that help, and I love the one below. Pushing myself to get into better shape physically, going from not being able to run half a mile to running 10, has reduced my stress levels, and I feel healthier for it. Obviously, this is a good thing since, in our industry, we spend most of the time sitting, and this is REALLY unhealthy (read this: Sitting is the new Smoking).

Whatever it takes, it’s all about growing yourself, stagnation is not an option.

image

Over the past couple of years I've been deliberately avoiding any form of third-party add-in for SQL Server Management Studio. I did this because I was pursuing the SQL Server Master certification, and wouldn't have had access to these tools during the exams. Since I'm no longer actively doing that, I've started looking around.

For several years, I've been using the DevExpress CodeRush product for development work in C# and Visual Studio, having made the decision between it and ReSharper. That was a difficult choice to make.

Fortunately, the decision for SQL Server is easier; I’m only aware of one, which is Redgate’s SQL Prompt. I had a look at this a year or so back, but decided to uninstall it due to the reasons given above.

Having started looking at it again, particularly with the new version, I'm loving the experience.

What I like:

  • Tab History – When you close Management Studio (SSMS), SQL Prompt saves the contents of all your open tabs, and reopens them when you open SSMS again. I do this at least once a day, and it's stopped me kicking myself. You can also see the history of tabs that have been open, and this history has a preview screen (shown below), which is really helpful for all those SQLQuery1.sql files.

image

  • Shortcuts – They save me time. You can type a shortcut, and it'll be replaced with the SQL. My favourite so far is st100, which gets replaced with SELECT TOP 100 * FROM.
  • Improved IntelliSense – It gives a more intelligent set of suggestions to help you type your queries, even down to the level of suggesting join criteria.
  • Performance – One of the best features is that it really doesn't seem to impact the performance of Management Studio.

What I don’t like:

This was a hard one. The only thing that has really stopped me using it (aside from the exam thing) is the cost.

Final Thoughts

Would I recommend it? Yes, definitely. It's improved my SQL, reduced code errors, and allows me to concentrate on what the T-SQL is supposed to do, rather than focussing on what the syntax is.

There are a couple of features I'd like it to have: I'd quite like it to work with MDX as well, and I'd like it to recommend code changes, to help with implementing best practices.

Disclaimer:

I was offered a free licence of SQL Prompt, by Redgate, for this review. However, the content of the review has been written by myself, and the benefit of the licence is that it gave me time to write the review.

Parallel SQL in C#

So, I’ve been wanting to get back to playing with C# for a while, and finally have had the opportunity.

I've also been wanting to play with the Task library in .NET and see if I could get it to do something interesting; well, below is the result.

The code below, running in a .NET 4 project, will run two SQL SELECT statements against the AdventureWorks2012 database.

There are three tasks in here: ParallelTask1, ParallelTask2, and a timing task. Each parallel task takes a connection string and a query as inputs, and passes out status messages. One of the important points is that a task has to be self-contained, which is why the connection is instantiated within the task.

I also added in a Timing task (ParallelTiming) so I could pass out a ping message.

The whole thing is controlled by the code in the main section, which is used to start the three tasks, with their appropriate parameters.

After this, it waits for the tasks to complete, then writes out the resulting return messages.

Try it out; it’s good fun and all you need is SQL Server, AdventureWorks and something to build C# projects.

DISCLAIMER: I am by no means a C# expert… 🙂

You can download the code here

Have fun!

/// Parallel_SQL demonstration code
/// From Nick Haslam
/// http://blog.nhaslam.com
/// 16/9/2013

using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Parallel_SQL
{
    class Program
    {
        /// <summary>
        /// First Parallel task
        /// </summary>
        /// <param name="sConnString">Connection string details</param>
        /// <param name="sQuery">Query to execute</param>
        /// <param name="StatusMessage">Status message to pass back</param>
        /// <returns>Completion message</returns>
        static Task<string> ParallelTask1(string sConnString, string sQuery, Action<string> StatusMessage)
        {
            return Task.Factory.StartNew(() =>
            {
                SqlConnection conn = new SqlConnection(sConnString);
                conn.Open();

                StatusMessage("Running Query");

                SqlDataReader reader = null;
                SqlCommand sqlCommand = new SqlCommand(sQuery, conn);

                reader = sqlCommand.ExecuteReader();

                while (reader.Read())
                {
                    StatusMessage(reader[0].ToString());
                }

                return "Task 1 Complete";
            });
        }

        /// <summary>
        /// Second Parallel task
        /// </summary>
        /// <param name="sConnString">Connection string details</param>
        /// <param name="sQuery">Query to execute</param>
        /// <param name="StatusMessage">Status message to pass back</param>
        /// <returns>Completion message</returns>
        static Task<string> ParallelTask2(string sConnString, string sQuery, Action<string> StatusMessage)
        {
            return Task.Factory.StartNew(() =>
            {
                SqlConnection conn = new SqlConnection(sConnString);
                conn.Open();
                StatusMessage("Running Query");

                SqlDataReader reader = null;
                SqlCommand sqlCommand = new SqlCommand(sQuery, conn);

                reader = sqlCommand.ExecuteReader();

                while (reader.Read())
                {
                    StatusMessage(reader[0].ToString());
                }

                return "Task 2 Complete";
            });
        }

        /// <summary>
        /// Timing Task
        /// </summary>
        /// <param name="iMSPause">Milliseconds between pings</param>
        /// <param name="StatusMessage">Status message to pass back</param>
        /// <returns>Completion message</returns>
        static Task<string> ParallelTiming(int iMSPause, Action<string> StatusMessage)
        {
            return Task.Factory.StartNew(() =>
            {
                for (int i = 0; i < 10; i++)
                {
                    System.Threading.Thread.Sleep(iMSPause);
                    StatusMessage("******************** PING ********************");
                }

                return "Timing task done";
            });
        }

        static void Main(string[] args)
        {
            string sConnString = "server=.; Trusted_Connection=yes; database=AdventureWorks2012;";

            try
            {
                var Task1Control = ParallelTask1(sConnString,
                    "SELECT top 500 TransactionID FROM Production.TransactionHistory", (update) =>
                    {
                        Console.WriteLine(String.Format("{0} - {1}", DateTime.Now, update));
                    });

                var Task2Control = ParallelTask2(sConnString,
                    "SELECT top 500 SalesOrderDetailID FROM sales.SalesOrderDetail", (update) =>
                    {
                        Console.WriteLine(String.Format("{0} - \t\t{1}", DateTime.Now, update));
                    });

                var TimingTaskControl = ParallelTiming(250, (update) =>
                {
                    Console.WriteLine(String.Format("{0} - \t\t\t{1}", DateTime.Now, update));
                });

                // Await completion of the tasks and report their results
                Console.WriteLine("Task 1 Status - {0}", Task1Control.Result);
                Console.WriteLine("Task 2 Status - {0}", Task2Control.Result);
                Console.WriteLine("Timing Task Status - {0}", TimingTaskControl.Result);
            }
            catch (Exception e)
            {
                Console.WriteLine(e.ToString());
            }
            Console.ReadKey();
        }
    }
}

On 12th July 2013, this blog passed 30,000 views. Thank you for contributing!!

Although, actually, it’ll be a bit more since it’s not always been on my current host (WordPress).

So, there are various stats available from WordPress on who visited from where, so here are some interesting nuggets…

Where are people viewing from?

image

Mostly from the US, closely followed by the UK, then India, so actually no surprises there.

What are the top 5 posts?

Blog Post – Views
Page Corruption in a SQL Server Database – 2,906
Creating a Virtual Cluster – 2,848
SQL Server ODBC on Linux – 2,691
Loading Reference data from a SharePoint List using SSIS – 2,219
Fun with TPC-H (part 1) – 2,103

How do viewers get here?

The vast majority (over 17,000) made it here from Google, with other methods far behind (Bing 650, Twitter 470).

Other bits…

The busiest month was October 2012, when I had over 2,000 views. That was directly due to me hosting TSQL2sday. I’m averaging 1,500 views a month and have been for the past year, with over 1,000 distinct visitors a month since WordPress started tracking them in December.

Thank you for visiting, and I hope you continue to find this blog helpful.

On Monday 1st July 2013, the UK Police launched a hub for accessing their data. Read the announcement here: http://www.data.gov.uk/blog/the-launch-of-datapoliceuk

You can find the portal here: http://data.police.uk/

The data extracts are here: http://data.police.uk/data/

I wanted to have a look at this and see what can be done with it, but there are dozens of CSV files to load to get the full view, so this is a quick post on what I've been doing this evening. 🙂

Having found an interesting post on loading multiple CSV files in T-SQL, I've updated it to load the files in. First though, we need to create a table to take the data.

Create the Crime_Info table

An initial table to receive the data is shown below:

CREATE TABLE [dbo].[Crime_Info]( 
    [Crime ID] [varchar](100) NULL, 
    [Month] [varchar](100) NULL, 
    [Reported by] [varchar](100) NULL, 
    [Falls within] [varchar](100) NULL, 
    [Longitude] [float] NULL, 
    [Latitude] [float] NULL, 
    [Location] [varchar](100) NULL, 
    [LSOA code] [varchar](100) NULL, 
    [LSOA name] [varchar](100) NULL, 
    [Crime type] [varchar](200) NULL, 
    [Last outcome category] [varchar](max) NULL, 
    [Context] [varchar](max) NULL 
) ON [PRIMARY] WITH (DATA_COMPRESSION = PAGE )

Note that I page-compressed the table, as I wanted to reduce the space it took up. It went down from 3GB to around 1.5GB.
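As an aside, if you want an idea of the likely saving before you compress, sp_estimate_data_compression_savings will give you an estimate. A quick sketch against the table above:

-- Estimate the space saving of PAGE compression for the whole table
EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'Crime_Info',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';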

Load the Crime files

I’ve modified the bulk load script mentioned above, and you can download it here:

https://dl.dropboxusercontent.com/u/2765900/Load_UK_Crime_Data.sql
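In case that link moves, the gist of the script is just a loop that runs a BULK INSERT per file. A minimal sketch of the approach (the folder and file names here are illustrative, and the real script enumerates the files rather than listing them by hand):

-- Sketch: BULK INSERT each monthly street-level CSV into the Crime_Info table
DECLARE @files TABLE (FileName varchar(260));
INSERT INTO @files (FileName)
VALUES ('2013-04-avon-and-somerset-street.csv'),   -- illustrative file names
       ('2013-04-bedfordshire-street.csv');

DECLARE @file varchar(260), @sql nvarchar(max);

DECLARE file_cursor CURSOR FOR SELECT FileName FROM @files;
OPEN file_cursor;
FETCH NEXT FROM file_cursor INTO @file;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'BULK INSERT dbo.Crime_Info
                 FROM ''C:\Inbox\Police\' + @file + N'''
                 WITH (FIELDTERMINATOR = '','', ROWTERMINATOR = ''\n'', FIRSTROW = 2);';
    EXEC sys.sp_executesql @sql;
    FETCH NEXT FROM file_cursor INTO @file;
END

CLOSE file_cursor;
DEALLOCATE file_cursor;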

Once loaded, I had 15,668,549 rows of data, sitting in a 1.2GB database.
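If you want to sanity-check the load yourself, a quick row count and sp_spaceused cover it:

-- Quick sanity check on row count and space used
SELECT COUNT(*) AS CrimeRows FROM dbo.Crime_Info;
EXEC sp_spaceused 'dbo.Crime_Info';   -- table size, reflecting the page compression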

Then, by playing with PowerPivot you can create ‘fun’ visualisations like this.

image

Have fun, and as they say on Crimewatch, “don’t have nightmares, do sleep well”.

So, about 10 months ago, I did a blog article about my experiences moving from a Blackberry to an iPhone 4s (Read it here).

Well, I’ve moved off it now. It was a combination of reasons, including:

  1. I can’t write stuff for it. I’m starting to do more programming at the moment, and it’s a nice idea to be able to write stuff for the phone I’m using. I can’t do this on an iPhone.
  2. The screen is small. I've also got an iPad 2 and still happily use that. However, the screen on the iPhone is small compared with a lot of newer smartphones.
  3. Magpie syndrome. I like shiny things; while my Blackberry 9700 kept me happy for 2+ years, the Blackberry 9900 didn't, and it turns out that the iPhone hasn't either.
  4. Using the iPhone feels like I’m in a glass box, with all the fun things being outside. I know this sounds ridiculous, but I’m not sure how else to explain it.

So, what have I moved to? A very kind person has let me have their previous phone, a Galaxy Nexus. This is lovely to use, and easily covers the items above. In addition, it lets me:

  1. Use almost all the apps I was using on the iPhone (Evernote, RunKeeper, Pluralsight, Netflix, Flipboard and iPlayer).
  2. Discover some new apps, including Falcon Pro (a really nice Twitter client) and Ingress. Ingress is really interesting, and doesn't appear to have an equivalent on the iPhone.

I'm not saying I've stopped using the iPhone; I'm now using it as a GPS device for running. It's great that the RunKeeper app is available on both platforms, and the iPhone is smaller, so it fits nicely in my pocket when out and about.

I've also started a separate blog (separation of anxieties, or something), RunningNick.com, which will cover my training for the Great South Run, which I'm doing in October (10 MILES!!!). If you'd like to sponsor me, there is a link on the right; funds go to Diabetes UK.

This is the third article I’ve done on the TPC-H benchmark, and part 2 on AWS RedShift. Read the first article here, and the first part on AWS Redshift here.

Previously, I covered loading data into a database in RedShift, and discovered (unsurprisingly) that the performance of SQL Server on a laptop is pretty much the same as that of RedShift for a 1GB dataset. No great surprises there.

This time, I’m using a 100GB data set. Having spent quite a while generating the dataset, and then getting it uploaded to S3, I’m now in a position to start the loads.

GZIP is King!

One of the great features of RedShift is that it'll load GZIP'd datasets directly into the database; you just need to add the gzip parameter at the end. So the load statements below are the same as before, except they now have gzip at the end, and I'm loading from the gzipped files rather than the straight text versions.

copy customer from 's3://oldnick-tpch/customer.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy orders from 's3://oldnick-tpch/orders.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy lineitem from 's3://oldnick-tpch/lineitem.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy nation from 's3://oldnick-tpch/nation.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy part from 's3://oldnick-tpch/part.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy partsupp from 's3://oldnick-tpch/partsupp.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy region from 's3://oldnick-tpch/region.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
copy supplier from 's3://oldnick-tpch/supplier.tbl.gz' CREDENTIALS 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' delimiter '|' gzip;
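If any of the copies fail, or load fewer rows than expected, Redshift keeps the detail in the stl_load_errors system table, which is worth a quick check. Something along these lines:

-- Check the most recent COPY errors, newest first
select query, filename, line_number, colname, err_reason
from stl_load_errors
order by query desc
limit 10;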

I quite like the AWS interface for loading the data, so while the loads above are running, I can go into the AWS RedShift management console and see the progress of the loads:

image

I can also see various performance metrics while the jobs are running. Shown below are some of the more interesting ones. It’s particularly interesting that the CPU Utilisation is pegged at 100% while the load is running. I’m guessing that this is due to the loads being GZIP’d, so there’ll be an overhead of decompression in there, aside from the overhead of the load itself.

image

The Results

Having then run the same query as before, it took longer to return (as expected).

100GB Dataset – Time to Return
Redshift (1 node cluster), cold – 3 min 11 sec
Redshift (1 node cluster), warm – 2 min 47 sec

So, with 100 times the data, the time to execute is slightly over 100 times as long. However, based on the maxing out of the CPU and the IOPS, the Redshift environment probably needs a higher spec for a 100GB dataset.

Comments are welcome, as I'm aware this is a very specific test and should not be taken as a rounded evaluation of Redshift.