Category: T-SQL Tuesday


Thanks to everyone who posted on T-SQL Tuesday this month. Below is a summary of the posts, so have a look through if you’ve not had chance yet.

As an aside, if you’ve not watched the film yet, it’s available here (Google | AmazonUK | AmazonUS)

There were some really interesting, and terrifying posts here, so pull up a chair, grab some whiskey, turn the lights down, and have a read through.

Don’t forget to keep an eye out for the next TSQL2sDay post, in a couple of weeks time.

The Posts!

Rob FarleyWhen someone deletes a shared data source in SSRS

Thomas RushtonSQL Wildcards

Rick KruegerNightmare on TSQL Street – The Case of the Missing Cache

Matthew VelicSoylent Growth

Ted KruegerHorrify Me!

Ken WatsonSoylent Green

Chris ShawAre you kidding me ?

Thomas Rushton (Again!) – Soylent Inbox

Jason BrimhallHigh Energy Plankton

Jes BorlandSoylent Green SQL Server

Bob PusateriA Horror Story

Steve Jones (Voice of the DBA)Soylent Green

Chris Yates – Soylent Green

Jeffrey VerheulSoylent Green

 

image

20121003-200545.jpg Welcome to TSql2sday issue #35, this time hosted by me…

It’s a bit last minute, as I stepped in to help Adam out, so bear with me. As always, thanks to Adam for starting this off, I’ve posted a few articles on previous runs, and have found other people’s posts to be really interesting. I hope this follows in the same way.

Over the past couple of days I’ve been attending a training course in Paris, and one evening, to relax I watched ‘Soylent Green‘, a classic science fiction film. If you’ve not seen it, I recommend it, and go and watch it …

So, what I’d like to know is, what is your most horrifying discovery from your work with SQL Server?

We all like to read stories of other people’s misfortunes and, in some ways they help to make us better people by learning from them. Hopefully, there is nothing as bad as Charlton Heston’s discovery, but there may be in its own way.

A couple of extra thoughts for motivational thinking…

Soylent Brown – You did a post, Great Job!!

Soylent Orange – You did a post, it made me wince!

Soylent Green  – You did a post, it made me wince, and it included some T-SQL.

Do you have the words straight?

Here are the rules as usual: If you would like to participate in T-SQL Tuesday please be sure to follow the rules below:

  • Your blog post must be published between Tuesday, October 9th 2012 00:00:00 GMT and Wednesday, October 10th 2012 00:00:00 GMT.
  • Include the T-SQL Tuesday logo (above) and hyperlink it back to this post.
  • If you don’t see your post in trackbacks, add the link to the comments below.
  • If you are on Twitter please tweet your blog using the #TSQL2sDay hashtag. I can be contacted there as @nhaslam, in case you have questions or problems with comments/trackback.

Thank you all for participating, and special thanks to Adam Machanic (b|t) for all his help and for continuing this series!

Thanks for posting, and I’ll have a follow-up post listing all the contributions as soon as I can.

It’s another TSQL2sday post, this time hosted by Rob Volk (b | t ). Thanks for hosting Rob.

So this month, it’s about how we fixed a problem, or found help when we couldn’t fix a problem, with a theme based on ‘Help’ by The Beatles

I chose the 2nd verse…

When I was younger, so much younger than today

So, many years ago, when I started out with SQL Server, back in the heady days of 6.5, there was much less of a SQL Community, actually, I don’t even remember one. The only way I could get help, was either through using MSDN, or by emailed colleagues I met on a SQL training course.

I never needed anybody’s help in any way.

Though that’s primarily due to stopping using SQL for a while, just a year or so, but still.

Everyone needs help, at some point, with something. It’s not a weakness, it’s a strength.

But now these days are gone, I’m not so self assured.

In the past few years, I’ve started working more and more with SQL, and found that it is such a huge product that no one can know the whole thing (SSAS, SSIS, SSRS included), and because of that, I’ve found several ways to get help if I need it.

Though, before I get into that, I need to say something about the community. There is a huge SQL Community out there, though the first community event I attended wasn’t a SQL One. It was a Developer event, Remix Uk, back in 2008 (http://www.microsoft.com/uk/remix08/default.aspx). It was a great event and I got to meet some great people there, including Scott Guthrie! Getting to this event was pretty much solely due to an ex-colleague, Jes Kirkup. Thanks Jes!

Since then I’ve started attending community events where I can, including the local DevEvening events (where I’ve done a couple of short presentations), and SQL community events (SQLMaidenhead, SQL in the Evening, and SQLBits of course!). I’ve found that these are a great way of getting a great insight into what skills others in the industry have, and so where I should be targeting my learning. Following on from that, I’ve met some great people, and there are people who I know I could ask for help if I needed to.

Not to mention the #SQLHelp hash tag on twitter, where there is help, pretty much 24hours a day, the only restriction being the need to phrase your question in 150 characters (160-hash tag).

Now I find I’ve changed my mind and opened up the doors.

Now I find that I am helping people where I get the opportunity, am publishing blog articles (here, like this one!) and am hoping to do more Community presentations. Furthermore, I’m doing internal training courses (next month I’m doing one on SSAS), and have recently started mentoring a colleague in SQL.

It’s great to be able to share knowledge and experience.

Thanks for listening, and reading, and thanks again to Rob for hosting.

T-SQL Tuesday

Thanks to Erin Stellato for hosting this months #TSQL2sday. Erin wanted to know all about what we do every day!

Interestingly, when I was much younger, I wanted to be a Fire-fighter or a Pilot. I’m still quite keen on learning to fly, but that’s looking less likely as time is going by (Eyesight, time, age and cost in that order).

Now though, and for the past 12 years or so, I work as a Consultant. It’s a nice, vague title. It started out as ‘Technical Consultant’, moved through Systems Consultant, and CRM Consultant. It’s currently bouncing between BI Consultant and Data Warehousing Consultant depending on the project I’m working on.

2012-07-11 07.31.33
My Journey to Work

2012-07-12 07.33.32
The Office

My Day!

The day started by sitting in a traffic jam. Pretty common that, unfortunately.

However, when I made it to my desk, I did a couple of checks of a server that I was running maintenance jobs on overnight. All was well, so I dived into email.

A couple of interesting items in there, one was a link about a Pigeon with a USB stick being faster than UK broadband (BBC link here). Also, was an invitation to the Microsoft Hadoop on Azure trial, which looks really interesting, and something I’ll have a look at next week (link here).

The Morning

Then, I started work on a Customer project that I’m working on this week. It’s effectively adding two additional country feeds (Spain and France, since you asked), to a data warehouse. The customer is using WhereScape RED, so it was a pretty straightforward matter of dragging and dropping the tables from the DB2 source system, into the ETL tool. WhereScape RED then generates the stored procedures to allow the ETL process to run, to get the data into the DWH.

Sounds a pretty straightforward process, however, there are 91 tables, and a couple of minor modifications to each one. So that took up all of my morning.

The Afternoon

The afternoon was pretty much taken up by an interesting problem with a BusinessObjects (XI4)environment. It was apparently continually running a query against the SQL Server database. We managed to prove it was the BO server doing this by changing the service account it was running as. The query could be seen in sp_whoisactive (thank you @AdamMachanic) to be run by a different user. The query was proceeding to take the server utilisation to 100%, which meant that the other databases on the server couldn’t effectively service user queries.

To temporarily resolve this issue, we put Resource Governor on, which restricted the BusinessObjects service to 25% of the CPU power, thereby letting the other users have some resources.

I found a really helpful query that helped me to find the queries that were being run. The query (from SQLAuthority, is copied here).

SELECT sqltext.TEXT, req.session_id, req.status,
req.command, req.cpu_time, req.total_elapsed_time
FROM sys.dm_exec_requests req
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS sqltext

Copied from http://blog.sqlauthority.com/2009/01/07/sql-server-find-currently-running-query-t-sql/

The final solution to the issue was to apply the BO XI4 SP4 patch, which appears to have resolved the issue.

There was also a couple of questions on licencing, to which both answers were ‘If it looks to be too good to be true, it probably is’.

Sadly, I didn’t get any pictures of the Red Arrows flying around the Farnborough Airshow, which is just up the road from us, or any pictures of the White-tailed Kite we saw flying over the motorway.

And that, is pretty much my day; a comparatively quiet one, and for a change, I made it out the door and home at a reasonable time. I hope you found this interesting, and I look forward to reading about your day.

Thanks again to Erin for hosting.

It’s T-SQL Tuesday again, and this time hosted by Nigel Sammy. Thanks for hosting Nigel, enjoy the post.

Not so long ago, I was lucky enough to go to SQL Bits X. It was a great few days, an I highly recommend it to you!

The Keynote session, given by Conor Cunningham, was a 400 level session on the ColumnStore index, which is a new feature in SQl Server 2012.

The demo was, unsurprisingly, really good, and it made me wonder ‘is it really that good ?’ So I thought I’d give it a go and see.

Having Googled around a bit, I found a useful blog article by Sacha Tomey, that went through a few examples. With permission, I’m going to run through a similar process, add a few bits in, and use a different data set.

Part of me really hates the AdventureWorks demo database, so you can imagine my delight when I discovered that there is now a bigger Retail data set, structured as a DataWarehouse. This is the Contoso BI set, and I like it.

Getting down to it

After installing the ContosoBI  database, you’ll end up with a fact table, factOnlineSales, with approx. 12.6 million rows in it.

First off, I want to try and get a level playing field, so we’ll be running with Statistics IO and Statistics Time on, and we’ll be clearing the buffers before each query

set statistics IO on;
set statistics time on;
dbcc dropcleanbuffers;

The Clustered Index

Just to get a comparison, I ran the test query, shown below, to get an idea of the speed against the supplied Clustered Index.

dbcc dropcleanbuffers;
go
SELECT
StoreKey ,SUM(SalesAmount) AS SalesAmount
FROM   factOnlineSales
GROUP BY StoreKey
ORDER BY StoreKey

This gave the following results:

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘FactOnlineSales’. Scan count 5, logical reads 46821, physical reads 1, read-ahead reads 46532, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 8377 ms,  elapsed time = 3476 ms

Just a Heap

Next, I wanted to get rid of the Clustered index, but since I didn’t really want to lose the original table, I ran this code to insert the contents of the factOnlineSales table into factCleanSales.

select * into factCleanSales from FactOnlineSales

That gave me 12 million rows, I wanted more, so next I ran this:

insert into factCleanSales
select dateadd(yy,3,DateKey), StoreKey, ProductKey, PromotionKey, CurrencyKey, CustomerKey, SalesOrderNumber, SalesOrderLineNumber, SalesQuantity, SalesAmount,
ReturnQuantity, ReturnAmount, DiscountQuantity, DiscountAmount, TotalCost,
UnitCost, UnitPrice, ETLLoadID, dateadd(yy,3,LoadDate), dateadd(yy,3,UpdateDate) from factOnlineSales

This gave me approx. 25 million records, and no Clustered Index. So I ran the test query again. It took a little longer this time.

dbcc dropcleanbuffers;
go
SELECT
StoreKey ,SUM(SalesAmount) AS SalesAmount
FROM   factCleanSales
GROUP BY StoreKey
ORDER BY StoreKey

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘factCleanSales’. Scan count 5, logical reads 505105, physical reads 0, read-ahead reads 504823, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 14976 ms,  elapsed time = 33987 ms.

Nearly 10 times longer to run, and more than 10 times the I/O, but that wasn’t surprising since we had no indexes.

Add one Non-Clustered

So, following Sacha’s lead, I added a compressed, nonclustered index into the pot.

CREATE NONCLUSTERED INDEX [IX_StoreKey] ON [dbo].factCleanSales
(    StoreKey ASC    )
INCLUDE ([SalesAmount]) WITH
(
PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100, DATA_COMPRESSION = PAGE
) ON [PRIMARY]
GO

Clearing the buffers and running the query now, resulted in a better experience.

Table ‘factCleanSales’. Scan count 5, logical reads 43144, physical reads 1, read-ahead reads 42999, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 18877 ms,  elapsed time = 5785 ms.

The query time was down to a more reasonable level, though still longer than the Clustered Index.

ColumnStore Time!

Adding the ColumnStore index took a while, just over 2 minutes. The definition is below, so I ran it. Note that the ColumnStore index has all the columns in the definition. You can’t have Include Columns, and by having all the columns in there, you gain huge flexibility for the Index.

Create nonclustered columnstore index [IX_ColumnStore] on  [dbo].factCleanSales
(    OnlineSalesKey, DateKey, StoreKey, ProductKey,
PromotionKey, CurrencyKey, CustomerKey, SalesOrderNumber,
SalesOrderLineNumber, SalesQuantity, SalesAmount, ReturnQuantity,
ReturnAmount, DiscountQuantity, DiscountAmount, TotalCost, UnitCost,
UnitPrice, ETLLoadID, LoadDate, UpdateDate
) with (Drop_Existing = OFF) on [PRIMARY];

Next I ran the test query.

Table ‘factCleanSales’. Scan count 4, logical reads 6378, physical reads 27, read-ahead reads 13347, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 515 ms,  elapsed time = 378 ms.

That’s less than a tenth of the time the Clustered index took, and the great thing is, because it’s got all the columns in there, you can create more complicated queries, and still get amazing speed. By running the query below, we still got great speed!

dbcc dropcleanbuffers;
go
SELECT
year(DateKey), storekey ,SUM(SalesAmount) AS SalesAmount
FROM   factCleanSales with (index ([IX_ColumnStore]))
GROUP BY year(DateKey), storekey
ORDER BY year(DateKey)

Table ‘factCleanSales’. Scan count 4, logical reads 8156, physical reads 78, read-ahead reads 16224, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 4603 ms,  elapsed time = 1522 ms.

Is there a Downside ?

Yes. Two actually.

Firstly, it’s an Enterprise only feature. This is annoying, however, it is linked to the second downside. You cannot insert, update or delete directly, when a ColumnStore index is present.

Msg 35330, Level 15, State 1, Line 1
UPDATE statement failed because data cannot be updated in a table with a columnstore index. Consider disabling the columnstore index before issuing the UPDATE statement, then rebuilding the columnstore index after UPDATE is complete.

This means that if you are using it on a Data Warehouse, you’ll need to disable the index on the fact table, insert/update the data, then rebuild the index to get it back online. This isn’t ideal, however, there is an alternative. You can use Partition Switching to switch data in and out of the table.

Effectively, what you’ll be doing to insert data, is to load data into a partition table, with the same schema as the fact table, and switch it in. For updating or deleteing, you’d switch the appropriate partition out, update/delete the data, then switch it back in again. It’s more complicated (obviously), but the performance improvement gained by ColumnStore indexes should be worth it. Given that Table Partitioning is an Enterprise feature, it makes sense (kind of) that ColumnStore indexes should be too.

Partition Switching

To demonstrate how inserting into a table with a ColumnStore index on it was working, I dropped the indexes against the factCleanSales table, and partitioned and clustered it using the following:

CREATE PARTITION FUNCTION [myPartFunc](int) AS RANGE RIGHT
FOR VALUES (N’2003′, N’2004′, N’2005′, N’2006′, N’2007′, N’2008′, N’2009′,
N’2010′, N’2011′, N’2012′, N’2013′, N’2014′, N’2015′)

CREATE PARTITION SCHEME [myPartScheme] AS PARTITION [myPartFunc] TO
([PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY],
[PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY],
[PRIMARY], [PRIMARY])

CREATE CLUSTERED INDEX [ClusteredIndex_on_myPartScheme_634694274321586358] ON [dbo].[factCleanSales]
( [YearPart] )WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [myPartScheme]([YearPart])

Then, added the ColumnStore back into the table, and this is automatically matched to the Partitioning function and scheme above.

CREATE NONCLUSTERED COLUMNSTORE INDEX [IX_ColumnStore] ON [dbo].[factCleanSales] (    [OnlineSalesKey],    [DateKey],    [StoreKey],    [ProductKey],    [PromotionKey],    [CurrencyKey],    [CustomerKey],    [SalesOrderNumber],    [SalesOrderLineNumber],    [SalesQuantity],    [SalesAmount],    [ReturnQuantity],    [ReturnAmount],  [DiscountQuantity],    [DiscountAmount],    [TotalCost],    [UnitCost],
[UnitPrice],    [ETLLoadID],    [LoadDate],    [UpdateDate],    [YearPart]
)WITH (DROP_EXISTING = OFF)

Next, I created a table to switch the data in from, then loading it up, adding the ColumnStore index, and then switching the partition in using this:

CREATE TABLE [dbo].[factCleanSales_Part](
[OnlineSalesKey] [int] IDENTITY(1,1) NOT NULL,
[DateKey] [datetime] NOT NULL,
[StoreKey] [int] NOT NULL,
[ProductKey] [int] NOT NULL,
[PromotionKey] [int] NOT NULL,
[CurrencyKey] [int] NOT NULL,
[CustomerKey] [int] NOT NULL,
[SalesOrderNumber] [nvarchar](20) NOT NULL,
[SalesOrderLineNumber] [int] NULL,
[SalesQuantity] [int] NOT NULL,
[SalesAmount] [money] NOT NULL,
[ReturnQuantity] [int] NOT NULL,
[ReturnAmount] [money] NULL,
[DiscountQuantity] [int] NULL,
[DiscountAmount] [money] NULL,
[TotalCost] [money] NOT NULL,
[UnitCost] [money] NULL,
[UnitPrice] [money] NULL,
[ETLLoadID] [int] NULL,
[LoadDate] [datetime] NULL,
[UpdateDate] [datetime] NULL,
[YearPart] [int] NULL
)

alter table [factCleanSales_Part] with check add constraint chk2006 check (yearPart=2006)

CREATE CLUSTERED INDEX [ClusteredIndex_on_myPartScheme_634694274321586358] ON [dbo].[factCleanSales_Part] (    [YearPart]
)WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [myPartScheme]([YearPart])

insert into factCleanSales_Part
select dateadd(yy,-1,DateKey), StoreKey, ProductKey, PromotionKey, CurrencyKey, CustomerKey, SalesOrderNumber, SalesOrderLineNumber, SalesQuantity, SalesAmount,
ReturnQuantity, ReturnAmount, DiscountQuantity, DiscountAmount, TotalCost,
UnitCost, UnitPrice, ETLLoadID, dateadd(yy,-1,LoadDate),
dateadd(yy,-1,UpdateDate) , year(dateadd(yy,-1,DateKey)) from factOnlineSales
where year(dateadd(yy,-1,DateKey))=2006

CREATE NONCLUSTERED COLUMNSTORE INDEX [IX_ColumnStore] ON [dbo].factCleanSales_Part (
[OnlineSalesKey],    [DateKey],    [StoreKey],    [ProductKey],    [PromotionKey],
[CurrencyKey],    [CustomerKey],    [SalesOrderNumber],    [SalesOrderLineNumber],
[SalesQuantity],    [SalesAmount],    [ReturnQuantity],    [ReturnAmount],
[DiscountQuantity],    [DiscountAmount],    [TotalCost],    [UnitCost],
[UnitPrice],    [ETLLoadID],    [LoadDate],    [UpdateDate],    [YearPart]
)WITH (DROP_EXISTING = OFF)

Next, to check that there are no records in the partition already for 2006, I ran this:

SELECT YearPart, $PARTITION.myPartFunc(YearPart) AS Partition,
COUNT(*) AS [COUNT] FROM factCleanSales
GROUP BY YearPart, $PARTITION.myPartFunc(YearPart)
ORDER BY Partition

image

Next, I switched the data in using this, and then checked the partition values using the statement above.

alter table [factCleanSales_Part] with check add constraint chk2006 check (yearPart=2006)

image

Delightfully, the fact table now has another partition, and all without removing the ColumnStore index on it.

For Extra credit…

Now, should you want to get more details out of the columnstore index, there are a couple of new DMV’s that can be used. They are:

  • sys.column_store_dictionaries
  • sys.column_store_segments

To see useful information like the sizing or number of rows per column, you can use this query:

select object_name(p.object_id) as ‘TableName’, p.partition_number,p.data_compression_desc,
c.name, csd.entry_count, csd.on_disk_size
from sys.column_store_dictionaries csd
join sys.partitions p on p.partition_id = csd.partition_id
join sys.columns c on c.object_id = p.object_id and c.column_id= csd.column_id
order by p.partition_number, c.column_id

which will return the following data. Summing the on_disk_size will give you the size in bytes of the index.

My Demo Environment

Just for transparency, the timings I was getting above weren’t on any huge server. They were on a virtual machine, running in VMWare Workstation v8.0.2 on Windows 7 SP1. SQL Server is 2012 (obviously), Developer Edition in 64bit.

image

Wrapping up..

I think it’s reasonably safe to say that this is the longest (in size and time) blog post I’ve written, so I apologise if it rambles a bit, but I hope you get the importance of ColumnStore indexes, and I hope you get the chance to use them.

TSQL2sDay150x150Thanks to Argenis Fernandez for hosting this month.

A Random Start

I’ve had an interesting (I hope) experience in my career this far. I started out, working in a Pharmacy.. It was back there that I wrote my first computer program, well actually it was at college, but it was for them. Many eons ago, back in the early 90’s, I’d been learning Pascal at college, and things are easier to learn, if you have an aim, so I wrote a program to assist will the filling in of paperwork. It worked, and it was good. It stored data in a pipe-delimited file, rather than any kind of database. Not good, for many reasons, but it was only a single user application, and didn’t have a huge amount of data.

Know your limits…

After that, I went to university, where they tried to teach me C++, assembly and Haskell. They failed with assembly and Haskell, though the theory is still there. C++ I love, and keep going back to. In fact, with any luck, you may see me present on it at DDD Southwest in May, and at DevEvening in April… Sadly, while I do enjoy C++, I’m not good enough to do that as a career, and I don’t think I’d enjoy it as much if it was my bread+butter.

Part of the Degree I did (Computer Systems, since you asked) included a year work placement. I did this at a small IT Company, where I worked on the Support Desk. This was a pretty small company, so while I did Application support for customers, I also managed the internal infrastructure, created their website and a few other bits. It was a great experience and I really enjoyed it. So much so, that I went back there to do consultancy after I’d completed the degree.

While I was there, I learnt Windows NT, Visual Basic, Btrieve and had my first introductions to SQL Server (6.5 and 7). It was also here that I took my first Microsoft Certifications, covering Windows NT, SQL Server and assorted Networking topics.

Know when it’s time to move

After four years, I was starting to feel claustrophobic, and needed more of a challenge. At the start of 2000, I moved on, and went to work for a Siebel Consultancy. This was a big change, as while I’d done some consultancy work before, I really had to up my game. Not a bad thing, and I really found my feet with Siebel as it was based on SQL Server, and had the ability to have custom components written in Siebel VB or eScript.

More Certifications

After a great couple of years, with big Siebel implementations, including a great system linking Siebel to SAP, via XML integrations (my first real experience with enterprise-grade XML), the Siebel market for us dropped off after Oracle bought Siebel ($5.8 Billion!).

I then moved my skills to Microsoft CRM, starting with V1.2 (unpleasant), then v3.0 (much better), and also SharePoint, all of which had associated MS Certifications which I completed, and all of which were based on SQL Server.

Try to see the Obvious…

At some point, and I can’t remember when, I realised that I’d been working with SQL Server, for over 12 years, and now it’s nearing 15. I hadn’t really noticed.

For the past two years I’ve been working as a Consultant, building Data Warehouses primarily, though I also do some C# for SQL CLR work, and C++ for fun. I’ve done a ridiculous number of Certifications (mostly Microsoft) and, my motivation is to get validation of my skills. I’m working on the MCM: SQL Server certification at the moment, and have the final Lab exam in May.

What Next ?

I don’t know. I’m pretty sure there will be SQL Server involved though. Fortunately the new version of SQL Server is out now, so the new Certifications will be out soon, and that’ll keep me occupied for a while.

From reading this though, the one thing that strikes me, is that I’ve been very lucky to be in a career that keeps my brain occupied. If there are less taxing times, then I have C++ to stretch the grey matter.

One other thing; I’ve also found that it is good to keep pushing yourself. Always try to work just outside your comfort zone. If everything is easy, then you need to push yourself more.

To end, a couple of thoughts from wiser people than me.

image

T-SQL Tuesday again, and this month it’s hosted by Amit Banerjee at TroubleshootingSQL.

One of the things that I’ve become more aware of, due to preparation for the MCM certification and working on larger data warehousing projects is that multiple database file can always give you a performance improvement.

From testing that I’ve done, this is even apparent on small local databases.

As part of a series of blog posts that I’m doing, around the TPC-H benchmarks, I’ve been loading and and reloading a dataset of approx. 8.6 million records (in 8 entities). This dataset is an example of the default, 1Gb set from the TPC-H benchmark (downloadable here (approx 276mb), or you can read my previous blog article on creating it yourself)

To get some decent figures to show how how much of an improvement you can get with multiple files, I’ve created a script which does the following steps.

    1. Creates the database (2Gb per database file, and 512mb for the log file)
    2. Creates the tables
    3. Bulk loads data using a set of flat files
    4. Gives a count of each of the tables

A copy of the script is available here.

I carried out a few different tests, based on :

  • Single or Multiple files
  • Different Media
    • Running on the C drive (5400rpm SATA drive)
    • Running on USB Pen Drives
    • Running on an eSATA drive
  • Splitting over multiple media
  • Having the Transaction log stored separately

The results I found are shown below (times are shown in Minutes, Seconds, milliseconds (mm:ss:ms) )

image

The benchmark is the run on a single file, on my internal drive.

DB Build is the time to create the database, note that I’m using Instant File Initialisation, and so should you (unless you have a very good reason not to!)

Data load is the time to build the tables and load them

Conclusions

Effectively, you can see the following:

    1. Regardless of media, Multiple files always give a performance improvement
    2. USB Pen Drives are rubbish. Don’t use them for databases
    3. A fast drive, separate to the O/S, and separate from the System database will give a significant improvement

The best performance I managed to achieve was with the multiple database files, using the eSata drive.

However, given that I work primarily on a laptop, the fact that I can get a huge improvement (over a third!) by using multiple database files on the internal drive is impressive.

I’d be interested to know how much of an improvement you get on this, how much does your mileage vary ?

Thanks for reading, and thanks to Amit for hosting.

So, it’s another T-SQL Tuesday! This time, it’s around Disaster Recovery, and is hosted by Allen Kinsel. Thanks Allen.

For this month, I thought I’d share the experiences I had on a project a few years ago. This was for a multi-site retail organisation, where all the servers were hosted in the head office.

We were in the process of consolidating a number of SQL Server’s onto a more powerful, and up-to-date SQL Server environment, and the process was going well. We’d got buy-in from the board, and had started the process of consolidation and upgrades.

However (isn’t there always a however ?) a couple of weeks before Christmas, there was a power outage in the building. All power was gone. Due to the way the communications was configured, all network access on all remote sites went through the head office. Power going down meant no access to the internet for the sites. It also meant no access to the corporate servers (email, intranet, ERP systems). This also meant that the corporate website went down, since that was hosted in-house.

Once power came back up, 4 or 5 hours later, the servers restarted, and all was well again.

In the aftermath, a number of issues were raised. The main ones being:

  • Inability for customers to access the website
  • Corporate systems being offline

The website was the easiest, and quickest to resolve since that could be outsourced. The Corporate systems were more complicated, and required a little more thought.

We needed to get a Disaster Recovery plan. Having spent a good long time going through everything that needed to be covered, and how to ensure that systems were up, running and resilient enough, but not outrageously overpriced (we thought…), we submitted the plan to the board.

It included things like:

  • Remote SQL Server & mirroring of a dozen or so databases
  • Remote Exchange Server
  • Backup Internet connection
  • Support for all this, so there is someone to call if it does all go wrong…

The first time through it was rejected due to price, so we trimmed a few bits (slower backup line, changes to SQL Server licencing), and resubmitted. It then went into ‘consideration’ limbo. By the time I moved off the project (8 months later), it was still being considered. Having said that, there hasn’t been another power-outage since then (2.5 years ago), so maybe they are lucky, but then again….

So what I’m saying is, make the investment in planning and ensure it’s all up and running, because you never know…

Aggregate Functions are the topic of this months T-SQL Tuesday. An interesting one, and it made me think about what I’ve done, that could be considered interesting, with relation to Aggregation.

One thing that sprung to mind was some work I did on a Data Warehouse. I worked on a Project a while back (a few years now), that included a data source from an ERP system, that was effectively a table populated from a series of Excel worksheets. The table was setup so that each cell in the worksheet had it’s own row. This had resulted in 6,435 (cells A1 to I715) rows, per project, per financial period, so 6435*200 (and then some) * 12 (so 15,444,000) per year. The code and table samples below are representative of the process that we followed, and the table structures have been appropriately anonymised, but you get the general idea.

It wasn’t necessary to load all the source data into the data warehouse, since there was alot of information that we didn’t need. Effectively, this was the process that we had.

image

To get the values out for the project, in the correct form, the following T-SQL was used:

SELECT project_id,xl_month,xl_year,
    MAX(CASE WHEN xl_cellref ='A1' THEN xl_value END) AS 'A1',
    MAX(CASE WHEN xl_cellref ='A2' THEN xl_value END) AS 'A2'
FROM dbo.xl_test
GROUP BY project_id, xl_month,xl_year

After a bit of time with this running, we made some changes, and ended up with the following:

SELECT project_id, xl_month, xl_year,  [A1], [A2]
    FROM (
        SELECT project_id, xl_month, xl_year, xl_cellref, xl_value
        FROM dbo.xl_test) AS xl_test
    PIVOT( MAX(xl_value) FOR  xl_cellref IN
    ([A1],[A2])
    ) AS aPivot

This (and some of the other changes we made) actually improved the performance of the DWH load by approximately 25%, however, I’d imagine a fair chunk of that was down to the fact that Pivot is quicker than a dozen or so case statements.

This is my first (of many, hopefully) post  for T-SQL Tuesday. I’ve been watching for the next invite to come out, and I’m glad to see it’s on something that I can comment on, Automation.

I’ve been doing alot of work recently on Data Warehousing, and thought it’d be useful, and relevant, to run through one of the tools we use to automate the loading of Data Warehouses.

While we could (and do, sometimes) use SSIS to load the Data Warehouse, we primarily use a tool called WhereScape RED (Trial versions, whitepapers, etc. available here: http://www.wherescape.com). This imageproduct makes the process of creating, prototyping, building, and the on-going running and maintenance of a Data Warehouse much more manageable, and efficient.

While you can read all about how to use this product online, I’m going to run through some of the key points, and how it actually interacts with SQL Server.

On the left you’ll see a list of the different types of data that RED can use and generate. If you’ve done any work with Data warehousing, the majority of these will be familiar to you.

Connection

These are the connections that are used to link RED with the data warehouse. This would include items such as flat files (on Windows or Unix), Analysis Services (2000 or better) or a database (using any form of ODBC connection).

Load Tables

Load tables are the source for all the data, so whether your data is coming from a file, another database or wherever, it’ll be brought into the system, through here, and into a table called load_<something>. These are created in a nice straightforward manner, imageby dragging the file from a Source Browser (on the right), into the main window (as seen below). This then creates a table, with a set of fields, which are then used to populate the automatically generated documentation!

It’s worth pointing out that, which each of the steps allow you to customise the T-SQL generated by RED, the Load process is the only one that doesn’t allow you to see the initial script, though you can add T-SQL to run before and after the load process.

Stage Tables

Stage tables are where all the work actually takes place. This is where you can modify the SQL to change the data from the Load tables into another form, add key’s, reformat fields, and so on. It’s possible to have multiple stage tables daisy-chaining into one another, so your final stage table would be the data as it would be prior to the Dimension or Fact Table.

Dimensions and Fact Tables

Finally, you have the dimension and fact tables, and these are populated by the prior stage tables. By this point, you should have all your data ready to go.

Get to the Point!

imageBy this point, you’re probably thinking, this has little to do with T-SQL, or Automating it.… Well, I’m coming to that. What RED does is generate T-SQL for you, handle it all using a Source control system, and allow you to automate the processing using a Scheduler.

From my experience, the Generated SQL is very performant, and the wizards that are used to get the information required to generate them, give you plenty of rope to hang yourself with.

As you can see from the screen on the left, you can choose how the data would be accessed, and add Hints if need be.

Jobs and Scheduling

One of the really nice things about RED is the scheduling function. Through this, you can setup imagemultiple tasks (such as Load_customer, load_product, stage_customer_1, stage_customer_2, etc.), define the threads that can be used, and how they interact. Therefore you can load the customer and products tables from the source system simultaneously, and similarly with any other sets of tasks. While RED doesn’t suggest how this would be achieved (it doesn’t know your source systems, or how which jobs are likely to interact), it will allow you to specify the parallelism to a high degree of granularity.

You can also set dependencies (so, don’t run this job until this other one has completed), and you can run commands based on the success or failure of the jobs also. This means you can send custom emails based on the results. All logging of the jobs is also written to a database, so you can query on it to see if tasks are taking longer over time, and if there are any spikes in time taken.

What’s Next ??

Hopefully, you’ve found this interesting. I’d highly recommend trying out RED as it’s a great tool for automating the loading of data. I’d welcome any feedback you have about this article.

Happy T-SQL Tuesday!

Follow

Get every new post delivered to your Inbox.

Join 573 other followers