It’s T-SQL Tuesday again, and this time it’s hosted by Nigel Sammy. Thanks for hosting, Nigel; enjoy the post.

Not so long ago, I was lucky enough to go to SQL Bits X. It was a great few days, and I highly recommend it to you!

The keynote session, given by Conor Cunningham, was a 400-level session on the ColumnStore index, which is a new feature in SQL Server 2012.

The demo was, unsurprisingly, really good, and it made me wonder ‘is it really that good?’ So I thought I’d give it a go and see.

Having Googled around a bit, I found a useful blog article by Sacha Tomey that went through a few examples. With permission, I’m going to run through a similar process, add a few bits in, and use a different data set.

Part of me really hates the AdventureWorks demo database, so you can imagine my delight when I discovered that there is now a bigger retail data set, structured as a data warehouse. This is the Contoso BI set, and I like it.

Getting down to it

After installing the ContosoBI database, you’ll end up with a fact table, factOnlineSales, with approximately 12.6 million rows in it.
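To sanity-check the starting point, a quick row count against the fact table should come back with roughly that figure (a minimal check; adjust the database name to whatever your Contoso BI install used):

-- Quick sanity check of the starting row count
SELECT COUNT_BIG(*) AS RowCnt
FROM dbo.FactOnlineSales;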

First off, I want to try and get a level playing field, so we’ll be running with Statistics IO and Statistics Time on, and we’ll be clearing the buffers before each query:

set statistics IO on;
set statistics time on;
dbcc dropcleanbuffers;

The Clustered Index

To get a baseline comparison, I ran the test query shown below against the supplied Clustered Index.

dbcc dropcleanbuffers;
go
SELECT
StoreKey ,SUM(SalesAmount) AS SalesAmount
FROM   factOnlineSales
GROUP BY StoreKey
ORDER BY StoreKey

This gave the following results:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'FactOnlineSales'. Scan count 5, logical reads 46821, physical reads 1, read-ahead reads 46532, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 8377 ms,  elapsed time = 3476 ms.

Just a Heap

Next, I wanted to get rid of the Clustered Index, but since I didn’t really want to lose the original table, I ran this code to copy the contents of the factOnlineSales table into a new table, factCleanSales.

select * into factCleanSales from FactOnlineSales

That gave me 12.6 million rows, but I wanted more, so next I ran this:

insert into factCleanSales
select dateadd(yy,3,DateKey), StoreKey, ProductKey, PromotionKey, CurrencyKey, CustomerKey, SalesOrderNumber, SalesOrderLineNumber, SalesQuantity, SalesAmount,
ReturnQuantity, ReturnAmount, DiscountQuantity, DiscountAmount, TotalCost,
UnitCost, UnitPrice, ETLLoadID, dateadd(yy,3,LoadDate), dateadd(yy,3,UpdateDate) from factOnlineSales

This gave me approx. 25 million records, and no Clustered Index. So I ran the test query again. It took a little longer this time.

dbcc dropcleanbuffers;
go
SELECT
StoreKey ,SUM(SalesAmount) AS SalesAmount
FROM   factCleanSales
GROUP BY StoreKey
ORDER BY StoreKey

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'factCleanSales'. Scan count 5, logical reads 505105, physical reads 0, read-ahead reads 504823, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 14976 ms,  elapsed time = 33987 ms.

Nearly 10 times longer to run, and more than 10 times the I/O, but that wasn’t surprising since we had no indexes.

Add one Non-Clustered

So, following Sacha’s lead, I added a compressed, nonclustered index into the pot.

CREATE NONCLUSTERED INDEX [IX_StoreKey] ON [dbo].factCleanSales
(    StoreKey ASC    )
INCLUDE ([SalesAmount]) WITH
(
PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100, DATA_COMPRESSION = PAGE
) ON [PRIMARY]
GO

Clearing the buffers and running the query again resulted in a better experience.

Table 'factCleanSales'. Scan count 5, logical reads 43144, physical reads 1, read-ahead reads 42999, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 18877 ms,  elapsed time = 5785 ms.

The query time was down to a more reasonable level, though still longer than with the Clustered Index.

ColumnStore Time!

Adding the ColumnStore index took a while, just over 2 minutes; the definition I ran is shown below. Note that the ColumnStore index has all of the table’s columns in its definition. You can’t have Include columns, and by having every column in there, you gain huge flexibility for the index.

Create nonclustered columnstore index [IX_ColumnStore] on  [dbo].factCleanSales
(    OnlineSalesKey, DateKey, StoreKey, ProductKey,
PromotionKey, CurrencyKey, CustomerKey, SalesOrderNumber,
SalesOrderLineNumber, SalesQuantity, SalesAmount, ReturnQuantity,
ReturnAmount, DiscountQuantity, DiscountAmount, TotalCost, UnitCost,
UnitPrice, ETLLoadID, LoadDate, UpdateDate
) with (Drop_Existing = OFF) on [PRIMARY];

Next I ran the test query.

Table 'factCleanSales'. Scan count 4, logical reads 6378, physical reads 27, read-ahead reads 13347, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 515 ms,  elapsed time = 378 ms.

That’s around a tenth of the time the Clustered Index took, and the great thing is that, because all the columns are in there, you can write more complicated queries and still get amazing speed. The query below, for example, still came back impressively quickly:

dbcc dropcleanbuffers;
go
SELECT
year(DateKey), storekey ,SUM(SalesAmount) AS SalesAmount
FROM   factCleanSales with (index ([IX_ColumnStore]))
GROUP BY year(DateKey), storekey
ORDER BY year(DateKey)

Table 'factCleanSales'. Scan count 4, logical reads 8156, physical reads 78, read-ahead reads 16224, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 4603 ms,  elapsed time = 1522 ms.

Is there a Downside?

Yes. Two, actually.

Firstly, it’s an Enterprise-only feature. This is annoying; however, it is linked to the second downside: you cannot insert, update or delete directly while a ColumnStore index is present.
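For example, a trivial update like the hypothetical one below is rejected straight away, with the error that follows:

-- Hypothetical direct update; this fails while the ColumnStore index is in place
UPDATE dbo.factCleanSales
SET    SalesAmount = 0
WHERE  StoreKey = 1;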

Msg 35330, Level 15, State 1, Line 1
UPDATE statement failed because data cannot be updated in a table with a columnstore index. Consider disabling the columnstore index before issuing the UPDATE statement, then rebuilding the columnstore index after UPDATE is complete.

This means that if you are using it on a data warehouse, you’ll need to disable the index on the fact table, insert/update the data, then rebuild the index to get it back online. This isn’t ideal; however, there is an alternative: you can use Partition Switching to move data in and out of the table.
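A minimal sketch of that disable / load / rebuild pattern, using the index and table names from this post, might look like this:

-- Disable the ColumnStore index so normal DML is allowed again
ALTER INDEX [IX_ColumnStore] ON [dbo].[factCleanSales] DISABLE;

-- ... run the inserts / updates / deletes here ...

-- Rebuild the ColumnStore index to bring it back online
ALTER INDEX [IX_ColumnStore] ON [dbo].[factCleanSales] REBUILD;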

Effectively, what you’ll be doing to insert data is to load it into a staging table with the same schema as the fact table, and switch it in. For updating or deleting, you’d switch the appropriate partition out, update/delete the data, then switch it back in again (there’s a sketch of that reverse flow at the end of the demo below). It’s more complicated (obviously), but the performance improvement gained by ColumnStore indexes should be worth it. Given that Table Partitioning is an Enterprise feature, it makes sense (kind of) that ColumnStore indexes should be too.

Partition Switching

To demonstrate how inserting into a table with a ColumnStore index on it works, I dropped the indexes on the factCleanSales table, then partitioned and clustered it using the following:

CREATE PARTITION FUNCTION [myPartFunc](int) AS RANGE RIGHT
FOR VALUES (N'2003', N'2004', N'2005', N'2006', N'2007', N'2008', N'2009',
N'2010', N'2011', N'2012', N'2013', N'2014', N'2015')

CREATE PARTITION SCHEME [myPartScheme] AS PARTITION [myPartFunc] TO
([PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY],
[PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY],
[PRIMARY], [PRIMARY])

CREATE CLUSTERED INDEX [ClusteredIndex_on_myPartScheme_634694274321586358] ON [dbo].[factCleanSales]
( [YearPart] )WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [myPartScheme]([YearPart])

Then I added the ColumnStore index back onto the table; it is automatically aligned to the partition function and scheme above.

CREATE NONCLUSTERED COLUMNSTORE INDEX [IX_ColumnStore] ON [dbo].[factCleanSales] (
[OnlineSalesKey], [DateKey], [StoreKey], [ProductKey], [PromotionKey],
[CurrencyKey], [CustomerKey], [SalesOrderNumber], [SalesOrderLineNumber],
[SalesQuantity], [SalesAmount], [ReturnQuantity], [ReturnAmount],
[DiscountQuantity], [DiscountAmount], [TotalCost], [UnitCost],
[UnitPrice], [ETLLoadID], [LoadDate], [UpdateDate], [YearPart]
) WITH (DROP_EXISTING = OFF)

Next, I created a table to switch the data in from, loaded it up, added the ColumnStore index, and then switched the partition in, using the following:

CREATE TABLE [dbo].[factCleanSales_Part](
[OnlineSalesKey] [int] IDENTITY(1,1) NOT NULL,
[DateKey] [datetime] NOT NULL,
[StoreKey] [int] NOT NULL,
[ProductKey] [int] NOT NULL,
[PromotionKey] [int] NOT NULL,
[CurrencyKey] [int] NOT NULL,
[CustomerKey] [int] NOT NULL,
[SalesOrderNumber] [nvarchar](20) NOT NULL,
[SalesOrderLineNumber] [int] NULL,
[SalesQuantity] [int] NOT NULL,
[SalesAmount] [money] NOT NULL,
[ReturnQuantity] [int] NOT NULL,
[ReturnAmount] [money] NULL,
[DiscountQuantity] [int] NULL,
[DiscountAmount] [money] NULL,
[TotalCost] [money] NOT NULL,
[UnitCost] [money] NULL,
[UnitPrice] [money] NULL,
[ETLLoadID] [int] NULL,
[LoadDate] [datetime] NULL,
[UpdateDate] [datetime] NULL,
[YearPart] [int] NULL
)

alter table [factCleanSales_Part] with check add constraint chk2006 check (yearPart=2006)

CREATE CLUSTERED INDEX [ClusteredIndex_on_myPartScheme_634694274321586358] ON [dbo].[factCleanSales_Part] (    [YearPart]
)WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [myPartScheme]([YearPart])

insert into factCleanSales_Part
select dateadd(yy,-1,DateKey), StoreKey, ProductKey, PromotionKey, CurrencyKey, CustomerKey, SalesOrderNumber, SalesOrderLineNumber, SalesQuantity, SalesAmount,
ReturnQuantity, ReturnAmount, DiscountQuantity, DiscountAmount, TotalCost,
UnitCost, UnitPrice, ETLLoadID, dateadd(yy,-1,LoadDate),
dateadd(yy,-1,UpdateDate) , year(dateadd(yy,-1,DateKey)) from factOnlineSales
where year(dateadd(yy,-1,DateKey))=2006

CREATE NONCLUSTERED COLUMNSTORE INDEX [IX_ColumnStore] ON [dbo].factCleanSales_Part (
[OnlineSalesKey], [DateKey], [StoreKey], [ProductKey], [PromotionKey],
[CurrencyKey], [CustomerKey], [SalesOrderNumber], [SalesOrderLineNumber],
[SalesQuantity], [SalesAmount], [ReturnQuantity], [ReturnAmount],
[DiscountQuantity], [DiscountAmount], [TotalCost], [UnitCost],
[UnitPrice], [ETLLoadID], [LoadDate], [UpdateDate], [YearPart]
) WITH (DROP_EXISTING = OFF)

Next, to check that there were no records already in the 2006 partition, I ran this:

SELECT YearPart, $PARTITION.myPartFunc(YearPart) AS Partition,
COUNT(*) AS [COUNT] FROM factCleanSales
GROUP BY YearPart, $PARTITION.myPartFunc(YearPart)
ORDER BY Partition

[Screenshot: row counts per partition before the switch]

Next, I switched the data in using this, and then checked the partition values using the statement above.

-- Switch the loaded 2006 data from the staging table into the matching partition of the fact table
alter table [factCleanSales_Part] switch partition $PARTITION.myPartFunc(2006) to [factCleanSales] partition $PARTITION.myPartFunc(2006)

[Screenshot: row counts per partition after the switch, now including the 2006 partition]

Delightfully, the fact table now has another partition, and all without removing the ColumnStore index on it.
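For updates or deletes, the same trick works in reverse: switch the partition out to the staging table, disable that table’s ColumnStore index, make the changes, rebuild the index, and switch the partition back in. Here’s a hedged sketch of that flow, reusing the objects created above (the update itself is purely hypothetical):

-- Switch the 2006 partition out of the fact table into the (now empty) staging table
alter table [factCleanSales] switch partition $PARTITION.myPartFunc(2006) to [factCleanSales_Part] partition $PARTITION.myPartFunc(2006)

-- With the staging table's ColumnStore index disabled, normal DML is allowed again
alter index [IX_ColumnStore] on [factCleanSales_Part] disable
update [factCleanSales_Part] set SalesAmount = 0 where ReturnQuantity > SalesQuantity -- hypothetical change

-- Rebuild the index so the staging table matches the fact table again, then switch back in
alter index [IX_ColumnStore] on [factCleanSales_Part] rebuild
alter table [factCleanSales_Part] switch partition $PARTITION.myPartFunc(2006) to [factCleanSales] partition $PARTITION.myPartFunc(2006)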

For Extra credit…

Now, should you want to get more details out of the ColumnStore index, there are a couple of new DMVs that can be used. They are:

  • sys.column_store_dictionaries
  • sys.column_store_segments

To see useful information like the sizing or number of rows per column, you can use this query:

select object_name(p.object_id) as 'TableName', p.partition_number, p.data_compression_desc,
c.name, csd.entry_count, csd.on_disk_size
from sys.column_store_dictionaries csd
join sys.partitions p on p.partition_id = csd.partition_id
join sys.columns c on c.object_id = p.object_id and c.column_id = csd.column_id
order by p.partition_number, c.column_id

which returns the entry count and on-disk size for each column, per partition. Summing the on_disk_size values gives you the size of the index in bytes.
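For a quick per-table total, something along these lines works (a sketch built on the same DMV; note that sys.column_store_segments also exposes an on_disk_size column, which you may want to add in for the complete picture):

-- Approximate ColumnStore dictionary size per table, in MB
select object_name(p.object_id) as 'TableName',
sum(csd.on_disk_size) / 1024.0 / 1024.0 as 'DictionarySizeMB'
from sys.column_store_dictionaries csd
join sys.partitions p on p.partition_id = csd.partition_id
group by p.object_id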

My Demo Environment

Just for transparency, the timings I was getting above weren’t on any huge server. They were from a virtual machine running in VMware Workstation v8.0.2 on Windows 7 SP1. SQL Server is 2012 (obviously), Developer Edition, 64-bit.

[Screenshot: the virtual machine’s configuration]

Wrapping up…

I think it’s reasonably safe to say that this is the longest (in size and time) blog post I’ve written, so I apologise if it rambles a bit, but I hope it gets across the importance of ColumnStore indexes, and that you get the chance to use them.