I’ve recently been looking at optimising a data load that I’m working on, and have got a quite a nice example of how the performance can be improved significantly by creating a CLR object.
To do this, I created a set of data (which is downloadable from here, is approx. 3mb and is extractable using 7-Zip ) which has approx. 200k records in it. The file has two columns (pipe delimited), which are a row id, and a set of events in a string (in the format “1,2,3,4,5”). Each string has a variable number of events in it (between 1 and 75), which looks something like this.
What I’m wanting to do is split this out, to have an Event Id, Sequence_Number and Event with the EventList above being split over multiple records, and a Sequence number showing the position of the event code in the list. Sounds straightforward.
I’m effectively looking for a quick way to generate this data.
T-SQL – Custom_Split
A custom script to split the table was created, looking like this:
create function [dbo].[Split] |
This was executed in Management Studio, with this script:
SELECT EventId, IDENTITY (int, 1, 1) as SequenceNo, Value as Event INTO dbo.SQL_Events FROM dbo.LargeFile CROSS APPLY dbo.split(dbo.LargeFile.EventList, ‘,’) AS t ORDER BY dbo.LargeFile.EventId |
Running this generated a table with 9.8 million rows and took 6 minutes 8 seconds, so handling approx. 26.6k records a second.
CLR – SplitEvents
To split the EventList in a CLR object, I did the following in Visual Studio 2010.
1. Choose the .NET framework, CLR Project
2. Right click the project when it’s created, and choose ‘Add –> Stored Procedure’.
In the body of the class, I used the following code:
using System; using System.Data; using System.Data.SqlClient; using Microsoft.SqlServer.Server;public partial class StoredProcedures { [Microsoft.SqlServer.Server.SqlProcedure] public static void CLR_Split() { // Put your code here char[] delimiters = new char[] { ‘,’, ‘-‘ }; using (SqlConnection conn = new SqlConnection(“Context Connection=true”)) conn.Open(); }; |
Having compiled and deployed this to SQL Server, it could be run using this :
INSERT INTO CLR_Events (EventId, Sequence_No, Event) EXEC CLR_Split |
This process created the 9.8 million rows in approx. 2 minutes 59 seconds, so handling approx. 54.7k records a second
Summary
So, having run these two processes, you can see that the CLR process is much faster.
Process | Time Taken | Rec / Sec |
T-SQL Stored Procedure | 6 minutes 8 sec | 26.6k |
CLR Procedure | 2 minutes 59 sec | 54.7k |
Notes
When running these tests, I was running SQL Server 2008 R2, on a i5-430m laptop, with 4Gb RAM.
The database had plenty of space pre-allocated to it, so there was no performance hit based on growing the database.
1 Trackback or Pingback for this entry:
[…] This post was Twitted by nhaslam […]