
Friday, October 28, 2011

Easy performance burst for large data

This week I started importing data from a new source. That means I take around 5-6 times as much data into the system as before, going from handling around 10.000.000 (10M) records a day to almost 60M. Lucky me, this is still in development, so it's possible to use all the tips and tricks I learned at SQL PASS this year.

I found a couple of things that quite cheaply gave me a boost in performance. Data types! I have always just used int, datetime and float regardless of the actual size of the data. Never gave it much thought - we just bought bigger disks.

But at PASS I attended several sessions and discovered two main issues with that approach: memory spills (I think that was the buzzword) and pages. Never mind, I will explain.

Most operations on SQL Server start out in memory, but when the data gets too large to stay there, it spills down to the hard drive, into tempdb or the OS paging file. That makes a huge difference in speed, as the performance of a hard drive is drastically reduced compared to RAM. The lesson learned: keep data as small as possible, which postpones or reduces the times data has to go to disk to be sorted, joined or whatever.
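
By the way, a quick way to get a feel for how much your sessions are pushing into tempdb is to look at the space usage DMV. This is just a sketch from memory, so double check the exact column names in Books Online:

-- pages allocated in tempdb per session; the internal objects are the
-- work tables and sort runs the server creates behind your back
select session_id,
       internal_objects_alloc_page_count,
       user_objects_alloc_page_count
from sys.dm_db_session_space_usage
where internal_objects_alloc_page_count > 0
order by internal_objects_alloc_page_count desc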

I also realized another thing which I actually knew but never put into this context (shame on me). Table data is stored on disk in 8 KB pages, and the server always reads a full page at a time. So if you have a record smaller than 8 KB, the server only needs to read that one page to access the record. But if the record is 8.1 KB (well, the max allowed for a row is 8060 bytes, but stay with me) then it needs to read two pages. And as IO to disk is the slowest thing we have on the server, the more pages it needs to read/write every time, the slower it is. It's very easy to see how much IO you actually use by turning on statistics io. Let me give you a simple example.

Create a table with char(8000) just to show the point


create table test(c1 char(8000))

Then turn on statistics: set statistics io on
Run the following: insert test values('horse')
And then: select * from test
In the message window you should see:

Table 'test'. Scan count 0, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The thing to look for in this case is logical reads 1, which means the server had to read one page.
Insert another row: insert test values('cow')
And: select * from test
This time you see that logical reads says 2! It had to read two pages.

Drop the table (drop table test) and then let's reduce the record size by half.

create table test(c1 char(4000))

Run the two inserts again and do a select between each of them. You should now see that logical reads stays at 1 even after two inserts. If you insert one more record, it will go to two reads.
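
If you want to double check the page math without counting logical reads, sp_spaceused works too. The data column is reported in KB and a page is 8 KB, so with two char(4000) rows you should see data at around 8 KB, and with three rows around 16 KB:

-- reports rows, reserved, data and index size for the table
exec sp_spaceused 'test'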

Well, it is not often we can reduce a record by that much. But if you have an int field which is only used for storing 1 to 10, then a tinyint would do just fine and give you 3 bytes back. In one of my tables I found that by using tinyint, smallint and smalldatetime instead of int and datetime I could reduce the record by 54 bytes. In round numbers that means I save a page for every 150 records. And in a table with almost 1.000.000.000 records that is somewhat a lot :) Almost 50 GB less data.
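
Just to illustrate the kind of change I mean - the column names here are made up, not from my real table:

-- old version: everything is int/datetime "just in case"
create table MeasurementOld(
    SourceId    int,       -- 4 bytes, but only holds 1 to 10
    Quality     int,       -- 4 bytes, values 0 to 100
    ReadingTime datetime   -- 8 bytes, minute precision is enough
)

-- new version: same information, 10 bytes less per row
create table MeasurementNew(
    SourceId    tinyint,       -- 1 byte (0 to 255)
    Quality     tinyint,       -- 1 byte
    ReadingTime smalldatetime  -- 4 bytes, minute precision up to year 2079
)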

Of course I never fetch all those records at once. But I always get way more than 150 records at a time, and this is done a lot. So that's a lot of IO reduction - for free.

I found another table which had 9 ValidSomething columns, each 0 if that condition was not valid and 1 if it was. They were all int. That is 9 x 4 bytes (36) which can be reduced to 2 bytes. If a column only holds 0 and 1 (what a developer would call a boolean) then it can be stored as a bit column. If you only have one bit column in a record, tinyint and bit would be the same, namely 1 byte. But the server stores up to 8 bit columns in one byte, so with 9 bit columns only 2 bytes are needed. And the smart thing is that adding 7 extra bit fields later will not cost me anything - I like that.
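
A sketch of what that looks like (again with made up column names):

-- nine flags as int would be 9 x 4 = 36 bytes per row
-- as bit columns the server packs up to 8 of them into one byte,
-- so these nine only take 2 bytes - and flags 10 to 16 come for free
create table ConditionFlags(
    RecordId    int,
    ValidPrice  bit,
    ValidDate   bit,
    ValidSource bit,
    ValidOwner  bit,
    ValidAmount bit,
    ValidRegion bit,
    ValidType   bit,
    ValidStatus bit,
    ValidBatch  bit  -- the ninth bit starts the second byte
)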

So when working with large data, every byte counts. I am currently redesigning my data collection flow. I tried to load the new daily data burst the old way, all at once. It took 13 hours. By reducing the data size, and with some T-SQL redesign as well, I had it down to around half an hour. Not bad for a couple of days' work.

Maybe I should optimize myself - at that rate I would only need to go to work half of Monday and my 40+ hour week would be over :) But where is the fun in that, right? :D

I will return with a few query "not to do" tips I discovered when rewriting some of the code.
Happy optimizing,
