Pages

Wednesday, September 28, 2011

What a clusterfuck

Clustered index on large tables - first blog in my series "from newbie to DBA"

Well I geuss it's not nice to start your blog by swearing :) But as a former soldier clusterfuck is the correct term for my use of clusters on the SQL Server.

Uptil a year ago I didn't really know what the cluster did or was. But I was about to find out.

As mention in my first blog, our data has grown fast during the last couple of year. I'm managing a stack of SQL Servers which is used to collect data from stockexchanges. My main problem was the table holding the prices.

 This table had close to 900.000.000 records. 8 columns in the table is key fields. The order of those keys was nearly random as they where set up in logical reading order. And I couldn't understand why inserting around 1.000.000 new records every day took so long time - until I understood the concept of clusters.

So if you have the same problem this is where you put the smile back on your face. If you don't have that problem then the smile should already be there :)

When you create a table in SQL Management Studio (MS), using the design editor, your actually really don't pay attention to whats really happend behind the scenes. And as you can see from my CV - in most cases you actually don't need to. I made a good living letting MS do the hard work. But when the complexity increases and you need to find that DBA hat - you are a bit lost.

This was the first time I learned the hard way to use scripts for everything. I still often uses MS, also for designing my tables, but instead of pressing the OK button I always select the little script sign. And I always makes sure I totally understand what the script does.

But back to the cluster. So what happens when you design a table, select som keys an presses the OK button. Lets take a simpel version of my price table. For those prices I need an id (called ISIN), a trade date and a value. So for me the logical order is like I just descriped. Then I select ISIN and TradeDate as keys and I'm done. The table is created just like that and because I have selected the two keys the table has a cluster index on the order ISIN, TradeDate. See the scripts a the bottom of the post.

So when I select * from Prices where ISIN = 'DK00000111' and TradeDate = '20110927' it matches right up with the index and because its a cluster index the data is physically stored in order by ISIN, TradeDate sort of speak. Thats all very nice and quick.

Until.... you have a lot of data. See what happens when I each day stores the prices from today. I have around 500.000 different ISIN's and let say one TradeDate. And because the cluster says the data should be order by ISIN, TradeDate it (SQL Server) makes a lot of moving around to get those 500K of prices in order.

The solution is simple. I should decide the order of the cluster myself. And if I switch around on the two keys and use the order TradeDate, ISIN instead what would that mean? Well then I don't move data around but insert directly order by ISIN for todays prices. And now instead of using an hour for a few million records I just use minuts. And it does not affect my select * from Prices where ISIN = 'DK00000111' and TradeDate = '20110927' as it can still use the cluster index. See the changed script at the bottom of the post.

So whats important when designing tables which will hold large amount of data is design the clustered index meet the best crossing between inserts and selection for your job. Especially if its a detail table which often have more than one key - the natural order of columns is almost never the natural order of the cluster. And about the selectivity of the index I'll write a blog soon about that - as I just found out a few weeks ago that this can be optimized using a certain order of the columns in the index :) More about that later.

As I mention in my real live table for prices I have 8 columns which is the key fields. And sadly enough my prices don't come just for one day as I get prices from all over the world. So some prices may be 5-10 days old when I get them. So there was still a lot of moving going on.

That let me to thing about why did I have a keys on - and why the cluster? For this table we rarely do selects as all the prices we use for calculation is in a upfront table which only holds the last week of prices for quick processing. So I only used the cluster index to insure that records was unique and when a rare select was performed that it would not take ages.

That means I didn't really need the cluster. I could insure the uniqueness and the selection with a normal non-clustered index. So I removed the cluster and just made a non-cluster index. And my insertion of new prices is faster than ever and no data moving or ordering are done. Just into the pin and away we go. And the selection was still fast as the cluster didn't matter for that anyway. The many columns for the cluster and the nature of the data made sure that the SQL Server couldn't just spool data in from the order of the cluster. It had to go digging all the same.

I guess I could have made a non-clustered constraint to insure the uniqueness of that but I figured that the non-clustered index did the same trick and gave me an index all in one.

This is a very simple explanation of cluster and what to do. But the best lesson this gave me was always go the script way. Below you see the above sample table th MS way and the way I would create it.

MS Way


CREATE TABLE dbo.Prices
(
ISIN varchar(12) NOT NULL,
TradeDate datetime NOT NULL,
Value float(53) NULL
)  ON [PRIMARY]
GO
ALTER TABLE dbo.PricesADD CONSTRAINT
PK_Prices PRIMARY KEY CLUSTERED
(
ISIN,
TradeDate
) WITH( STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]


My way 


CREATE TABLE dbo.Prices
(
ISIN varchar(12) NOT NULL,
TradeDate datetime NOT NULL,
Value float(53) NULL
)
GO
ALTER TABLE dbo.PricesADD CONSTRAINT
PK_Prices PRIMARY KEY CLUSTERED
(
TradeDate,
ISIN
)

No comments:

Post a Comment