T-SQL Speed Efficiencies with Double Negatives
I left work last night somewhere between 10:00pm and 10:30pm. I tend to lose track of time when I'm absorbed in a subject and there are very few subjects more absorbing than query optimisation IMHO. I guess I really am a geek (even though I fought that label for some time).
The problem I was working on last night was an awful query, courtesy of a developer at the organisation I work for, who shall remain nameless (just in case anyone actually reads this blog). He is writing an app that deals with meeting room bookings and this particular piece of code was supposed to be checking for meeting request overlaps in any given room. His query looked something like this (from memory):
SELECT r1.RoomID, m1.MeetingID, m1.StartTime, m1.EndTime FROM dbo.Room as r1
CROSS JOIN dbo.Meeting as m1
WHERE m1.MeetingName = 'My Test Meeting'
AND r1.RoomID not in
(
-- Get all the meetings that currently overlap with that room & timeslot
SELECT ISNULL(r2.RoomID,0) FROM dbo.Room as r2
INNER JOIN dbo.Meeting as m2 on m2.RoomID = r2.RoomID
CROSS JOIN dbo.Meeting as this
WHERE this.MeetingName = 'My Test Meeting'
AND m2.RoomID = r2.RoomID
-- Here begins the datetime checking fun/fiasco
AND
(
this.StartTime BETWEEN m2.StartTime AND m2.EndTime
OR this.EndTime BETWEEN m2.StartTime AND m2.EndTime
OR
(
m2.StartTime BETWEEN this.StartTime AND this.EndTime
OR m2.EndTime BETWEEN this.StartTime AND this.EndTime
)
)
AND r2.SeatingLayout = .... etc.
)
AND r1.SeatingLayout = ... etc.
Of course, there were no comments in his code and it wasn't anywhere near as clear as my paraphrase/simplification above (not that my simplification is all that clear, but it's more readable in terms of indenting, etc.), so it took me a good hour to figure out what he was really trying to achieve. That was the first step in optimising his query - working out what it was attempting to do so I could rephrase it.
Anyway, he was basically getting a list of all the RoomIDs that had meetings already scheduled that conflicted with the time slot of the new meeting. He was then picking all the rooms that weren't in that list and presenting them as potential candidates for the new meeting. When he commented out the subquery it was "uber-fast" according to him (apparently a 2 second response time is "uber-fast"). When he uncommented the subquery "it took forever" (which means it ran for 3 minutes & 20 seconds). He wanted to know what was wrong with the subquery that was killing it (how about the fact that with the subquery it was doing just under 3 million logical reads; my final rewrite did about 4,000 logical reads and was logically equivalent).
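(By the way, the logical read and timing figures I quote come from SQL Server's per-statement statistics; wrapping the query under test like this shows them in the messages pane - the SELECT below is only a placeholder, not the real query.)
SET STATISTICS IO ON
SET STATISTICS TIME ON
-- Run the query being tuned here; logical reads are reported per table,
-- elapsed and CPU time per statement
SELECT r1.RoomID FROM dbo.Room as r1
SET STATISTICS IO OFF
SET STATISTICS TIME OFF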
So, the first step (well actually the second, because the first step was trying to decipher the goal behind his cryptic code) was to run the subquery on its own and try to find a way to rewrite it to make it more efficient. He'd written the whole query in such a way that you could just cut and paste the subquery into another window and run it on its own (at least that was helpful). I could see from the actual execution plan in Query Analyzer (and also just from looking at the code) that the I/O on dbo.Meeting was fairly horrendous, requiring clustered index scans and accounting for about 90% of the execution cost. So the obvious place to start optimising was the date range logic.
I was lucky enough about a month ago to attend the Sydney SQL User Group meeting where Itzik Ben-Gan was talking about SQL & logic (basically how to solve difficult SQL problems by "thinking outside the box"), and the following week a one-day Itzik seminar on advanced T-SQL. One of the things Itzik proposed for solving some problems (it's not applicable to every situation) was rephrasing an expression in the negative and then negating the result to come up with the logical equivalent of the original query. "X = - -X", if you like. For example, instead of finding all the rooms with meetings booked that conflicted with the time slot in question, how about finding all the rooms that had no meetings conflicting with the time slot? And then, instead of saying show me the rooms not in this list, you can say show me the rooms that match.
You can negate that whole date-checking mess (no fewer than eight comparisons once the BETWEENs are expanded, strung together with ANDs and ORs!) and replace it with two simple comparisons:
WHERE m2.StartTime > this.EndTime
OR m2.EndTime < this.StartTime
In other words, the existing meeting finishes before the new one starts or starts after the new one finishes. No overlap. So the subquery becomes:
SELECT ISNULL(r2.RoomID,0) FROM dbo.Room as r2
INNER JOIN dbo.Meeting as m2 on m2.RoomID = r2.RoomID
CROSS JOIN dbo.Meeting as this
WHERE this.MeetingName = 'My Test Meeting'
AND m2.RoomID = r2.RoomID
-- Date checking logic now becomes
AND (m2.StartTime > this.EndTime OR m2.EndTime < this.StartTime)
AND r2.SeatingLayout = .... etc.
Much nicer. And much less expensive in I/O too. Then the condition in the outer query changes from NOT IN (...) to IN (...). That made a huge improvement, but I wasn't finished - there were still clustered index scans to get rid of, and improvements to make in the joins and the correlation between the outer query and the subquery. The clustered index scans disappeared when I created a couple of nonclustered covering indexes on a few strategic columns - yeah, looking good.
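The covering indexes probably looked something like the following - I'm reconstructing from memory, so the index names and exact column choices are illustrative only, based on the predicates above rather than the actual DDL:
-- Covers the RoomID correlation plus the StartTime/EndTime range checks (illustrative name)
CREATE NONCLUSTERED INDEX IX_Meeting_RoomID_Times ON dbo.Meeting (RoomID, StartTime, EndTime)
-- Covers the lookup of the meeting being scheduled by name (illustrative name)
CREATE NONCLUSTERED INDEX IX_Meeting_Name_Times ON dbo.Meeting (MeetingName, StartTime, EndTime)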
Now, what's the point of including the dbo.Meeting table a second time in the subquery (just for the StartTime & EndTime to check against) when you already have all the data you need in the outer query? Needless I/O. So get rid of the cross join in the subquery (that is the dbo.Meeting table aliased as 'this') and just use the StartTime & EndTime from the outer query. And the IN (...) predicate can be expressed in a slightly more efficient way by using the EXISTS (...) predicate, which stops its scan as soon as it finds a matching row.
My final query looked something like this (the actual code is slightly different since I'm going from memory, but the double negative is the tactic I used):
SELECT r1.RoomID, m1.MeetingID, m1.StartTime, m1.EndTime FROM dbo.Room as r1
CROSS JOIN dbo.Meeting as m1
WHERE m1.MeetingName = 'My Test Meeting'
AND EXISTS
(
-- Get all the rooms that have no meeting overlap with that timeslot
SELECT * FROM dbo.Room as r2
INNER JOIN dbo.Meeting as m2 on m2.RoomID = r2.RoomID
WHERE m2.RoomID = r2.RoomID
AND r1.RoomID = COALESCE(r2.RoomID, 0)
-- Use the outer query's meeting (m1) for the timeslot now that the 'this' cross join is gone
AND (m2.StartTime > m1.EndTime OR m2.EndTime < m1.StartTime)
AND r2.SeatingLayout = .... etc.
)
AND r1.SeatingLayout = ... etc.
The result: the same resultset, 380ms, 4,000 logical reads. That's 0.19% of the response time (an improvement factor of over 500) and 0.14% of the I/O (an improvement factor of about 750). Now, if 2 seconds for a query that does none of the time slot conflict checking is "uber-fast", then a query that does all the time slot checking as well and runs in 380ms (with a whole lot less I/O), I think it's fair to say, screams! And it's mostly down to negating the date checking logic and then reversing it again in the outer query - thanks Itzik!
A late night at work, yes, but one well spent.
TechEd 2005 Australia coming soon...
TechEd 2005 Australia on the Gold Coast is coming soon: Aug 30 - Sep 2. Apparently Ron Talmage, Greg Linwood & Greg Low will be presenting (according to Kalen), but there's no news of Kalen or Itzik presenting, even though both are in Australia at the moment (at least I think Itzik is still here).
Should be good anyway. There's gobs of stuff, particularly SQL 2005 & BI stuff, getting presented in the sessions (here's a rough agenda).
I just hope my employer is willing to subsidise me a little...make that 100%. <g>
Index Fill Factor & Performance Considerations
A developer colleague (let's call him Pete, because that's his name) asked me today about fill factors and the performance considerations behind choosing a fill factor for an index...so I thought, since it was fresh in my mind, I'd blog it. His basic situation is that he's developing a middle-tier feature around a new table with 200,000 rows in it, and it took him about 2 minutes to delete those 200,000 rows, which seemed excessive. He suspected it had something to do with the fill factor he was choosing (I don't actually know what fill factor he chose) and so he asked me to explain a little about fill factors.
Well...
The fill factor applied to an index defines how full each leaf level page is when the index is created (or rebuilt with DBCC DBREINDEX(), which, for all intents & purposes, is basically the same as doing a DROP INDEX followed by a CREATE INDEX). If you specify a fill factor of 90 then each leaf page in the index will initially be 90% full of data. Fill factors are more relevant with clustered indexes (the actual table data) simply because clustered indexes tend to be much wider than your non-clustered indexes, but the same principles apply to both clustered and non-clustered indexes.
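Just to illustrate the mechanics (the table and index names here are made up), you specify the fill factor when you create the index, or when you rebuild it with DBCC DBREINDEX():
-- Create the index with each leaf page roughly 90% full (illustrative names)
CREATE NONCLUSTERED INDEX IX_Booking_RoomID ON dbo.Booking (RoomID) WITH FILLFACTOR = 90
-- Rebuild the index later and reapply (or change) the fill factor
DBCC DBREINDEX ('dbo.Booking', 'IX_Booking_RoomID', 90)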
So, why leave empty space in an index page? Well, what happens when a page becomes 100% full and SQL Server needs to insert another row into it? SQL Server takes the full page, splits it into 2 roughly equally filled pages (i.e. about 50% full each) and then inserts the new row into the appropriate one of those 2 pages, both of which should now have enough free space for it. This is known as a page split. As you can imagine, page splits increase the I/O activity on an index (allocating new pages, moving data, shuffling pages in the b-tree to put them in the right place, all in addition to the insert you originally wanted to do) and so hurt performance. So you want to try to minimise page splits.
So does that mean I should just create my indexes with a fill factor of, say, 5%? (This was how the actual conversation with Pete went.) Well, if each page was only 5% full (as opposed to 100% full) then you'd need 20 times as many pages (each of which is 8k) to store the data. So we're talking about a pretty big index which is mostly full of empty space. That's great in terms of inserts & modifications (assuming you've got enough disk space to handle indexes that are suddenly 20 times larger than before) - you wouldn't expect page splits very often. However, if you're retrieving a range of data from that index with a SELECT statement then that data is most likely going to be spread out across 20 times more pages than before. So your range scans are going to have to hit 20 times more pages to return the same amount of data as before, hence 20 times more I/O.
Generally speaking, read performance of an index is directly proportional to the fill factor and write performance is inversely proportional to the fill factor. Damned if you do, damned if you don't. This is one of the reasons you've got to know the data you're playing with. Do you optimise the index for reads or writes? Generally speaking, once again, an index will be used much more for reads than writes (Microsoft estimates that reads typically outnumber writes by at least a factor of 5 to 10). So, will this index (or the table in general) mostly get used for reads or writes? Also, when new rows get inserted into the index are they going to be allocated at the end of the index (in sequential order, like an IDENTITY column for example) or are they going to be all over the place (like new uniqueidentifier values assigned with NEWID())? The answers to these two questions will be very important in determining what initial fill factor to set when creating your index (or rebuilding it later), assuming you don't want to stick with the server default (set with sp_configure).
Firstly, since indexes are generally read more than written to, you should probably err on the side of larger fill factors (for example, at least greater than 50%). If you're expecting a large percentage of writes on the index then you'll want a lower fill factor to provide some room for future inserts. On the other hand, if you're not expecting many writes then you should fill the pages more (if you don't change the server default fill factor and you don't specify one when creating an index, SQL Server will fill each page as much as it can, i.e. 100%). That's pretty straightforward.
However, you may also be able to get away with a very high fill factor if new rows will almost always go at the end of the index (not in the middle) and existing rows will rarely, if ever, get changed. This will create a hotspot at the end of the index and page splits will be rare because almost all new rows will go at the end of the index resulting in simple new page allocations rather than page splits. The classic example is a clustered index on an identity column; this is one argument for the surrogate key side of the famous (or should I say infamous?) surrogate key versus natural key debate. But I won't get into that can of worms here.
The basic theory behind fill factors that I've described here should stand you in good stead to come up with general rules of thumb when designing your indexes. But, since it's essentially a trade-off between read performance and write performance, that will really only give you a starting point. From there it's a case of trial and error, bearing in mind how changes in fill factor affect performance. If you're getting excessive page splits then rebuild your index with a smaller fill factor. If I/O is high during SELECT statements and it's affecting performance then you may be able to rebuild your index with a bigger fill factor. (Fill factors are only a very minor point to consider during query optimisation; good query design and table normalisation will usually play a much bigger role.)
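As a rough sketch of that trial-and-error loop (again, the object names are made up): check how full and how fragmented the leaf pages are, then rebuild with an adjusted fill factor.
-- Scan Density and Avg. Page Density in the output indicate fragmentation and page fullness
DBCC SHOWCONTIG ('Booking', 'IX_Booking_RoomID')
-- Too many page splits? Rebuild with more free space per page...
DBCC DBREINDEX ('dbo.Booking', 'IX_Booking_RoomID', 80)
-- ...or, if range scans are doing too much I/O, rebuild with fuller pages
DBCC DBREINDEX ('dbo.Booking', 'IX_Booking_RoomID', 95)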
One more thing to think about is that with page splits comes index fragmentation. Fragmentation is really pretty unavoidable but excessive page splits will lead to greater index fragmentation and reduced I/O performance, both read performance and write performance (and I/O is typically the bottleneck in RDBMSs). But index fragmentation is a topic for a whole other blog. And, by the way, I think Pete's slow deletes problem had more to do with waits/locks on the resources he was trying to delete than fill factors...but anyway...
A few neat SQL tricks
While reading Advanced Transact-SQL for SQL Server 2000 by Itzik Ben-Gan & Tom Moreau this week I came across a few tips, mostly about dealing with dates. I'd already figured most of them out previously, but not all, and I thought they'd be handy to note anyway.
Left Padding Numerics with Leading Zeros
How many times have you needed to present numeric data with a fixed length? If you cast an int to a string datatype you often get a bunch of trailing spaces. Instead, you can convert the int to a fixed-length string with STR(), which right-aligns the data, and then change the leading spaces to zero characters with REPLACE(). For example, to display the integer 52 as a five-character fixed-length string with leading zeros:
DECLARE @i int
SET @i = 52
SELECT REPLACE(STR(@i, 5, 0), ' ', '0')
Finding the First Day of the Month
The CONVERT() function will convert a datetime value into a string (among its other uses), and its third parameter, the style, can deal with many different date & time formats. Style 112 (yyyymmdd) is particularly useful for extracting date information from a datetime or smalldatetime value, and it comes in handy for finding the first day of a given month. The first 6 characters give us the year & month, and then all we need to do is concatenate '01' to the end of that string, as follows:
SELECT CAST(CONVERT(char(6),GETDATE(),112) + '01' as smalldatetime)
Finding the Last Day of the Month
This is really a variation on the previous theme. The last day of a given month is the day before the first day of the next month. So if we can get the first day of the next month, we can easily get the day before that (a.k.a. the last day of the month). We can use the DATEADD() function to add and subtract components of a date. So we get the first day of the given month (as above), add 1 month and then subtract 1 day, as follows (the CAST() from the previous example is unnecessary here because the character string produced by CONVERT() + '01' will be implicitly converted to a datetime by DATEADD()):
SELECT DATEADD(dd, -1, DATEADD(mm, 1, CONVERT(char(6),GETDATE(),112) + '01'))
Unambiguous Date Formats
This is not exactly a tip, but since it's a question that regularly gets asked in the public sqlserver newsgroups (although usually indirectly and unintentionally) it's worthwhile noting. Many date formats are ambiguous. Formats such as 1/6/2005 or even 2005-06-01 are ambiguous; both can be interpreted as June 1, 2005 or January 6, 2005 depending on the locale and date format settings. However, the ISO format (styles 12: yymmdd, and 112: yyyymmdd) is always unambiguous. So when converting dates for use in calculations you should always use the ISO format. Likewise, when embedding constant datetime values in code you should always use the yyyymmdd hh:mi:ss.mmm format as it's always unambiguous. For example:
DECLARE @d datetime
SET @d = '20050601 11:35:00.000'
SELECT @d
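To see the ambiguity in action: SET DATEFORMAT controls how SQL Server interprets the ambiguous formats, while the ISO yyyymmdd format ignores it entirely.
SET DATEFORMAT dmy
SELECT CAST('1/6/2005' as datetime) -- 1 June 2005
SET DATEFORMAT mdy
SELECT CAST('1/6/2005' as datetime) -- 6 January 2005
SELECT CAST('20050601' as datetime) -- always 1 June 2005, regardless of DATEFORMAT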
I hope these little tips are as helpful to you as they have been to me.
ANSI SQL99 Standard
Given how important the ANSI SQL grammar is to the current SQL RDBMS products, such as Oracle 10g, Microsoft SQL Server 2000, Sybase ASE 12.5.3 & DB2 ESE, I find it surprising how difficult it is to obtain a copy of the standard online (for free, that is <g>). I've been looking on and off over the last week or so for a copy online so I can verify which parts of Microsoft's T-SQL implementation are ANSI standard and which aren't. For example, I asked in a news://microsoft.public.sqlserver.server thread today (as a tangent to the actual subject of the thread) whether the COALESCE() and/or ISNULL() functions were part of the ANSI SQL99 grammar. Tibor Karaszi, SQL Server MVP and associate mentor at Solid Quality Learning (a training/mentoring company that specialises in acquiring the top names in SQL Server), responded that COALESCE() is in the ANSI standard but ISNULL() is a Microsoft-specific addition.
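For the record, and just as a quick sketch, the two look interchangeable in the simple case that prompted the question; the difference is that ISNULL() is T-SQL specific and takes exactly two arguments, while COALESCE() is the ANSI equivalent, takes two or more arguments and returns the first non-NULL value:
DECLARE @RoomID int
SET @RoomID = NULL
SELECT ISNULL(@RoomID, 0)   -- Microsoft-specific, returns 0
SELECT COALESCE(@RoomID, 0) -- ANSI standard, returns 0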
How did he know that? Did he fork out the cash to buy the documents from the ISO? Actually, being an MVP he's got access to a little more info than the rest of us (like the developers at Redmond for example) and his colleagues at Solid Quality Learning are gurus, if not legends, in the MSSQL world. He could always have books in his collection such as SQL:1999 - Understanding Relational Language Components by Jim Melton (who I'm sure would have had a genuine copy of the grammar documentation from the ISO). As far as I can tell, if you want a copy of the SQL-99 standards documentation you have to be willing to shell out the Swiss Francs for it.
I think it's a bit disappointing that Microsoft have not included this kind of info in SQL Books Online. (Yes, I found the insert hyperlink button in the blogger wysiwyg editor today.) Just a footnote at the bottom of various topics, such as ISNULL(), to let us know if that language feature is part of the ANSI standard or not. Is that too much to ask? I think SQL Books Online is a brilliant online reference for SQL Server and is always my first port of call when I need to check something SQL Server related, but it's not perfect and I think this kind of "ANSI standard or not" footnote info would be a helpful addition in the next version of BOL.
Looks like I'll be adding a couple Jim Melton titles to my personal library. They'll go well with my Kalen Delaney, Ken Henderson and Jeff Shapiro tomes. I might even pick up a bit of Joe Celko while I'm at it. Now where did I put that corporate AMEX card...
Transaction Logs and File Growth/Shrinkage
I've spent a bit of time this year answering public SQL newsgroup questions and reading the answers that the MVPs give. Occasionally they are the same, which is encouraging. It occurs to me that many of the questions are based on common themes. The most common questions I've seen tend to be variations on "how does the transaction log work? when should I back it up? how do I shrink the file?"
The transaction log is a very important part of the DB puzzle. The subject is far too big to cover in any depth in a single blog (entire chapters or even books have been written about the transaction log). However, I'll give it a quick bash.
Any change to the data is written to the transaction log before being written to the data file(s). This is known as write-ahead logging. When data is modified the change is made to a copy of the page in the buffer cache (memory) and also to the log cache in memory. Before the data page is flushed (written to disk) the log page must be flushed. This guarantees that the change can be rolled forward if the server should fail before the modified data page is flushed (or rolled back if it's part of an incomplete transaction).
As you can see the transaction log is vital in any data modification operation (as is smooth memory management and a lack of contention for memory). That's why log performance & backups are so important. If the log is performing poorly then all data change operations to the database are performing poorly. If changes to the database have to be safe from corruption and/or loss then you have to backup the transaction log - the potential data loss window, in the case of a failure, is the same as the transaction log backup interval.
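A minimal log backup looks something like this (the database name and backup path are placeholders I've made up); schedule it at whatever interval matches the data loss you can tolerate:
-- Back up the transaction log; the potential data loss window is the interval between these backups
BACKUP LOG Bookings TO DISK = 'E:\SQLBackups\Bookings_log.bak'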
And guess what happens when you grow or shrink a data or log file in SQL Server? All I/O to that particular file must stop until the file growth/shrink is finished. That means if you shrink a log file, no data modification can happen until the shrink has finished. If the log file got that big in the first place then chances are it will need to grow that big again some time in the future. And when it needs to grow, what will happen to all I/O to the log file? And while the transaction log is paused during the growth operation, what will happen to any data modification operations?
So you can see why fiddling with the size of the transaction log is a very expensive thing to do. The best thing to do is make a generous estimate of the biggest size you think the transaction log will need to be and create it at that size when you first create the database. This, of course, depends on the amount of modification activity you expect the database to undergo and the frequency at which you intend to back up the transaction log (the transaction log is truncated each time you back it up).
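For example, something along these lines (all names, paths and sizes here are invented purely for the sketch) creates the log generously sized up front so it rarely needs to grow:
CREATE DATABASE Bookings
ON PRIMARY
( NAME = Bookings_Data, FILENAME = 'D:\SQLData\Bookings.mdf', SIZE = 500MB, FILEGROWTH = 100MB )
LOG ON
-- Generous initial log size and a large growth increment to keep growth events rare
( NAME = Bookings_Log, FILENAME = 'E:\SQLLogs\Bookings_log.ldf', SIZE = 1000MB, FILEGROWTH = 250MB )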
Of course, if you're not overly concerned about losing data, or full DB backups will suffice, then you can simply set the database to use the "simple" recovery model, which truncates the transaction log at each checkpoint (the timing of automatic checkpoints depends on the log activity for the DB and the recovery interval set for the server) - so you can set the initial log size to something fairly small. The recovery model for a database is set with the ALTER DATABASE statement:
ALTER DATABASE <db_name> SET RECOVERY SIMPLE
Anyway, that's enough for the time being on transaction logs and why you should try to minimise file growth/shrink operations. I may harp on about it a bit at a later date.