Thursday, June 30, 2005

T-SQL Speed Efficiencies with Double Negatives

I left work last night somewhere between 10:00pm and 10:30pm. I tend to lose track of time when I'm absorbed in a subject and there are very few subjects more absorbing than query optimisation IMHO. I guess I really am a geek (even though I fought that label for some time).

The problem I was working on last night was an aweful query, curtesy of a developer at the organisation I work for, who shall remain nameless (just in case anyone actually reads this blog). He is writing an app that deals with meeting room bookings and this particular piece of code was supposed to be checking for meeting request overlaps in any given room. His query looked something like this (from memory):
SELECT r1.RoomID, m1.MeetingID, m1.StartTime, m1.EndTime FROM dbo.Room as r1
CROSS JOIN dbo.Meeting as m1
WHERE m1.MeetingName = 'My Test Meeting'
AND r1.RoomID not in
(
-- Get all the meetings that currently overlap with that room & timeslot
SELECT ISNULL(r2.RoomID,0) FROM dbo.Room as r2
INNER JOIN dbo.Meeting as m2 on m2.RoomID = r2.RoomID
CROSS JOIN dbo.Meeting as this
WHERE this.MeetingName = 'My Test Meeting'
AND m2.RoomID = r2.RoomID
-- Here begins the datetime checking fun/fiasco
AND
(
this.StartTime BETWEEN m2.StartTime AND m2.EndTime
OR this.EndTime BETWEEN m2.StartTime AND m2.EndTime
OR
(
m2.StartTime BETWEEN this.StartTime AND this.EndTime
OR m2.EndTime BETWEEN this.StartTime AND this.EndTime
)
)
AND r2.SeatingLayout = .... etc.
)
AND r1.SeatingLayout = ... etc.
Of course, there were no comments in his code and it wasn't anywhere near as clear as my paraphrase/simplification above (not that my simplification above is all that clear but it's more readable in terms of indenting, etc.), so it took me a good hour to figure out what he was really trying to achieve (and that was the first step in optimising his query - trying to figure out what it was attempting do so I could rephrase it).

Anyway, he was basically getting a list of all the RoomIDs that had meetings already scheduled that conflicted with the time slot of the new meeting. He was then picking all the rooms that weren't in that list and presenting them as potential candidates for the new meeting. When he commented out the subquery it was "uber-fast" according to him (apparently a 2 second response time is "uber-fast"). When he uncommented the subquery "it took forever" (which means it ran for 3 minutes & 20 seconds). He wanted to know what was wrong with the subquery that was killing it (how about the fact that with the subquery it was doing just under 3 million logical reads; my final rewrite did about 4,000 logical reads and was logically equivalent).

So, the first step (well actually the second, because the first step was trying to decipher the goal behind his cryptic code) was to run the subquery on its own and try to find a way to rewrite it to make it more efficient. He'd written the whole query in such a way so that you could just cut and paste the subquery to another window and run it on its own (at least that was helpful). I could see from the actual execution plan in Query Analyzer (and also just from looking at the code) that the I/O on dbo.Meeting was fairly horrendous, requiring clustered index scans and attributed to about 90% of the execution cost. So the obvious place to start optimising was the date range logic.

I was lucky enough about a month ago to attend the Sydney SQL User Group meeting when Itzik Ben-Gan was talking about SQL & Logic (basically how to solve difficult SQL problems by "thinking outside the box"), and the following week a 1 day Itzik seminar on advanced T-SQL. One of the things Itzik proposed in order to solve some problems (it's not applicable to every situation) was rephrasing an expression in the negative and then negating the result to come up with the logic equivalent as the original query. "X = - -X" if you like. For example, instead of finding all the rooms with meetings booked that conflicted with the time slot in question, how about finding all the rooms that had no meetings that conflicted with the time slot? And then, instead of saying show me the rooms not in this list, you can say show me the rooms that match?

You can negate that whole date checking logic mess (which comprises no less than 8 comparisons, half of which are combined with OR operators!) by two simple comparisons:
WHERE m2.StartTime > this.EndTime
OR m2.EndTime < this.StartTime
In other words, the existing meeting finishes before the new one starts or starts after the new one finishes. No overlap. So the subquery becomes:
SELECT ISNULL(r2.RoomID,0) FROM dbo.Room as r2
INNER JOIN dbo.Meeting as m2 on m2.RoomID = r2.RoomID
CROSS JOIN dbo.Meeting as this
WHERE this.MeetingName = 'My Test Meeting'
AND m2.RoomID = r2.RoomID
-- Date checking logic now becomes
AND (m2.StartTime > this.EndTime OR m2.EndTime < this.StartTime)
AND r2.SeatingLayout = .... etc.
Much nicer. And less expensive in I/O too...much. Then the condition in the outer query changes from NOT IN (...) to IN (...). So that made a huge improvement, but I wasn't finished - there were still clustered index scans to try to get rid of and improvements to make in the joins & correlation between the subqueries. The clustered index scans disappeared when I created a couple nonclustered covering indexes on a few strategic columns - yeah, lookin good.

Now, what's the point of including the dbo.Meeting table a second time in the subquery (just for the StartTime & EndTime to check against) when you already have all the data you need in the outer query? Needless I/O. So get rid of the cross join in the subquery (that is the dbo.Meeting table aliased as 'this') and just use the StartTime & EndTime from the outer query. And the IN (...) predicate can be expressed in a slightly more efficient way by using the EXISTS (...) predicate, which stops its scan as soon as it finds a matching row.

My final query looked something like this (the actual code is slightly off (this is from memory) but the double-negative is the tactic I used):
SELECT r1.RoomID, m1.MeetingID, m1.StartTime, m1.EndTime FROM dbo.Room as r1
CROSS JOIN dbo.Meeting as m1
WHERE m1.MeetingName = 'My Test Meeting'
AND EXISTS
(
-- Get all the rooms that have no meeting overlap with that timeslot
SELECT * FROM dbo.Room as r2
INNER JOIN dbo.Meeting as m2 on m2.RoomID = r2.RoomID
WHERE m2.RoomID = r2.RoomID
AND r1.RoomID = COALESCE(r2.RoomID,'0')
AND (m2.StartTime > this.EndTime OR m2.EndTime < this.StartTime)
AND r2.SeatingLayout = .... etc.
)
AND r1.SeatingLayout = ... etc.
The result: same resultset, 380ms, 4000 logical reads. That's 0.19% the response time (an improvement factor of over 500) and 0.14% as much I/O (an improvement factor of about 750). Now, if 2 seconds for a query that does none of the time slot conflict checking is "uber-fast" then a query that is logically equivalent except that it does all the time slot checking as well and runs in 380ms (with a whole lot less I/O), I think it's fair to say, screams! And mostly due to negating the date checking logic and then reversing it again in the outer query - thanks Itzik!

A late night at work, yes, but one well spent.

3 Comments:

At 14/7/05 10:24, Anonymous Anonymous said...

..sniff.. i worked so long on that query!

 
At 23/6/06 00:04, Anonymous Anonymous said...

declare @EndTime datetime, @StartTime datetime
select @EndTime = m.EndTime, @StartTime = m.StartTime
from dbo.Meeting m where m1.MeetingName = 'My Test Meeting'

select r2.*
from dbo.Room as r2
left join dbo.Meeting as m2
on m2.RoomID = r2.RoomID
and m2.StartTime < @EndTime
and m2.EndTime > @StartTime
where m2.StartTime is null

 
At 26/9/07 03:03, Blogger 釈情 said...

this article on event overlap is very valuable. it's not easy to find much on the topic.

I tried to adapt your double-negative idea to a related problem: retreiving only the latest version of a meeting - that is get all non-overlapping meetings for a location, and the latest overlapping meeting for a location.

CREATE TABLE SBSSM.EventSegment
(
EventSegment_ID int IDENTITY (1, 1) NOT NULL,
StartTime datetime NOT NULL,
EndTime datetime NOT NULL,
Activity_ID int NOT NULL,
Location_code nvarchar(20) NOT NULL,
[DBTimeStamp] [timestamp] NULL,
CONSTRAINT IUC220 PRIMARY KEY(EventSegment_ID),
constraint ETGTST check (EndTime > StartTime)
)
GO

insert into sbssm.eventsegment (StartTime, EndTime, Activity_ID, Location_code) values
('2007-09-24 17:00','2007-09-24 18:00',1,'PL')
insert into sbssm.eventsegment (StartTime, EndTime, Activity_ID, Location_code) values
('2007-09-24 16:00','2007-09-24 17:00',1,'PL')
insert into sbssm.eventsegment (StartTime, EndTime, Activity_ID, Location_code) values
('2007-09-24 16:00','2007-09-24 17:00',1,'PL')
insert into sbssm.eventsegment (StartTime, EndTime, Activity_ID, Location_code) values
('2007-09-24 17:00','2007-09-24 18:00',1,'PL')
insert into sbssm.eventsegment (StartTime, EndTime, Activity_ID, Location_code) values
('2007-09-24 16:30','2007-09-24 17:30',1,'PL')
insert into sbssm.eventsegment (StartTime, EndTime, Activity_ID, Location_code) values
('2007-09-24 15:00','2007-09-24 16:00',1,'PL')
insert into sbssm.eventsegment (StartTime, EndTime, Activity_ID, Location_code) values
('2007-09-24 15:00','2007-09-24 16:00',1,'PL2')
insert into sbssm.eventsegment (StartTime, EndTime, Activity_ID, Location_code) values
('2007-09-24 15:00','2007-09-24 16:00',2,'PL')

Select * from sbssm.EventSegment
-- EventSegment_ID 5,7,8 are the ones needed

--my having clause causes yack:
select max(es.EventSegment_ID) esID, es.Activity_ID, es.StartTime, es.Endtime, es.Location_code
from sbssm.EventSegment es
join sbssm.EventSegment es2
on es.Location_code = es2.Location_code
and es.Activity_ID = es2.Activity_ID
and (es.starttime >= es2.endtime or es.endtime <= es2.starttime)
Group by es.Activity_ID, es.StartTime, es.Endtime, es.Location_code
having (es.starttime >= es2.endtime or es.endtime <= es2.starttime)


---
getting severe brain damage on this one. if you can see what's wrong, you're not only helping me, you're helping the world! TKS

 

Post a Comment

<< Home