Deleting backup/restore history
Have you ever come across some problem in a programming language or application and thought, "Surely somebody must have come across this before"? I have. In fact, I do surprisingly often; at least, it surprises me how often I have these little dilemmas. A few years ago I found a couple of fairly significant inefficiencies in some basic features of Microsoft SQL Server 2000 (e.g. reading the errorlog through the GUI and deleting backup/restore history in the msdb database). In the last couple of days I have "re-found" the backup/restore history culling problem with the system tables in Microsoft SQL Server 2005.
First, a little background. SQL Server retains history information about backup and restore events that occur on a SQL instance. This data is inserted into a handful of system tables in the msdb database (such as dbo.backupset, dbo.backupmediaset, dbo.restorefile, dbo.restorehistory, etc.). On SQL instances that continually do backups and/or restores, such as DR instances involved in a log shipping topology, this historic information can really add up. One of our DR instances is continually restoring transaction log backups for about 200 databases every 15 minutes, 24 hours a day, 7 days a week. If you look at our scenario from a very simplistic perspective (i.e. 1 insert into 1 system table for each event) you're talking almost 7 million new rows of history data per year. When you take into account that there are several tables involved in this history recording process, you can understand how the size of your msdb database can quickly get out of control.
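If you're curious how much of this history has piled up on your own instance, a quick row count across the main msdb history tables gives a feel for it (just a sketch; the table list is as per SQL Server 2005):

```sql
-- Rough row counts for the main backup/restore history tables in msdb.
use msdb;
select 'backupset' as history_table, count(*) as row_count from dbo.backupset
union all select 'backupmediaset', count(*) from dbo.backupmediaset
union all select 'backupmediafamily', count(*) from dbo.backupmediafamily
union all select 'backupfile', count(*) from dbo.backupfile
union all select 'restorehistory', count(*) from dbo.restorehistory
union all select 'restorefile', count(*) from dbo.restorefile
union all select 'restorefilegroup', count(*) from dbo.restorefilegroup;
```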
Microsoft have kindly provided a couple of stored procs in the msdb database for culling backup/restore history: dbo.sp_delete_database_backuphistory (for culling the history for an entire database) and dbo.sp_delete_backuphistory (for culling all backup/restore history older than a given date across all databases). When you drop a database with SSMS (SQL Server Management Studio) there is an option to also delete backup/restore history. If you check this option then SQL Server will do a "drop database MyDatabase" and also an "exec msdb.dbo.sp_delete_database_backuphistory 'MyDatabase'". If you've ever tried this then you'll know it can take a long, looong, loooooong time if your DB has been involved in many backup/restore events (even if the DB itself hasn't been backed up or restored very much but the SQL instance has done many backups and/or restores, i.e. if the history tables are large). You will also probably have noticed that while this history is being deleted there can be huge blocking chains of backup and restore jobs waiting to write their new rows into these history tables (there are a few locking issues here as far as concurrency is concerned). The main reason for all this is that a couple of fairly strategic indexes are missing from a couple of fairly core tables.
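For reference, the two procs are called like this (the database name and cutoff date below are just placeholders):

```sql
-- Cull all backup/restore history for a single database:
exec msdb.dbo.sp_delete_database_backuphistory @database_name = 'MyDatabase';

-- Cull all backup/restore history older than a given date, across all databases:
exec msdb.dbo.sp_delete_backuphistory @oldest_date = '2007-01-01';
```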
A couple of days ago I was trying to tidy up this restore history for one of our DR instances. I wrote a tiny script using dbo.sp_delete_backuphistory to delete one day's worth of history at a time (but only if there was at least three months' history still left), and I scheduled it to run via SQL Agent every half hour (I did it in such small chunks to give the log shipping restore jobs a fair go at these history tables, so they didn't get all jammed up waiting for the history culling to finish). This was an OK approach but it was sooooo sloooooow. I took the T-SQL code and put in a few little debug statements so I could time how long the operation was taking, how many rows it was deleting and the rate at which rows were being deleted. Fairly simple stuff. I ran it a few times and, in my case, the deletion rate was about 0.4 - 0.5 rows per second. Appalling! That's about 1 hr 20 min to delete 2,500 rows (which was about one day's worth of history on this SQL instance a year and a half ago).
So I did what I always do when troubleshooting/optimising query performance: I turned on "set statistics io", "set statistics time" and the actual execution plan. When I ran it again it took just as long, but now I could see everything it was doing. There were three main chunks to the batch, and they accounted for about 99% of the cost of the query. The first bit was deleting from dbo.restorehistory (and in my case doing about 1 million logical IOs against the dbo.restorefile table). The second bit was deleting from dbo.backupmediafamily (and doing about 260 million IOs against dbo.backupset). The third main bit was deleting from dbo.backupmediaset (and doing about 340 million IOs against dbo.backupset). Obviously there was something obscenely wrong with the access methods against these tables. I checked the execution plans and could see table scans and clustered index scans against these tables in the relevant sections of the stored proc. After about 1 minute of analysing the plans I decided to create 2 nonclustered indexes. They took about 5-10 seconds to create. When I ran the batch again, it deleted another day's worth of history in 8 seconds. For the next minute or two I gleefully kept running the batch over & over again, deleting a day's history in a few seconds each time. The deletion rate had gone from 0.5 rows per second on average to about 1,500 rows per second on average. That's a performance increase factor of 3,000! And all by just creating 2 simple indexes that took about 10 seconds to do. Yippee!!
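For anyone wanting to reproduce the measurements, the instrumentation is just the standard session settings, plus "Include Actual Execution Plan" turned on in SSMS:

```sql
set statistics io on;   -- per-statement logical/physical reads, per table
set statistics time on; -- per-statement CPU and elapsed times
-- ...run the history deletion batch here...
set statistics io off;
set statistics time off;
```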
From memory this was exactly the same problem I found with SQL Server 2000 about 5 years ago. Good to see Microsoft had discovered this issue themselves and taken steps to rectify it between versions. (Please excuse my sarcasm.) I'd be interested to know if they've added those indexes to SQL 2008 or changed the history culling process to avoid the issue.
For those who are curious about the T-SQL statements I was running to do this stuff (please excuse the Blogger formatting of the code fragments)...
This is the T-SQL batch to delete the oldest day of backup/restore history (with a little debug info built in):
set statistics io off;
set statistics time off;

declare @retention_date_cutoff datetime;
declare @msg varchar(1000);
declare @start_time datetime;
declare @start_rows int;
declare @deleted_rows int;
declare @deletion_time int; -- seconds

-- Find the cutoff that trims the oldest day of history, but only if at
-- least three months' history would still remain afterwards.
select @retention_date_cutoff =
    dateadd(dd,
        datediff(dd, current_timestamp, min(backup_start_date)) + 1,
        current_timestamp)
from msdb.dbo.backupset
where backup_start_date < dateadd(mm, -3, current_timestamp);

if @retention_date_cutoff is not null
begin
    select @start_time = current_timestamp, @start_rows = count(*)
    from msdb.dbo.backupset;

    exec msdb.dbo.sp_delete_backuphistory @retention_date_cutoff;

    select @deleted_rows = @start_rows - count(*) from msdb.dbo.backupset;
    select @deletion_time =
        case
            when @start_time = current_timestamp then 1 -- avoid divide-by-zero below
            else datediff(ss, @start_time, current_timestamp)
        end;

    select @msg = 'Deleted backup history older than '
        + convert(varchar(15), @retention_date_cutoff, 106)
        + '. ' + cast(@deleted_rows as varchar(7)) + ' rows'
        + ' in ' + cast(@deletion_time as varchar(5)) + ' seconds'
        + ' (' + cast(cast(@deleted_rows as float) / @deletion_time
            as varchar(8)) + ' rows/sec)';
    print @msg;
end
else
    print 'No retention date to use';
And this is the T-SQL to create the 2 necessary indexes to hit the nitrous oxide button:
use msdb;
go
create nonclustered index IX_backupset_media_set_id on dbo.backupset (media_set_id);
create nonclustered index IX_restorefile_restore_history_id on dbo.restorefile (restore_history_id);
go
Easy as! And it makes for much happier DBAs who are trying to tidy up old backup/restore history for their log shipping secondary DR instances in less than a few decades.