The SQL Dude!

On-Disk & In-Memory Pages that teach you the SQL way of life!

Data Collector (Management Data Warehouse) – Issues, Tips and Solutions

Posted by Sudarshan Narasimhan on August 3, 2011


Somehow I ended up working on data collector issues quite a bit in the last 6 months or so. It’s certainly a useful feature but not an easy one to work with and definitely not an easy one to troubleshoot for you DBA’s. I’m listing down some issues that I faced and solutions for these issues. If you come up with or face any new issues in data collector, please drop a mail to sudarn.

1. Data Collector Upload Job Timeouts

The Data Collector jobs were getting timeout errors on for data upload job ONLY AT THE TIME WHEN PURGE JOB IS RUNNING. You notice that once the purge job completes, the data upload job also starts succeeding, but till that time it simply keeps failing. Here is what you would see in job history for these upload jobs.

02/08/2011 05:25:00,collection_set_3_upload,Error,0,SERVERXYZ\INSTANCEXYZ,collection_set_3_upload,(Job outcome),,The job failed. 
The Job was invoked by Schedule 2 (CollectorSchedule_Every_5min).  The last step to run was step 2 (collection_set_3_upload_upload).,01:00:01,0,0,,,,0

02/08/2011 05:25:00,collection_set_3_upload,Error,2,SERVERXYZ\INSTANCEXYZ,collection_set_3_upload,collection_set_3_upload_upload,,Executed as user: STARWARS\Yoda
The thread "ExecMasterPackage" has timed out 3600 seconds after being signaled to stop.  Process Exit Code 259.  The step failed.,01:00:01,0,0,,,,0

Since we know that the purge job running was the only time when these uploads were failing, we have a simple solution

Solution:
Schedule the purge and upload to run at different schedules. You can use the SSMS UI to define a new schedule for the collection set, just make sure it doesn’t fall under the schedule of the purge job.

 

2. Data Collector Upload Job Deadlocks intermittently

The collection set upload job is running into deadlocks now and then (aka intermittent). This is again related to the purge jobs. Why?

There have been multiple reports of this issue on Connect & MSDN Forums and I’ve had the “pleasure” of talking to customers about this issue. Here are some,

Deadlock in MDW Upload Purge Logs Job Step
http://connect.microsoft.com/SQLServer/feedback/details/539547/deadlock-in-mdw-upload-purge-logs-job-step

Management Data Warehouse Data Collector upload job deadlocks
http://connect.microsoft.com/SQLServer/feedback/details/504557/management-data-warehouse-data-collector-upload-job-deadlocks

Here is a sample output of a failed Upload job that reported the deadlock.

Log Job History (collection_set_3_upload)
Step ID 1
Job Name collection_set_3_upload
Step Name collection_set_3_upload_purge_logs

Message
Executed as user: STARWARS\Yoda. Transaction (Process ID 457) was deadlocked on lock | communication buffer resources with another process and has been chosen as the deadlock victim. Rerun the transaction. [SQLSTATE 40001] (Error 1205). The step failed.

Solution:

Stagger the two collector jobs. E.g. collection_set_2_upload Server Activity and collection_set_3_upload Query Statistics to run a couple of minutes apart. Now, here is a catch! Don’t change the schedule for the SQL Server Agent job, you need to change the schedule using the collector “pick schedule” option in SSMS.

To do this, right click on the collector agent (not the SQL Agent job) select properties and then uploads in the left hand pane. Click on "New" to create a new schedule. If you do it this way you can create two new schedules that are independent of each other.

3. Unable to change/define Schedules for MDW Collection Sets

I did run into another quirky issue when attempting to define a new schedule for the collection sets. i.e I was not able to define a new schedule for this and kept getting this error.

clip_image002[4]

These are the collection sets that are present by default,

  1. Disk Usage
  2. Server Activity
  3. Query Statistics
  4. Utility Information

Of these, the Utility Information is disabled and does not have a schedule defined by default. The reason for the above error was the Collection set “Query Statistics” mentioned above, did not have a valid schedule UID stored. You can confirm this by doing these queries.

1. In the context of MSDB database, run the following query and note down the schedule_uid value for Query Statistics collection set

select * from dbo.syscollector_collection_sets where collection_set_id=3

e.g output. A575FFD0-98A0-4D0E-B43C-B63482FB5B00

2. Again in the MSDB context, run the following,

SELECT schedule_id from sysschedules_localserver_view WHERE ‘XYZ’ = schedule_uid

                — XYZ is the value obtained from step#1. You will see that there is NO value returned for step#2

 

3. As I mentioned, Utility Info has no schedule by default, so we need to use this to get out of this situation. So I used the Utility Information collection set and defined a new schedule for it for every 10 minutes.

4. Next, I queried the schedule_id for this in dbo.syscollector_collection_sets and used this schedule_id to map to the collection set that was failing.

declare @schedule_uid uniqueidentifier

select @schedule_uid = schedule_uid from dbo.syscollector_collection_sets where collection_set_id = 5 – whichever is the Utility Info collection set ID

exec dbo.sp_syscollector_update_collection_set @collection_set_id = 3, @schedule_uid = @schedule_uid — whichever is the Query Statistics collection set ID (change according to the one failing at your end)

 

5. This fixed the issue with Query Stats and I was able to change/define a schedule for that. But, since we created a schedule for Utility Info and you don’t want that to run, I tried to disable it by setting it to “On Demand”. But this failed. Oops!

6. So I enabled the Utility Info collection set and only then did it create a valid job_id for it, but I got another error when trying to remove the schedule.

7. I stopped the collection and then deleted the job manually. To get things back to old state, I updated the metadata using like this,

UPDATE [msdb].[dbo].[syscollector_collection_sets_internal]

SET [collection_job_id] = NULL, [upload_job_id] = NULL

WHERE collection_set_id = 7 – whichever is the Utility Info collection set ID

8. Now, you can stop the Utility Info collection set and also get the other collection sets schedule changed to fix issue #2 mentioned aboveSmile.

 

4. Data Collector Purge Job (Clean-up job) takes a long time to complete

This is actually the root cause of issue #1 and #2 listed above. The purge procedure is complicated and is responsible for cleaning up the metadata tables of old entries. This work is done by the core.sp_purge_data stored procedure. As a troubleshooting step, I captured the execution plan of the procedure and noticed a missing index recommendation in the XML Showplan.

<MissingIndexes>
  <MissingIndexGroup Impact="92.2913">
    <MissingIndex Database="[MDW]" Schema="[snapshots]" Table="[query_stats]">
      <ColumnGroup Usage="EQUALITY">
        <Column Name="[sql_handle]" ColumnId="1" />
      </ColumnGroup>
    </MissingIndex>

If you were to translate this into a CREATE INDEX statement this is how it would look,

CREATE NONCLUSTERED INDEX [Ix_query_stats_sql_handle]

ON [snapshots].[query_stats] ([sql_handle] ASC)

WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF,

DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON)

ON [PRIMARY]

 

Now, creating this requires modifying the MDW system table and of course this isn’t supported! This same thing is mentioned in this blog as well and is this blog post by the SQL Server Development Team. Don’t do it!  Patience, I will explain.

 

Here are some facts:-

1. Purge job can get slow on large MDW databases (40+ GB).

2. The DELETE TOP statement on snapshots.notable_query_plan is the one where most execution time is spent.

 

 

Solution:

Like I mentioned earlier, don’t modify system Stored Procedure code unless guided by Microsoft Support. Luckily, for the slow purge procedure a fix has been released in SQL Server 2008 R2 Service Pack 1, which can be downloaded here.

 

This fix updates the Purge procedure TSQL code and the purge has been broken down and re-written in an optimized way. The runtime will come down drastically once you update to SP1. The new procedures doing the purge is called “[core].[sp_purge_orphaned_notable_query_plan]” and “[core].[sp_purge_orphaned_notable_query_text]”

 

Hang on, it’s not over yet!

 

 

AFTER you apply SP1, you will need to modify the Stored Procedure sp_purge_orphaned_notable_query_text as shown below. These changes are required to correct the text of the sp_purge_orphaned_notable_query_text stored procedure because the delete statement incorrectly references the snapshots.notable_query_plan table after you apply Service Pack 1.

 

 

SP1 Code

        — Deleting TOP N orphaned rows in query plan table by joining info from temp table variable

        — This is done to speed up delete query.

        DELETE TOP (@delete_batch_size) snapshots.notable_query_plan

        FROM snapshots.notable_query_plan AS qp , #tmp_notable_query_plan AS tmp

        WHERE tmp.[sql_handle] = qp.[sql_handle]

 

Change this to following once you apply SP1

— Deleting TOP N orphaned rows in query text table by joining info from temp table
       
— This is done to speed up delete query.

       
DELETE TOP (@delete_batch_size) snapshots.notable_query_text 
        
FROM snapshots.notable_query_text AS qt, #tmp_notable_query_text AS
tmp
       
WHERE tmp.[sql_handle] = qt.[sql_handle]

 

Hopefully, this code change will be included in a future cumulative update post-SP1, so that you don’t have to manually change the code. The same applies for SQL Server 2008 as well, where I am hopeful these changes will be included in a future Service Pack. This fix mentioned above is at present once valid for SQL Server 2008 R2 (as of Aug 2, 2011 when I wrote this). With these, the slow purge issues should be put to bed, once and for all!


UPDATE (August 3rd, 2011)

After working with our KB team, we have published an official KB article that talks about this issue. For all those running into slow purge issues, please follow the resolution given in this KB article,

FIX: Data Collector job takes a long time to clear data from a MDW database in SQL Server 2008 R2
http://support.microsoft.com/kb/2584903

 

Other Useful Links

FIX: The Management Data Warehouse database grows very large after you enable the Data Collector feature in SQL Server 2008
http://support.microsoft.com/kb/970014

Data Collector’s sp_purge_data may run for hours if there are many orphaned plans to purge
http://blogs.msdn.com/b/sqlagent/archive/2011/04/12/data-collector-s-sp-purge-data-may-run-for-hours-if-there-are-many-orphaned-plans-to-purge.aspx
Advertisements

8 Responses to “Data Collector (Management Data Warehouse) – Issues, Tips and Solutions”

  1. Praveen said

    Sudarshan,

    Thanks for your detailed information on MDW. I do have one quick questions. How does the snapshots.query_stats table get purged? we are collecting close to 30 SQL Server. This table size is 70GB.

    Please help me.

    Regards
    Praveen

    • Sudarshan Narasimhan said

      Hi Praveen,

      When you setup MDW, by default a purge job is created (e.g. mdw_purge_data_[MDW] would be the job name). This job by default Runs every day and it calls the stored procedure core.sp_purge_data to cleanup entries based on the retention interval you have configured. snapshot.query_stats is one of the tables this job purges data from. If you are running into disk space issues with old data, have a look at the 2 KB articles I have mentioned in the post. if you are using a SQL 2008 R2 instance as MDW repository, then I would suggest you apply Service Pack 1 and the latest CU since the logic for the purge SP has changed to fix some cleanup and blocking issues. If you are on SQL2008, have a look at KB 970014 which talks about similar issue.

      In fact you can even run the purge procedure specifying a particular retention value using the parameter @retention_days.
      use
      EXECUTE core.sp_purge_data @retention_days = 7

      Make sure that you run this during less activity time and not during any collection upload set. HTH.

      -Sudarshan

  2. […] in August 2011, I had posted quite a lengthy blog post on Management Data Warehouse (Data Collector). I made an update to that post today to include a new […]

  3. Mike T. said

    Hi Sudarshan,

    Thank you for all this very useful information. At my workplace we are using SQL Server 2008 x64 SP3. Do you know how I might be able to fix these types of issue in that environment?

    I had to setup the actual Data Collector on our VMware SQL Server install (same version), to collect stats from our live/production one. So I thought the VM slowness might be contributing to this issue? I’ve found that when looking at the SQL Agent Job Activity Monitor on the live/production server, that collection_set_2_collection & collection_set_3_collection jobs seem to be continually executing (running) and their last run shows ‘cancelled’ and the next run shows ‘not scheduled’. Also, the history for those jobs seems to show blank (maybe because they’re running?) The errors that I’m sometimes seeing in the Windows Server 2008 R2 SP1 Event Viewer – Application log are:

    1)
    SQL Server Scheduled Job ‘collection_set_2_upload’ (0x7E0AB3C29FDB2840B0F1F85F020F75D0) – Status: Failed – Invoked on: 2012-04-11 12:00:00 – Message: The job failed. The Job was invoked by Schedule 4 (CollectorSchedule_Every_15min). The last step to run was step 2 (collection_set_2_upload_upload

    2)
    SQL Server Scheduled Job ‘collection_set_3_upload’ (0x7CF2CE6F317F6E4B80B2F135AC53EECF) – Status: Failed – Invoked on: 2012-04-11 12:00:00 – Message: The job failed. The Job was invoked by Schedule 5 (CollectorSchedule_Every_30min). The last step to run was step 2 (collection_set_3_upload_upload).

    My schedules set on the live/production server’s Data Collection are:

    Disk Usage: every 6 hrs.
    Query Statistics: Cached
    Server Activity: Cached

    I could not find ‘sp_purge_orphaned_notable_query_text’ anywhere on the 2008 x64 databases. I did find ‘core.sp_purge_data’ on the VM’s data collection database’s stored procedure though. Seems similar to ‘sp_purge_orphaned_notable_query_text’ but not exactly the same.

    I will say though, that I have not really been using the reports for Query Statistics or Server Activity much yet, but of course the Disk Usage is such a useful tool, that I don’t know how DBA’s get by without it!

    Any help would be greatly appreciated,

    Mike

  4. […] by Sudarshan Narasimhan on June 5, 2012 I had posted on this topic before in my previous blog. SQL Server 2008 is also affected by slow purge procedure core.sp_purge_data as well as deadlocks […]

  5. Eran said

    We are hitting a strange issue on 2008R2, that by increasing the retention days of the “Query Statistics” to more than 30 days, statistics are not being collected any more. Even if you perform “collect and upload now” manually. We see this both in the reports (empty) and the snapshot table shows it stop inserting rows (snapshots.query_stats). Restarting the collection or the whole instance does not help.
    Any ideas?

  6. Demetrius said

    Cheers ! very useful article! Thanks a lot!

  7. Diane Trcka said

    Do you know an easy way to move your MDW repository from one server to another?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: