The SQL Dude!

On-Disk & In-Memory Pages that teach you the SQL way of life!

Posts Tagged ‘search’

How to identify the row or document that failed to index during a fulltext index population

Posted by Sudarshan Narasimhan on July 6, 2012


I wanted to write this post for a long time. If any of you out there are using full-text indexes, especially on those tables what contains documents like Office Documents (doc/docx/xls/xlsx) or Adobe PDF files, then you might have come across scenarios when some of the documents (or) rows do not get indexed. Usually as an end-user or developer or DBA, you typically come across these situations when you do not find the document you are searching for in your website or application. A good example of this would be a Job Database like Naukri or Monster, where you search for candidate resume/profile based on a set of keywords containing skillset or languages known.

For E.g. Lets say, my search string is “TSQL 4 years experience”. When you know there are documents that match this and they don’t show up in the search results, that’s when you start to look into the full-text index and its population.

A thumb rule in FTS; A row in table is called as a document (DocID) in a full-text index. Any properties for the full-text index in SSMS or anywhere else, always refers to rows as documents. Like Documents processed, Documents indexed, Documents failed etc. Just keep this in mind. Even if you have just text data (varchar/char etc), the row is still called a document. The role of this DocumentID/DocID is to map a full-text key value in the full-text index to the corresponding row in the base table. A full-text key value is in varbinary(900) format. So for us to find the corresponding row, we need to convert the hex string to the native format of your full-text key.

Sample Scenario

1. Assume I have a table called dbo.tblCandidateProfiles that has a primary key on the candidate_id column which is of INT data-type.

2. I have a full-text index created on this table to search columns like candidate_name, candidate_resume etc.

3. Now, when you create a full-text index, you will get an option in SSMS GUI to choose the full-text Key Column (TableFulltextKeyColumn). Many people don’t pay close attention to this column, because the GUI usually picks the PK Index or any Unique Index present on the base table.

The reason this is important is because, SQL Server uses the KEY INDEX (of which the KEY COLUMN is a part) to uniquely identify a row in the base table, for full-text index operations. So if any document/row succeeds or fails to get indexed, it can be uniquely identified using the full-text index key value (which is based on the Key Index chosen during the creation of the FTIndex).

Lets look at an example. Note: This test was done on SQL Server 2008 R2 SP1


Table Definition

CREATE TABLE [dbo].[tblCandidateProfiles](
    [Candidate_ID] [int] IDENTITY(1,1) NOT NULL,
    [Name] [varchar](100) NOT NULL,
    [Resume] [varbinary](max) NOT NULL,
    [DocType] [nvarchar](20) NOT NULL,
    [DocName] [nvarchar](100) NOT NULL,
 CONSTRAINT [PK_tblCandidateProfiles] PRIMARY KEY CLUSTERED 
(
    [Candidate_ID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

My table above has a PK on the candidate_ID column which is an IDENTITY column, so it is of type INT. I create the following full-text catalog and full-text index using the PK_CandidateProfiles as my KEY INDEX.

Full Text Index/Catalog Definition

-- Full Text Catalog Creation DDL
CREATE FULLTEXT CATALOG [Catalog_Resume_Search]WITH ACCENT_SENSITIVITY = OFF
AUTHORIZATION [dbo]
go

-- Full Text Index Creation DDL 
CREATE FULLTEXT INDEX ON [dbo].[tblCandidateProfiles]( 
[DocName] LANGUAGE [English], 
[DocType] LANGUAGE [English], 
[Name] LANGUAGE [English], 
[Resume] TYPE COLUMN [DocType] LANGUAGE [English]) 
KEY INDEX [PK_tblCandidateProfiles]ON ([Catalog_Resume_Search], FILEGROUP [PRIMARY]) 
WITH (CHANGE_TRACKING = AUTO, STOPLIST = SYSTEM) 
GO 

Okay, now that things are setup we can go ahead and insert a single document into this table.

Inserting 1 Sample Row/Document into the base table

SELECT 'TheSQLDude' AS Name, '.docx' AS doctype, 'TheSQLDude_Resume1.docx' AS docname,       
      * FROM OPENROWSET(BULK N'C:\MyResume.docx', SINGLE_BLOB) AS Resume

I used the OPENROWSET function to directly insert the word document as BLOB into the base table, since the datatype of the Resume column is varbinary(max) meaning it’s for storing BLOB data. This is possible starting with SQL Server 2005 onwards. If things go well, then you should be able to search for keywords based off this document. Like this for example,

select candidate_id, Name, DocName, Resume from
dbo.tblCandidateProfiles where contains (Resume, 'T-SQL')

image

As you can see in the above output, the document was successfully indexed and hence we are getting search results. If you happen to look into the SQLFT logs, which by the way are in the format SQLFT<dbid><catalog_id>.log, so in my case I looked into SQLFT0000800009.LOG, here is what I found for my 1 row that I inserted recently.

2012-07-06 01:21:26.14 spid38s     Informational: Full-text Auto population completed for table or indexed view ‘[test].[dbo].[tblCandidateProfiles]’ (table or indexed view ID ‘475148738’, database ID ‘8’). Number of documents processed: 1. Number of documents failed: 0. Number of documents that will be retried: 0.

So far things are good. But what happens if certain document types (like .PDF) or certain documents don’t return you search results, because they failed to get indexed. How do you find which document failed, if your table contained thousands of documents. Fear not, this is exactly where the full-text key values comes into the picture.

Identifying documents that failed to get indexed

Let me insert a junk document that I know will fail, into the same table mentioned above. I am taking a sample JPG Image file to show how this works, since there is no point in full-text indexing an image, because you can’t search inside an image. Duh!

INSERT INTO dbo.tblCandidateProfiles(Name, DocType, DocName, Resume) 
   SELECT 'JunkImage' AS Name, '.jpg' AS doctype, 'Heaven.jpg' AS docname,       
      * FROM OPENROWSET(BULK N'C:\heaven.jpg', SINGLE_BLOB) AS Resume

Once the row is inserted, lets open the same full-text log file SQLFT0000800009.LOG (DBID=9, Catalog_ID=9)

2012-07-06 01:20:13.12 spid38s     Warning: No appropriate filter was found during full-text index population for table or indexed view ‘[test].[dbo].[tblCandidateProfiles]’ (table or indexed view ID ‘475148738’, database ID ‘8’), full-text key value ‘2’. Some columns of the row were not indexed.

As you can see from the highlighted snippet above, that the 2nd row I inserted (junk JPG image) did NOT get indexed, since there was no Filter present for the JPG type. In the SQL FT Logs it will only give you the full-text key value which failed. In the above case, the key value = 2.

Let’s look back to the original KEY INDEX that I had defined. Since it was an INT data type, the key value is also displayed as an INT value. To find the row/document, you just have to look-up the row with candidate_id=2 and you got the document that did not get indexed.

Things get a little more complicated when you have some char/varchar or composite index defined as the PK or Unique KEY index. Let’s look at that scenario

Identifying documents that failed to get indexed (KEY INDEX is on column with CHARACTER data-type).

I altered the table with 1 change. Dropped the identity column and changed the data-type for the Candidate_ID to varchar(50). This still remains the column on which the Primary Key index exists. Then, I re-created the full-text index on this table.

[Candidate_ID] [varchar](50) NOT NULL,

I inserted 1 document and here is what I see in the full-text log

INSERT INTO dbo.tblCandidateProfiles(Candidate_ID, Name, DocType, DocName, Resume) 
   SELECT 'FBD367D1-642F-47CF-B79B-9BE48FB34007' AS Candidate_ID, 'TheSQLDude' AS Name, '.jpg' AS doctype, 'JunkDocument' AS docname,       
      * FROM OPENROWSET(BULK N'C:\heaven.jpg', SINGLE_BLOB) AS Resume

2012-07-06 01:51:21.36 spid33s     Warning: No appropriate filter was found during full-text index population for table or indexed view ‘[test].[dbo].[tblCandidateProfiles]’ (table or indexed view ID ‘875150163’, database ID ‘8’), full-text key value ‘FBD367D1-642F-47CF-B79B-9BE48FB34007’. Some columns of the row were not indexed.

Notice the random GUID value reported as failed. That is the value of my candidate_ID and since this was my KEY INDEX it is also reporting the same value in the FTLOG as having failed. So I just need to lookup the base table for rows that have the value of candidate_id = FBD367D1-642F-47CF-B79B-9BE48FB34007, and I will get the document that failed to get indexed. Life is still good with character columns, since SQL Server reports the same value in the log.

Another example of how easy this is with SQL 2008 R2

2012-07-06 02:05:03.66 spid33s     Error ‘0x8004170c: The document format is not recognized by the filter.’ occurred during full-text index population for table or indexed view ‘[test].[dbo].[fts_table2]’ (table or indexed view ID ‘1083150904’, database ID ‘8’), full-text key value ‘33377ver3.ppt’. Failed to index the row.


2012-07-06 02:05:04.66 spid33s     Informational: Full-text Auto population completed for table or indexed view ‘[test].[dbo].[fts_table2]’ (table or indexed view ID ‘1083150904’, database ID ‘8’). Number of documents processed: 1. Number of documents failed: 1. Number of documents that will be retried: 0.

Identifying documents that failed to get indexed in SQL Server 2005

Prior to SQL Server 2008, the full-text key value being reported in the SQLFT* logs would be in varbinary(900) format, so one you have to convert the value to make sense out of it. We still have to do this on SQL Server 2005. Here is how you would do it in SQL Server 2005.


2012-07-06 02:33:29.95 spid22s     Error ‘0x8004170c: The document format is not recognized by the filter.’ occurred during full-text index population for table or indexed view ‘[test].[dbo].[fts_table]’ (table or indexed view ID ‘2073058421’, database ID ‘5’), full-text key value 0x506F776572706F696E7446696C652E707074. Attempt will be made to reindex it.

2012-07-06 02:33:29.95 spid22s     The component ‘offfilt.dll’ reported error while indexing. Component path ‘C:\WINDOWS\system32\offfilt.dll’.

2012-07-06 02:33:49.92 spid22s     Warning: No appropriate filter was found during full-text index population for table or indexed view ‘[test].[dbo].[fts_table]’ (table or indexed view ID ‘5’, database ID ‘2073058421’), full-text key value 0x496D61676546696C652E6A7067. Some columns of the row were not indexed.

2012-07-06 02:35:53.81 spid22s     Error ‘0x8004170c: The document format is not recognized by the filter.’ occurred during full-text index population for table or indexed view ‘[test].[dbo].[fts_table]’ (table or indexed view ID ‘2073058421’, database ID ‘5’), full-text key value 0x446F63756D656E7446696C652E646F63. Attempt will be made to reindex it.

2012-07-06 02:35:53.81 spid22s     The component ‘offfilt.dll’ reported error while indexing. Component path ‘C:\WINDOWS\system32\offfilt.dll’.

In the above example, how would you identify the rows in the base table based on these values in the SQLFT logs (0x506F776572706F696E7446696C652E707074, 0x496D61676546696C652E6A7067, 0x446F63756D656E7446696C652E646F63) ???

I mentioned earlier that the full-text key values are in varbinary(900) format, so lets convert these values depending on what the data-type of the key column is (which is part of the Key index).

1. Identify the full-text key column on your table and its data-type using a query like given below

select a.name as [KeyColumn], b.name as [DataType], a.prec  as [Precision] from sys.syscolumns a
join sys.types b
on a.xtype = b.system_type_id
where colid=objectproperty(object_id('AllDocStreams'),'TableFulltextKeyColumn')
and id=object_id('AllDocStreams')

image

2. As you can see in the above output, the key column in on the column [docname] and it’s of type varchar(50). Before I show you the conversion part, here is the data in my table fts_table.
image
3. The Conversion Part.
  1. Full-Text Key Value is in Varbinary(900) format
  2. The Table Key Column is in Varchar(50) format
  3. Converting using the CONVERT() function, we see that 0x506F776572706F696E7446696C652E707074 translates to “PowerpointFile.ppt”. This is the document that failed to get indexed
declare @varbinary varbinary(900)
set  @varbinary =0x506F776572706F696E7446696C652E707074 
select convert(varchar(50),@varbinary)

 

image

4. If you have a lot of these types of indexing errors, make a list of the full-text key values reported in the SQLFT logs and you can directly query the base table using a query like given below to identify ONLY THOSE ROWS that failed to get indexed. You can add anything combination of unique columns you want to the SELECT list, but the where clause must be filtered on the full-text Key column identified in Step #1, which is “docname” in this example.

select docname, convert(varbinary(900),docname) as [Converted-FullTextKeyValue]
from fts_table where docname 
in (convert(varbinary(900),0x506F776572706F696E7446696C652E707074),
convert(varbinary(900),0x496D61676546696C652E6A7067),
convert(varbinary(900),0x446F63756D656E7446696C652E646F63) )

image

What Next, once you have identified the rows that failed to get full-text indexed?

Once you have identified a bunch of rows that failed to index, you can start to find out if there are any common patterns among these documents. The things to look out for are:-

  1. Do all the documents that failed have a common file-type. e.g. Did all .PDF or .DOCX files fail?
  2. Do these documents have the appropriate IFilters installed? e.g. Do we have a filter for .JPG present on the machine?
  3. Are all these failed documents from a single source/application?
  4. Is the source of all these documents a single computer/document library?

Life is even more easier with SQL Server 2008 onwards, since the entire procedure described above is done by a system stored procedure called sp_fulltext_keymappings

DECLARE @table_id int = OBJECT_ID(N'dbo.tblCandidateProfiles');
EXEC sp_fulltext_keymappings @table_id;

The [key] column returned by this system stored procedure is the Converted full-text key value which you will see in the SQLFT* log.

So now you know how to identify the rows/documents that failed to get indexed. In my next post on full-text, I will cover some of the common reasons for indexing failures and how to go about troubleshooting those. TTFN.

 

Advertisement

Posted in FTS (Full-Text Search) | Tagged: , , , , , , , , , | 4 Comments »

MSDN Webcast on Optimizing and Tuning SQL Server full-text Search (FTS)

Posted by Sudarshan Narasimhan on October 4, 2011


 

In May 2011, I had delivered a live webcast on the topic “Optimizing and Tuning Full Text Search for SQL Server”. You can download the slide deck (PPT) and Video from MSDN Event library. I am sharing it here since the webcast was before the birth of theSQLDude. Refer to the links below for downloading the presentation deck and webcast video.

Note: This talk covers using Full-text search only on SQL Server 2008 and above. Nerd smile

Watch the Webcast

SQL Server Full-Text Search: Tips for Optimizing and Tuning Search for large environments from sudarshan on Vimeo.

 

Presentation Slide Deck (PPT)

You can download the slide deck along with the sample/demo script from [HERE]. It is also available on [THIS POST] @ SQLServerFAQ.

Posted in Webcast | Tagged: , , , , , , , , , , | Leave a Comment »