The SQL Dude!

On-Disk & In-Memory Pages that teach you the SQL way of life!

Archive for the ‘FTS (Full-Text Search)’ Category

Identifying Documents & Document Types that are NOT present in the full-text index

Posted by Sudarshan Narasimhan on August 29, 2012


I recently worked on this scenario with a customer of mine. They had a DOCS table which contained resume’s and CV’s of their candidates. This DOCS table was full-text indexed. They noticed that their search results in the application was not showing all the documents when they performed keyword based search. On looking into the full-text index and following my previous post on this topic @ https://thesqldude.com/2012/07/06/how-to-identify-the-row-or-document-that-failed-to-index-during-a-fulltext-index-population/ , we found out that many .DOC and .HTML documents have failed, hence these don’t show up in the search results.

Now, dbo.DOCS is a huge table and it contained more than a million rows. 1190688 to be exact. So to find out that documents that need to be replaced, I did the following steps.

GOAL

1. Identify all the document types and the count of documents of that type, which are not a part of the full-text index

2. Identify each DocumentID that is not a part of full-text index, so that this can be corrected from the application and thus re-index via fulltext population.

Luckily for me this was SQL Server 2008 R2. Starting with SQL Server 2008 there are 2 useful DMV’s that contain information about the full-text index.

a) sys.dm_fts_index_keywords_by_document

b) sys.dm_fts_index_keywords

These views were not present in SQL 2005. Using the DMV sys.dm_fts_index_keywords_by_document we can get to find the Document that is present in the FTIndex. If a document is present, it means it has been indexed successfully.

Table Structure

[dbo].[Docs]
    [DocId] [int](50)
    [DocName] [varchar](400)
    [DocType] [varchar](200) 
    [DocCategory] [varchar](50)
    [Document] [varbinary]

sys.dm_fts_index_keywords_by_document
    [document_id]  [int]
    [occurrence_count]   [int]
    [display_term]   [nvarchar (8000)]

The Document_id column in the DMV corresponds to the full-text key value of the table. In this case the table dbo.Docs has the full-text key column as [DocId]. So, basically the Docs.dbo (DocId) is the same as sys.dm_fts_index_keywords_by_document (document_id). If you want to find the full-text key column of a table, you can use a query like this within the context of the user database.

SELECT COL_NAME(object_Id('dbo.Docs'),OBJECTPROPERTY(OBJECT_ID('dbo.Docs'), 'TableFulltextKeyColumn'))

 

Identifying Indexed Documents, Failed Documents and Document Types

1. Find the count of documents that are successfully indexed

drop table #fts_index_document_id
CREATE table #fts_index_document_id (docid int)
insert into #fts_index_document_id
SELECT distinct document_id FROM sys.dm_fts_index_keywords_by_document(db_id('ContentDB'), object_id('dbo.Docs'))
go
select count(*) as [Successful Document Count] 
from dbo.docs_tbl_data t1
join dbo.fts_index_document_id t2
on t1.DocID = t2.DocID

I found that 979415 rows were returned from the above query.

Successfully Indexed Documents = 979415

2. Find the count of documents that are NOT successfully indexed, i.e. Failed Documents

select count(*) as [Failed Document Count]
from dbo.Docs t1
where t1.docid not in ( select t2.docid from #fts_index_document_id t2)

I found that 211273 rows were returned from the above query, which is basically a NOT IN filter where each document_id present in the base table (dbo.Docs) is checked in the FTindex (#fts_index_document_id) and if not present then it is a failed document.

Failed to Index Documents = 211273

Okay, so I did a count of the total rows present in the table dbo.Docs, and that came up to 1190688

Total Documents 1190688
Successfully Indexed Docs 979415
Failed to Index Docs 211273

So, the count is tallying up fine.

3. Find the count of each document type and the number of documents of that type that has failed to be indexed.

The idea here is basically to see which document types/formats are the ones that have failed to get indexed. You might notice patterns when finding this out, like all PDF docs have failed or something similar. To find out, I used this query

select DocType, COUNT(1) as [Count] from dbo.Docs
where DOCID NOT IN (select docid from #fts_index_document_id) 
group by DocType
order by COUNT(1) desc

image

As we can see from the above output, most of the failed documents are of type HTML and DOC.

4. Find each individual document, document name of a specific document type has failed to be indexed

Now that we have the breakdown of document types that have failed, we need to identify each unique document in the table of a specific type (say .DOC) that isn’t indexed, so that it can be corrected from the application side by re-inserting the document.

I am using the same # temp table which contains my indexed documents in the FTIndex. The reason I am using a #tmp table is because the DMV is slow as it scans through the full-text index and we don’t have to read the index each time. So I’m storing the output in a #tmp table and re-using that for all my queries. I would recommend you follow this because if the FTIndex is large, then querying the DMV each time isn’t going to be fast nor good for performance.

select t3.DocId, t3.Doctype, t3.DocName, t3.DocCategory
from dbo.Docs t3
where t3.DocID not in (
select t1.DocID from dbo.Docs t1
join #fts_index_document_id t2
on t1.DocID = t2.DocId)
and t3.DocType = 'HTML'
order by t3.DocId asc

The output of the above query will return each document that has failed to be indexed (not a part of the full-text index) and of type HTML. You can change the filter clause highlighted to suite your needs. See sample output below for a list of failed documents of type ‘DOC’.

image

Since you now the Document names, Document ID, Document type, you can get in touch with your application team to have a look at these documents and verify these are valid documents. If possible they can be reinserted back from the application into the table dbo.Docs.

NOTE

The thing with document indexing is that, all of these documents are first stored on the file-system and usually there is an application involved (any document management application) that takes care of inserting these documents into the database as BLOB data (varbinary, text, image). Usually its not inserted directly, but via some application. The application has to programmatically read the document as BYTE or Stream and then pass this to the database insert statement. There are chances that the application code has issues for certain documents, like maybe it missed a BYTE value or header. You will notice this scenario, when a particular document opens up fine using Office Client but the same document does not get indexed with this error

Error ‘0x8004170c: The document format is not recognized by the filter.’ occurred during full-text index population

When you see this error, keep in mind the application that is actually taking the document and inserting it into the database. Whichever application is being used to insert the documents is not doing it correctly and the document is actually not valid for indexing. The IFilter for Office Documents is very strict on the file format. It will not even process other file formats that Word is able to open such as RTF. Opening the file with Office Client is very different to opening the stream for the IFilter. You will see this behaviour yourself when you open/save the same file without changes, Office Client adds other changes and filesize increases. It does make changes to the file header and this is a key point!

 
You may want to take a closer look to see how these DOC files are getting inserted into SQL Server. I’ve seen that even a small byte missed here or there can render the file invalid (in the eyes of the IFilter). I’ve seen this happen for XLS/PPT/DOC file formats. If you face this scenario, try and find answers to the following questions.

Question to Think about:-

1. What is the application used to insert these documents into the database?

Find the application vendor or support team and ask them if they have any application updates or any issues reported of specific documents types failing to get indexed

2. When were these failed documents inserted. Try and find the source date/year when these were added to the table (maybe a datetime column exists in the table?)

It is possible that all of the failed documents originated from a specific date and time. Usually this is the case for older file formats like DOC/PPT/XLS compared to the new Metro Office Formats like DOCX etc.

It is also possible that at the time these were inserted, the application performing the insert had some issues which were later fixed

3. Are there different versions of this application available?

Same as #2. If the application has later fixes, it doesn’t change the fact that the table contains BLOB value inserted using an older version of the application. Re-inserting the same documents using the newer version of the application is a step to take.

4. Are they facing failure with new documents being inserted today?

This will tell you if the issue is with older documents already present in the table or even with newly added documents. Again, we are just narrowing down the source of the problem.

 

Advertisements

Posted in FTS (Full-Text Search), T-SQL | Tagged: , , , , , , , , , , , | Leave a Comment »

How to identify the row or document that failed to index during a fulltext index population

Posted by Sudarshan Narasimhan on July 6, 2012


I wanted to write this post for a long time. If any of you out there are using full-text indexes, especially on those tables what contains documents like Office Documents (doc/docx/xls/xlsx) or Adobe PDF files, then you might have come across scenarios when some of the documents (or) rows do not get indexed. Usually as an end-user or developer or DBA, you typically come across these situations when you do not find the document you are searching for in your website or application. A good example of this would be a Job Database like Naukri or Monster, where you search for candidate resume/profile based on a set of keywords containing skillset or languages known.

For E.g. Lets say, my search string is “TSQL 4 years experience”. When you know there are documents that match this and they don’t show up in the search results, that’s when you start to look into the full-text index and its population.

A thumb rule in FTS; A row in table is called as a document (DocID) in a full-text index. Any properties for the full-text index in SSMS or anywhere else, always refers to rows as documents. Like Documents processed, Documents indexed, Documents failed etc. Just keep this in mind. Even if you have just text data (varchar/char etc), the row is still called a document. The role of this DocumentID/DocID is to map a full-text key value in the full-text index to the corresponding row in the base table. A full-text key value is in varbinary(900) format. So for us to find the corresponding row, we need to convert the hex string to the native format of your full-text key.

Sample Scenario

1. Assume I have a table called dbo.tblCandidateProfiles that has a primary key on the candidate_id column which is of INT data-type.

2. I have a full-text index created on this table to search columns like candidate_name, candidate_resume etc.

3. Now, when you create a full-text index, you will get an option in SSMS GUI to choose the full-text Key Column (TableFulltextKeyColumn). Many people don’t pay close attention to this column, because the GUI usually picks the PK Index or any Unique Index present on the base table.

The reason this is important is because, SQL Server uses the KEY INDEX (of which the KEY COLUMN is a part) to uniquely identify a row in the base table, for full-text index operations. So if any document/row succeeds or fails to get indexed, it can be uniquely identified using the full-text index key value (which is based on the Key Index chosen during the creation of the FTIndex).

Lets look at an example. Note: This test was done on SQL Server 2008 R2 SP1


Table Definition

CREATE TABLE [dbo].[tblCandidateProfiles](
    [Candidate_ID] [int] IDENTITY(1,1) NOT NULL,
    [Name] [varchar](100) NOT NULL,
    [Resume] [varbinary](max) NOT NULL,
    [DocType] [nvarchar](20) NOT NULL,
    [DocName] [nvarchar](100) NOT NULL,
 CONSTRAINT [PK_tblCandidateProfiles] PRIMARY KEY CLUSTERED 
(
    [Candidate_ID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

My table above has a PK on the candidate_ID column which is an IDENTITY column, so it is of type INT. I create the following full-text catalog and full-text index using the PK_CandidateProfiles as my KEY INDEX.

Full Text Index/Catalog Definition

-- Full Text Catalog Creation DDL
CREATE FULLTEXT CATALOG [Catalog_Resume_Search]WITH ACCENT_SENSITIVITY = OFF
AUTHORIZATION [dbo]
go

-- Full Text Index Creation DDL 
CREATE FULLTEXT INDEX ON [dbo].[tblCandidateProfiles]( 
[DocName] LANGUAGE [English], 
[DocType] LANGUAGE [English], 
[Name] LANGUAGE [English], 
[Resume] TYPE COLUMN [DocType] LANGUAGE [English]) 
KEY INDEX [PK_tblCandidateProfiles]ON ([Catalog_Resume_Search], FILEGROUP [PRIMARY]) 
WITH (CHANGE_TRACKING = AUTO, STOPLIST = SYSTEM) 
GO 

Okay, now that things are setup we can go ahead and insert a single document into this table.

Inserting 1 Sample Row/Document into the base table

SELECT 'TheSQLDude' AS Name, '.docx' AS doctype, 'TheSQLDude_Resume1.docx' AS docname,       
      * FROM OPENROWSET(BULK N'C:\MyResume.docx', SINGLE_BLOB) AS Resume

I used the OPENROWSET function to directly insert the word document as BLOB into the base table, since the datatype of the Resume column is varbinary(max) meaning it’s for storing BLOB data. This is possible starting with SQL Server 2005 onwards. If things go well, then you should be able to search for keywords based off this document. Like this for example,

select candidate_id, Name, DocName, Resume from
dbo.tblCandidateProfiles where contains (Resume, 'T-SQL')

image

As you can see in the above output, the document was successfully indexed and hence we are getting search results. If you happen to look into the SQLFT logs, which by the way are in the format SQLFT<dbid><catalog_id>.log, so in my case I looked into SQLFT0000800009.LOG, here is what I found for my 1 row that I inserted recently.

2012-07-06 01:21:26.14 spid38s     Informational: Full-text Auto population completed for table or indexed view ‘[test].[dbo].[tblCandidateProfiles]’ (table or indexed view ID ‘475148738’, database ID ‘8’). Number of documents processed: 1. Number of documents failed: 0. Number of documents that will be retried: 0.

So far things are good. But what happens if certain document types (like .PDF) or certain documents don’t return you search results, because they failed to get indexed. How do you find which document failed, if your table contained thousands of documents. Fear not, this is exactly where the full-text key values comes into the picture.

Identifying documents that failed to get indexed

Let me insert a junk document that I know will fail, into the same table mentioned above. I am taking a sample JPG Image file to show how this works, since there is no point in full-text indexing an image, because you can’t search inside an image. Duh!

INSERT INTO dbo.tblCandidateProfiles(Name, DocType, DocName, Resume) 
   SELECT 'JunkImage' AS Name, '.jpg' AS doctype, 'Heaven.jpg' AS docname,       
      * FROM OPENROWSET(BULK N'C:\heaven.jpg', SINGLE_BLOB) AS Resume

Once the row is inserted, lets open the same full-text log file SQLFT0000800009.LOG (DBID=9, Catalog_ID=9)

2012-07-06 01:20:13.12 spid38s     Warning: No appropriate filter was found during full-text index population for table or indexed view ‘[test].[dbo].[tblCandidateProfiles]’ (table or indexed view ID ‘475148738’, database ID ‘8’), full-text key value ‘2’. Some columns of the row were not indexed.

As you can see from the highlighted snippet above, that the 2nd row I inserted (junk JPG image) did NOT get indexed, since there was no Filter present for the JPG type. In the SQL FT Logs it will only give you the full-text key value which failed. In the above case, the key value = 2.

Let’s look back to the original KEY INDEX that I had defined. Since it was an INT data type, the key value is also displayed as an INT value. To find the row/document, you just have to look-up the row with candidate_id=2 and you got the document that did not get indexed.

Things get a little more complicated when you have some char/varchar or composite index defined as the PK or Unique KEY index. Let’s look at that scenario

Identifying documents that failed to get indexed (KEY INDEX is on column with CHARACTER data-type).

I altered the table with 1 change. Dropped the identity column and changed the data-type for the Candidate_ID to varchar(50). This still remains the column on which the Primary Key index exists. Then, I re-created the full-text index on this table.

[Candidate_ID] [varchar](50) NOT NULL,

I inserted 1 document and here is what I see in the full-text log

INSERT INTO dbo.tblCandidateProfiles(Candidate_ID, Name, DocType, DocName, Resume) 
   SELECT 'FBD367D1-642F-47CF-B79B-9BE48FB34007' AS Candidate_ID, 'TheSQLDude' AS Name, '.jpg' AS doctype, 'JunkDocument' AS docname,       
      * FROM OPENROWSET(BULK N'C:\heaven.jpg', SINGLE_BLOB) AS Resume

2012-07-06 01:51:21.36 spid33s     Warning: No appropriate filter was found during full-text index population for table or indexed view ‘[test].[dbo].[tblCandidateProfiles]’ (table or indexed view ID ‘875150163’, database ID ‘8’), full-text key value ‘FBD367D1-642F-47CF-B79B-9BE48FB34007’. Some columns of the row were not indexed.

Notice the random GUID value reported as failed. That is the value of my candidate_ID and since this was my KEY INDEX it is also reporting the same value in the FTLOG as having failed. So I just need to lookup the base table for rows that have the value of candidate_id = FBD367D1-642F-47CF-B79B-9BE48FB34007, and I will get the document that failed to get indexed. Life is still good with character columns, since SQL Server reports the same value in the log.

Another example of how easy this is with SQL 2008 R2

2012-07-06 02:05:03.66 spid33s     Error ‘0x8004170c: The document format is not recognized by the filter.’ occurred during full-text index population for table or indexed view ‘[test].[dbo].[fts_table2]’ (table or indexed view ID ‘1083150904’, database ID ‘8’), full-text key value ‘33377ver3.ppt’. Failed to index the row.


2012-07-06 02:05:04.66 spid33s     Informational: Full-text Auto population completed for table or indexed view ‘[test].[dbo].[fts_table2]’ (table or indexed view ID ‘1083150904’, database ID ‘8’). Number of documents processed: 1. Number of documents failed: 1. Number of documents that will be retried: 0.

Identifying documents that failed to get indexed in SQL Server 2005

Prior to SQL Server 2008, the full-text key value being reported in the SQLFT* logs would be in varbinary(900) format, so one you have to convert the value to make sense out of it. We still have to do this on SQL Server 2005. Here is how you would do it in SQL Server 2005.


2012-07-06 02:33:29.95 spid22s     Error ‘0x8004170c: The document format is not recognized by the filter.’ occurred during full-text index population for table or indexed view ‘[test].[dbo].[fts_table]’ (table or indexed view ID ‘2073058421’, database ID ‘5’), full-text key value 0x506F776572706F696E7446696C652E707074. Attempt will be made to reindex it.

2012-07-06 02:33:29.95 spid22s     The component ‘offfilt.dll’ reported error while indexing. Component path ‘C:\WINDOWS\system32\offfilt.dll’.

2012-07-06 02:33:49.92 spid22s     Warning: No appropriate filter was found during full-text index population for table or indexed view ‘[test].[dbo].[fts_table]’ (table or indexed view ID ‘5’, database ID ‘2073058421’), full-text key value 0x496D61676546696C652E6A7067. Some columns of the row were not indexed.

2012-07-06 02:35:53.81 spid22s     Error ‘0x8004170c: The document format is not recognized by the filter.’ occurred during full-text index population for table or indexed view ‘[test].[dbo].[fts_table]’ (table or indexed view ID ‘2073058421’, database ID ‘5’), full-text key value 0x446F63756D656E7446696C652E646F63. Attempt will be made to reindex it.

2012-07-06 02:35:53.81 spid22s     The component ‘offfilt.dll’ reported error while indexing. Component path ‘C:\WINDOWS\system32\offfilt.dll’.

In the above example, how would you identify the rows in the base table based on these values in the SQLFT logs (0x506F776572706F696E7446696C652E707074, 0x496D61676546696C652E6A7067, 0x446F63756D656E7446696C652E646F63) ???

I mentioned earlier that the full-text key values are in varbinary(900) format, so lets convert these values depending on what the data-type of the key column is (which is part of the Key index).

1. Identify the full-text key column on your table and its data-type using a query like given below

select a.name as [KeyColumn], b.name as [DataType], a.prec  as [Precision] from sys.syscolumns a
join sys.types b
on a.xtype = b.system_type_id
where colid=objectproperty(object_id('AllDocStreams'),'TableFulltextKeyColumn')
and id=object_id('AllDocStreams')

image

2. As you can see in the above output, the key column in on the column [docname] and it’s of type varchar(50). Before I show you the conversion part, here is the data in my table fts_table.
image
3. The Conversion Part.
  1. Full-Text Key Value is in Varbinary(900) format
  2. The Table Key Column is in Varchar(50) format
  3. Converting using the CONVERT() function, we see that 0x506F776572706F696E7446696C652E707074 translates to “PowerpointFile.ppt”. This is the document that failed to get indexed
declare @varbinary varbinary(900)
set  @varbinary =0x506F776572706F696E7446696C652E707074 
select convert(varchar(50),@varbinary)

 

image

4. If you have a lot of these types of indexing errors, make a list of the full-text key values reported in the SQLFT logs and you can directly query the base table using a query like given below to identify ONLY THOSE ROWS that failed to get indexed. You can add anything combination of unique columns you want to the SELECT list, but the where clause must be filtered on the full-text Key column identified in Step #1, which is “docname” in this example.

select docname, convert(varbinary(900),docname) as [Converted-FullTextKeyValue]
from fts_table where docname 
in (convert(varbinary(900),0x506F776572706F696E7446696C652E707074),
convert(varbinary(900),0x496D61676546696C652E6A7067),
convert(varbinary(900),0x446F63756D656E7446696C652E646F63) )

image

What Next, once you have identified the rows that failed to get full-text indexed?

Once you have identified a bunch of rows that failed to index, you can start to find out if there are any common patterns among these documents. The things to look out for are:-

  1. Do all the documents that failed have a common file-type. e.g. Did all .PDF or .DOCX files fail?
  2. Do these documents have the appropriate IFilters installed? e.g. Do we have a filter for .JPG present on the machine?
  3. Are all these failed documents from a single source/application?
  4. Is the source of all these documents a single computer/document library?

Life is even more easier with SQL Server 2008 onwards, since the entire procedure described above is done by a system stored procedure called sp_fulltext_keymappings

DECLARE @table_id int = OBJECT_ID(N'dbo.tblCandidateProfiles');
EXEC sp_fulltext_keymappings @table_id;

The [key] column returned by this system stored procedure is the Converted full-text key value which you will see in the SQLFT* log.

So now you know how to identify the rows/documents that failed to get indexed. In my next post on full-text, I will cover some of the common reasons for indexing failures and how to go about troubleshooting those. TTFN.

 

Posted in FTS (Full-Text Search) | Tagged: , , , , , , , , , | 4 Comments »

Getting a custom IFilter working with SQL Server 2008/R2 (IFilterSample)

Posted by Sudarshan Narasimhan on July 14, 2011


For those of you not familiar with my past work, I am a moderator of the official SQL Server Full-Text Search blog on MSDN and I recently posted there an article about custom IFilter and SQL Server. I am attaching a sample of the project I downloaded and built from Windows SDK v6.1 which works with SQL Server 2008 and R2.

Background

In this post I will cover how to get a custom IFilter working on SQL Server 2008/R2 (64-bit or 32-bit). For the purpose of this demo, I am using the IFilterSample that is available with Windows Server 2008 SDK.  Once you have installed the SDK you can find the sample here à C:\Program Files\Microsoft SDKs\Windows\v6.1\Samples\winui\wsearch\IFilterSample

The IFilterSample implements a sample IFilter for a fictitious file format called “.filtersample” which is actually an XML file. This sample will work on Windows Vista and Windows 7.

 

Developers out there might already be familiar with using the IFilter interface to develop the filters. For those who aren’t you can start from here,


Windows Search Interfaces

http://msdn.microsoft.com/en-us/library/ee872063(v=VS.85).aspx


Data Add-in Interfaces

http://msdn.microsoft.com/en-us/library/ee872090(v=VS.85).aspx


IFilter Interface

http://msdn.microsoft.com/en-us/library/bb266451(v=vs.85).aspx

 


Building and Setting up the IFilter DLL

I am going to use IFilterSample as reference and must warn you this post is NOT a how-to-write-custom-IFilter lessonJ.


The IFilterSample is a native C++ COM based DLL and as part of its registration provides a CLSID that can be registered on the system. Here are the step-by-step instructions you came here for:-

 

1.       Download and install the Windows Server 2008 SDK on your machine.

2.       Copy the sample from C:\Program Files\Microsoft SDKs\Windows\v6.1\Samples\winui\wsearch\FilterSample, onto your desktop which is the copy we will use to build the project.

3.       Open the FilterSample.sln file and  in Additional Dependencies in Linker Properties, add xmllite.lib

clip_image002[4]

4.       Build the project for x64 or Win32 as appropriate (this would be dependent on the architecture of your SQL Server this IFilter will be used on).

5.       Once the project is built you will have FilterSample.DLL in your Debug or Release folder. Make sure that the SQL Server service account has FULL CONTROL permissions on this folder.

6.      From an elevated command prompt browse to the location of the DLL and register it using regsvr32 FilterSample.dll

clip_image004[4]

 

clip_image006[4]

 

7.       Now we need to create some registry keys in SQL Server to make it use this filter DLL.

a.       The keys I created below are for a default instance of SQL Server 2008, so change the InstanceID as appropriate.

b.      The CLSID given below is specific to this IFilterSample only.

c.       Notice that the file extension I provided is called “.filtersample”.

d.      The path provided has to match the path from where it was registered in step #6.

 

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSearch\Filters\.filtersample]
@="{6C337B26-3E38-4F98-813B-FBA18BAB64F5}"

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSearch\CLSID\{6C337B26-3E38-4F98-813B-FBA18BAB64F5}]
@="C:\\Users\\sudarn\\Desktop\\FilterSample\\Release\\FilterSample.dll"
"ThreadingModel"="Both"
"Flags"=dword:00000001

 

8.       Now, connect to the SQL instance as a sysadmin and run the following sequence of commands

 

EXEC sp_fulltext_service @action=‘load_os_resources’, @value=1;

EXEC sp_fulltext_service ‘verify_signature’, 0

EXEC sp_fulltext_service ‘update_languages’;

EXEC sp_fulltext_service ‘restart_all_fdhosts’;

 

9.       To verify the FilterSample.DLL has been successfully loaded and recognised by SQL Server, you can check the DMV sys.fulltext_document_types or use sp_help_fulltext_system_components ‘filter’

 

 

Testing the IFilter

That was all, folks! You can now do a test to verify that the search is working as expected. There are 2 ways of testing the IFilter.

 

1.      Using SQL Server

a.       Create a test table and enable it for full-text indexing.

CREATE TABLE [dbo].[testTBL](

      [doctype] [nvarchar](50) NOT NULL,

      [document] [varbinary](max) NULL,

      [docname] [varchar](50) NOT NULL,

 CONSTRAINT [PK_testTBL] PRIMARY KEY CLUSTERED

(

      [docname] ASC

)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]

) ON [PRIMARY]

 

GO

CREATE FULLTEXT CATALOG [CatalogTEST] WITH ACCENT_SENSITIVITY = OFF
AUTHORIZATION
[dbo]

 

GO

CREATE FULLTEXT INDEX ON [dbo].[testTBL](

[docname] LANGUAGE [English],

[doctype] LANGUAGE [English],

[document] TYPE COLUMN [doctype] LANGUAGE [English])

KEY INDEX [PK_testTBL]ON ([CatalogTEST], FILEGROUP [PRIMARY])

WITH (CHANGE_TRACKING = AUTO, STOPLIST = SYSTEM)

GO

 

b.      Insert a sample document into the table. The IFilterSample comes with a file called “SampleFile.filtersample”, which can be used for test purposes.

 

INSERT INTO dbo.testTBL(doctype,docname,document)

SELECT ‘.filtersample’ AS doctype, ‘SampleFile.filtersample’ AS docname,

* FROM OPENROWSET(BULK N’C:\Users\sudarn\Desktop\FilterSample\SampleFile.filtersample’, SINGLE_BLOB) AS Document

 

c.       Perform a test search

select * from dbo.testTBL where contains(document, ’emergency’)

 

d.      You can also check to see the document was indexed by querying DMV,

SELECT * FROM sys.dm_fts_index_keywords(db_id(‘test’), object_id(‘dbo.testTBL));

 

2.      Using FILTDUMP.exe

 

Filtdump.exe is a sample tool that is available again in WinSDK and can be used to test a particular filter by loading it and then getting the text chunk out of any sample document. It is available in both 32-bit and 64-bit versions. It can be found here once you install WinSDK,

C:\Program Files\Microsoft SDKs\Windows\v6.1\Bin\x64\filtdump.exe

C:\Program Files\Microsoft SDKs\Windows\v6.1\Bin\filtdump.exe

 

For example,

C:\Program Files\Microsoft SDKs\Windows\v6.1\Bin\x64>filtdump.exe "C:\Users\s

udarn\Desktop\FilterSample\SampleFile.filtersample"

 

<snippet of the output>

———————————————————————-

Type = 31 (0x1f): AuthorXYZ

 

———————————————————————-

                                Attribute = B725F130-47EF-101A-A5F102608C9EEBAC\19

                                idChunk = 4

                                BreakType = 0 (No Break)

                                Flags(chunkstate) = 0x1 (Text)

                                Locale = 1033 (0x409)

                                IdChunkSource = 0

                                cwcStartSource = 0

                                cwcLenSource = 0

———————————————————————-

 this is a test of the emergency broadcasting system.

 

This is the content of the document.

 

I’m most pleased that it is the content of the document.

 

</snippet>

 

If you see the above then the IFilter is working fine as you are able to read it and get the text chunks out.

 

 

Possible Issue(s)

You might receive this error message in the SQLFT logs when trying to index a newly inserted row. This indicates that the IFilter was not recognised and there was an issue in loading it.

 

“Warning: No appropriate filter was found during full-text index population for table or indexed view ‘[test].[dbo].[Documents]’ (table or indexed view ID ‘1003150619’, database ID ‘6’), full-text key value ‘SampleFile.filtersample’. Some columns of the row were not indexed.”

 

(OR)

 

Error ‘0x8004170c: The document format is not recognized by the filter.’ occurred during full-text index population for table or indexed view ‘[test].[dbo].[Documents]’ (table or indexed view ID ‘1003150619’, database ID ‘6’), full-text key value ‘SampleFile.filtersample’. Failed to index the row.

 

 

The possible things to check for above error are:-

1.       Ensure that the registration and other steps given above are done in the same sequence as indicated.

2.       If you are using the Debug build ensure that the VCRuntime Debug dependency DLL is present (msvcr90d.dll or msvcr100d.dll).

3.       Use Process Monitor to ensure that your DLL is getting loaded

4.       Use Dependency Walker to ensure there are no other missing dependencies for your DLL.

5.       Verify that the DLL architecture (x64/x86) matches your SQL Server architecture. You cannot mix & match this.

 

I must point out one important thing here if you are going to use this DLL on other machines. You will need to have the Visual Studio Redistributable package installed on the target machine where you plan to deploy this. You will need the corresponding redistributable package for the VS version you used to build the DLL.


FilterSample VS2010 Project

You can download the FilterSample project I used for my testing from here. This was built using Visual Studio 2010 on Windows Server 2008 R2 x64. Just FYI – The Release and Debug DLL present in the attached project is x64. You can rebuild to x86 as per your requirements.

https://skydrive.live.com/?cid=2c5b6bdee139d310&sc=documents&uc=1&id=2C5B6BDEE139D310%21135# 

 

Have fun IFiltering and stay tuned for more on this blog.

Posted in FTS (Full-Text Search) | Tagged: , , , , , , | 4 Comments »