How can I return the full text of a document indexed in SQL Server Full-Text?

Question

I'm using SQL Server 2014 with FileTables to store a large number of documents in different formats. The iFilters are working great, and everything is getting indexed with FTS + Semantic Search. Now I'd like to run some additional processing on the text of those documents, but don't see a reason to have the pipeline redo the work of decoding, extracting, etc. the text from the files.

It seems there should be an obvious solution ... but I've been running in circles without any luck.

So the question is:

How can I query to return the full plaintext of a file in T-SQL?
If that's not possible, can it be done in SSIS or SSAS after the normal FTS parser has run?
If that's not possible, is there a way to hook into the FTS pipeline (via a trigger perhaps) so I can split the plain text into another table?

Alternate solutions are appreciated as well if you've got good examples for me to reference. The only immediate idea I had was to use a different network share for dropoff, have SSIS pick up the file and extract the text (no idea how to do that), and then to move the file + text to SQL server ... but that seems wonky for a lot of reasons.

[Edited to clarify "why"]

If SQL Server has already pulled out the text in order to chunk it & do the base NLP for the semantic index ... I'd rather use that than reinvent the wheel. Specific uses I'm looking into are post-processing with other NLP utilities (e.g. NLTK, GenSim, Stanford NLP NER, etc.) so that I can generate extractive document summarizations, store n-gram statistics for my corpus, and include NER for more effective faceted search.

If I have to extract the text out of files before storing them in SQL Server (either using SSIS/.NET so I can keep the iFilters OR by using a different tool altogether) there's limited usefulness in SQL Server's ability to perform that work on FileTables for anything but the most basic tasks.

Consider the number of document formats already supported - and it's a major task to recreate the feature. Similarly, having to go back to the actual file afterwards and redo that indexing work is inefficient and it'd seem sensible to disable FTS on FileTables, skip using them altogether, or scrap SQL Server for document-based FTS altogether.

RLF · Answer 1 · 2015-08-11T19:26:52.067

I am curious about what you will accomplish by getting the plain text. You already have the documents in the FileTable and you can open the files when needed using the appropriate tools.

For example: If you are looking at a PDF, a Word document, an Excel spreadsheet, and so forth you probably have the tools to look at the data. Most tools will even allow you to save the 'plain text' manually. But I suppose it depends on how you define "plain text". Save of Word document to a non-Unicode .TXT file? Or something more complex?

Of course, I realize that you would likely want a more automated approach, which I comment on further down.

One thing to know is that not all files will necessarily copy to "plain text". (At least I had trouble with plain text from Chinese.)

Regarding the Full Text Indexes:

Although the SQL Server Full Text Index is aware of the relative positions of 'words' in the index so that it can search for phrases, it has no interface to reveal the serial sequence of words in the text. Thus, there is no way (currently) to build plain text from the Full Text Index.

Even if those details were available, the Full Text Index still does not have everything needed to fully represent a document.

If you have any Stop Words configured, then they will not appear in the Full Text Index.
Punctuation is generally not included except when tokenized within a word, such as "pre-apocalypse", "F.B.I." and so forth.
Formatting is lost.
The language being used for Full Text Indexing may affect the results.

The Full Text Indexes are solving a different problem from creating plain text. However, my experience is that large text bases usually benefit from Full Text Indexing anyway, since someone is always trying to find something

I understand that once you get the text you will redundantly store it in a SQL Server table so as to expedite your use of the plain text data.

Automating the Extract and Load

This will require some work. But, since your documents are currently in a FileTable you should be able to access them from the file system using file tools. The "off-topic" answer to plain text extraction below includes several tools that are being used by others. Perhaps some of these tools would be useful to you.

http://stackoverflow.com/questions/5671988/how-to-extract-just-plain-text-from-doc-docx-files-unix

Since you are using SQL Server 2014, once you have plain text files you should be able to import them in a number of ways: BCP, SSIS, and other tools are available to load data.

Depending on your approach, you might choose to use a staging table to further prepare the results before moving the data into the destination table. And if you are also doing document versioning you will likely need to create some metadata that allows you to track different versions.

How can I return the full text of a document indexed in SQL Server Full-Text?

1 Answers1