I'm using SQL Server 2014 with FileTables to store a large number of documents in different formats. The iFilters are working great, and everything is getting indexed with FTS + Semantic Search. Now I'd like to run some additional processing on the text of those documents, but I don't see a reason to have my pipeline redo the work of decoding and extracting the text from the files.
It seems there should be an obvious solution ... but I've been running in circles without any luck.
So the question is:
How can I query to return the full plaintext of a file in T-SQL?
If that's not possible, can it be done in SSIS or SSAS after the normal FTS parser has run?
If that's not possible, is there a way to hook into the FTS pipeline (via a trigger perhaps) so I can copy the extracted plain text into another table?
Alternate solutions are appreciated as well if you've got good examples for me to reference. The only immediate idea I had was to use a different network share for drop-off, have SSIS pick up each file and extract the text (no idea how to do that), and then move the file + text to SQL Server ... but that seems wonky for a lot of reasons.
[Edited to clarify "why"]
If SQL Server has already pulled out the text in order to chunk it & do the base NLP for the semantic index ... I'd rather use that than reinvent the wheel. The specific uses I'm looking into are post-processing with other NLP utilities (e.g. NLTK, Gensim, Stanford NLP NER) so that I can generate extractive document summaries, store n-gram statistics for my corpus, and add NER for more effective faceted search.
If I have to extract the text from the files before storing them in SQL Server (either using SSIS/.NET so I can keep the iFilters, or using a different tool altogether), then SQL Server's ability to do that work on FileTables is of limited use for anything but the most basic tasks.
Given the number of document formats already supported, recreating the feature is a major task. Similarly, having to go back to the actual file afterwards and redo that indexing work is inefficient; at that point it would seem more sensible to disable FTS on FileTables, skip FileTables altogether, or drop SQL Server for document-based FTS entirely.