8

I have a legacy system with about 10 million rows in a table. In that table there is a column of type text, most of them are standard text but about 500 thousand rows have RTF markup in them. I need to convert the RTF formatted text in to plain text.

My current method is I have a C# program that loads the query in to a DataTable using a SqlDataAdapter and uses the winforms RichTextBox control to do the conversion.

void bw_DoWork(object sender, DoWorkEventArgs e)
{
    count = 0;

    rtbRTFToPlain = new RichTextBox();

    using (SqlDataAdapter ada = new SqlDataAdapter("select note_guid, notes from client_notes", Globals.SQLConnectionString))
    using(SqlCommandBuilder cmb = new SqlCommandBuilder(ada))
    {
        DataTable dt = new DataTable();
        ada.UpdateCommand = cmb.GetUpdateCommand();

        ada.Fill(dt);

        int reportEvery = dt.Rows.Count / 100;
        if (reportEvery == 0)
            reportEvery = 1;
        foreach (DataRow row in dt.Rows)
        {
            if (count % reportEvery == 0)
                bw.ReportProgress(count / reportEvery);

            try
            {
                if (((string)row["notes"]).TrimStart().StartsWith("{") == true)
                {
                    rtbRTFToPlain.Rtf = (string)row["notes"];
                    row["notes"] = rtbRTFToPlain.Text;
                }
            }
            catch
            {
            }

            count++;

        }
        bw.ReportProgress(100);

        this.Invoke(new Action(() => 
            {
                this.ControlBox = false;
                this.Text = "Updating database please wait";
            }));
        ada.Update(dt);
    }
}

This is working great for small tables, however this is the first time I had to run it on a table with such a large data-set (some of the rtf files can be several megabytes in size with embedded pictures) and I am getting OutOfMemory errors with my C# program.

I know I can chunk my query down in to a smaller batches, but I wanted to see if there is a better way that I was missing to strip off RTF formatting.

Should I just do the same thing as my current solution but only query out data smaller chunks at a time, or is there a better way to do this?

Scott Chamberlain
  • 1,045
  • 1
  • 9
  • 25

4 Answers4

5

I ended up making a CLR function to convert it.

I found this library, I then tweaked it a tiny bit to remove things I did not need like logging and Drawing methods, which allowed me to mark it as safe.

I then just made this small class.

using System.Data.SqlTypes;
using Itenso.Rtf.Converter.Text;
using Itenso.Rtf.Support;

public partial class StoredProcedures
{
    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlString RtfToPlainText(SqlString text)
    {
        if (text.Value.StartsWith(@"{\rtf"))
        {
            RtfTextConverter textConverter = new RtfTextConverter();
            RtfInterpreterTool.Interpret(text.Value, textConverter);
            return textConverter.PlainText;
        }
        else
            return text;
    }
}

And ran this in SQL

sp_configure 'clr enabled', 1
GO
RECONFIGURE
GO

CREATE ASSEMBLY ConversionsSqlExtensionsAssembly 
from 'E:\Code\ConversionsSqlExtensions\bin\Debug\ConversionsSqlExtensions.dll' 
WITH PERMISSION_SET = safe
go

CREATE function RtfToPlainText(@value nvarchar(max))
returns nvarchar(max)
AS EXTERNAL NAME ConversionsSqlExtensionsAssembly.StoredProcedures.RtfToPlainText

And it is fast and works great!

Scott Chamberlain
  • 1,045
  • 1
  • 9
  • 25
2

I did the same thing as Scott Chamberlain with the Itenso RTF DLL, but in my case there was a LOT more work that had to be done before this could be marked as SAFE in my SQL 2008R2 database.

First, like Scott, I had to remove the reference to System.Drawing. I found the easiest way to do that was to remove the reference, recompile and then rewrite the bits of code that were making use of the library. In most cases I just removed ALL the code from the VOID functions that were using it, and in situations where I couldn't I changed Drawing/Color objects to just "object" objects.

The other thing I had to do was remove any and all references to log4net as it references a library System.DirectoryServices which cannot be marked as safe either. This was a little harder but generally I took the same approach.

Lastly, after I did that I got complaints about setting static values, which is not allowed in a SAFE CLR function. So I updated the code to change all Static values to READONLY and that worked (this was pretty much all in the "logging" section of the code which I didn't really care about any way).

My final CLR code looked like so:

[Microsoft.SqlServer.Server.SqlFunction]
[return: SqlFacet(MaxSize=-1)]
public static SqlChars RTFFix([SqlFacet(MaxSize=-1)]string rtfField)
{
    SqlChars returnChars;
    try
    {

        RtfTextConverter textConverter = new RtfTextConverter();
        RtfInterpreterTool.Interpret(rtfField, textConverter);
        returnChars = new SqlChars(new SqlString(textConverter.PlainText.Trim()));
    }
    catch (Exception e)
    {
        returnChars = new SqlChars(new SqlString(rtfField));
    }
    return returnChars;
}
1

I wrote a small SQL function that scrapes the text out of markup-ish strings: http://cookingwithsql.com/index.php?option=com_content&task=view&id=65&Itemid=60

Unfortunately it will only handle string data up to 8000 characters. Perhaps it could be changed to use varchar(max), if you are running on SQL 2005 and up.

usage:

select dbo.ScrapeText('<I love SQL> gobbldygook font 12 blah blah') as 'Result'

Result
----------------------------
I love SQL

I'll post the source here for quick reference.

use master
IF (object_id('dbo.ScrapeText') IS NOT NULL)
BEGIN
  PRINT 'Dropping: dbo.ScrapeText'
  DROP function dbo.ScrapeText
END
GO
PRINT 'Creating: dbo.ScrapeText'
GO
CREATE FUNCTION dbo.ScrapeText 
(
  @string varchar(8000)
) 
returns varchar(8000)

AS
BEGIN
---------------------------------------------------------------------------------------------------
-- Title:        ScrapeText
--               
-- Date Created: April 4, 2006
--               
-- Author:       William McEvoy
--               
-- Description:  This function will attempt to remove markup language formatting from a string. This is 
--               accomplished by concetenating all text contained between greater than and less 
--               than signs within the formatted text.  
--               
---------------------------------------------------------------------------------------------------
-- Date Revised: 
-- Author:       
-- Reason:       
---------------------------------------------------------------------------------------------------

declare @text  varchar(8000),
        @PenDown char(1),
        @char  char(1),
        @len   int,
        @count int

select  @count = 0,
        @len   = 0,
        @text  = ''


---------------------------------------------------------------------------------------------------
-- M A I N   P R O C E S S I N G
---------------------------------------------------------------------------------------------------

-- Add tokens
select @string = '>' + @string + '<'

-- Replace Special Characters
select @string = replace(@string,' ',' ')

-- Parse out the formatting codes
select @len = len(@string)
while (@count <= @len)
begin
  select @char = substring(@string,@count,1)

  if (@char = '>')
     select @PenDown = 'Y'
  else 
  if (@char = '<')
    select @PenDown = 'N'
  else  
  if (@PenDown = 'Y')
    select @text = @text + @char

  select @count = @count + 1
end

RETURN @text
END
GO
IF (object_id('dbo.ScrapeText') IS NOT NULL)
  PRINT 'Function created.'
ELSE
  PRINT 'Function NOT created.'
GO
datagod
  • 7,141
  • 4
  • 38
  • 58
1

If you use a DataReader instead of a DataTable, you can process the rows one at a time instead of loading it all into memory. That should bypass your out of memory errors.

Turntwo
  • 11
  • 1