I have a legacy system with about 10 million rows in a table. In that table there is a column of type text, most of them are standard text but about 500 thousand rows have RTF markup in them. I need to convert the RTF formatted text in to plain text.
My current method is I have a C# program that loads the query in to a DataTable using a SqlDataAdapter and uses the winforms RichTextBox control to do the conversion.
void bw_DoWork(object sender, DoWorkEventArgs e)
{
count = 0;
rtbRTFToPlain = new RichTextBox();
using (SqlDataAdapter ada = new SqlDataAdapter("select note_guid, notes from client_notes", Globals.SQLConnectionString))
using(SqlCommandBuilder cmb = new SqlCommandBuilder(ada))
{
DataTable dt = new DataTable();
ada.UpdateCommand = cmb.GetUpdateCommand();
ada.Fill(dt);
int reportEvery = dt.Rows.Count / 100;
if (reportEvery == 0)
reportEvery = 1;
foreach (DataRow row in dt.Rows)
{
if (count % reportEvery == 0)
bw.ReportProgress(count / reportEvery);
try
{
if (((string)row["notes"]).TrimStart().StartsWith("{") == true)
{
rtbRTFToPlain.Rtf = (string)row["notes"];
row["notes"] = rtbRTFToPlain.Text;
}
}
catch
{
}
count++;
}
bw.ReportProgress(100);
this.Invoke(new Action(() =>
{
this.ControlBox = false;
this.Text = "Updating database please wait";
}));
ada.Update(dt);
}
}
This is working great for small tables, however this is the first time I had to run it on a table with such a large data-set (some of the rtf files can be several megabytes in size with embedded pictures) and I am getting OutOfMemory errors with my C# program.
I know I can chunk my query down in to a smaller batches, but I wanted to see if there is a better way that I was missing to strip off RTF formatting.
Should I just do the same thing as my current solution but only query out data smaller chunks at a time, or is there a better way to do this?