6

I'm trying to parse readable data from PDFs and I keep ending up writing code like the following:

if (IsDob(line))
{
    dob = ParseDob(line);
}
else if (IsGender(line))
{
    gender = ParseGender(line);
}
...
...
...
else if (IsRefNumber(line))
{
    refNumber = ParseRefNumber(line);
}
else
{
    unknownLines.Add(line);
}

Then I use all this data to build up relevant objects, e.g. Customer using all of their personal data.

This tends to get kind of ugly when there's lots to parse.

I've split them up into functions like TryParsePersonalInfo(line), TryParseHistory(line) etc. But that feels like it's just moving the problem as I still have these endless else ifs everywhere.

Levi H
  • 177

4 Answers4

10

Here's what I would start with given the information you've provided.

Create an interface like so:

interface LineProcessor<E> {
  boolean Process(Record record, Line line); 
}

Let's assume Record is a propery bag for now. Don't get hung up on this, I'm keeping it simple for demonstration purposes.

class Record {
  public Date dob { get; set; }
  public String gender { get; set; }
  public String refNumber { get; set; }
  // ...
}

Sorry if the syntax is not right for C#. If you aren't sure what I'm getting at, I'll clarify.

Then create a list of instances of LineParser: one for type of line. Then, you write a loop (python/pseudocode):

for line in pdf:
  obj = process(record, line)

def process(record, line):
  for handler in processors:
    if handler.process(record, line): return
  else:
    unknown.add(line)

Now the implementation of one of these might look like so:

class DobProcessor implements LineProcessor {
  boolean process(Record record, Line line) {
    if (IsDob(line)) {
      record.dob = ParseDob(line);
      return true;
    } else {
      return false;
    }
  }

  Date ParseDob(Line line) {
    //...
  }

  boolean IsDob(Line line) {
    //...
  }
}

This should make the code more manageable. Instead of a large if statement, you will have a number of classes where each one is specific to a case. This not only organizes the code, it means you can add or remove cases without touching any of the code around other cases.

One other thing is that now that the processes interface is down to a single method, you can actually start thinking of this as more of a function pointer/lambda so you can dispense with the interface if you so desire.

JimmyJames supports Canada
  • 30,578
  • 3
  • 59
  • 108
3

Use a list of delegates

...to call idemopotent parsing functions encapsulated in an injectable class contained in an extesnsible object model

I'm going to introduce a few programming concepts here so bear with the long answer. This may seem overcomplicated until you're used to it. But all of these patterns are actually very common and very useful in commercial software.

Create a DTO

First, you need somewhere to store your results all together. A DTO class perhaps; would look like this example. I added a ToString() override so you can see the contents in the debug pane instead of just the class name:

public enum Gender
{
    Male, Female
}

public class DocumentMetadata
{
    public DateTime DateOfBirth { get; set; }
    public Gender Gender { get; set; }
    public string RefNum { get; set; }

    public override string ToString()
    {
        return string.Format("DocumentMetadata: DOB={0:yyyy-MM-dd}, Gender={1}, RefNum={2}", DateOfBirth, Gender, RefNum);
    }
}

Define a delegate for parser methods that follow a pattern

Now that we have a DTO, we can reason about how to parse the lines. Ideally, we'd want a series of idempotent functions that you can unit test easily. And to iterate over them, it would be helpful if they were similar in some way. So we define a delegate:

public delegate bool Parser(string line, DocumentMetadata dto);

So we can write a parser method sort of like this:

protected bool ParseDateOfBirth(string line, DocumentMetadata dto)
{
    if (!line.StartsWith("DOB:")) return false;
    dto.DateOfBirth = DateTime.Parse(line.Substring(4));
    return true;
}

We can write any number of parser methods, and as long as they return a boolean and accept a string and a DTO object as arguments, they will all match the delegate, so they could all be put into a list, sort of like this:

List<Parser>  parsers = new List<Parser>
{
    ParseDateOfBirth,
    ParseGender,
    ParseRefNum
};

We'll use this capability in a moment when we write the parser class.

Create the base parser class

Now here are a few more things to worry about:

  1. We want our parser methods contained in a single unit of code, e.g. a class.
  2. We want the class to be injectable, e.g. in case you need to have more than one kind of parser in the future, and so you can stub it out for unit tests.
  3. We want the logic for iterating parsers to be common.

So first define an interface:

public interface IDocumentParser
{
    DocumentMetadata Parse(IEnumerable<string> input);
}

An abstract base parser:

public abstract class BaseParser : IDocumentParser
{
    protected abstract List<Parser> GetParsers();

    public virtual DocumentMetadata Parse(IEnumerable<string> input)
    {
        var parsers = this.GetParsers();
        var instance = new DocumentMetadata();

        foreach (var line in input)
        {
            foreach (var parser in parsers)
            {
                parser(line, instance);  //This is the line that does it all!!!
            }
        }
        return instance;
    }       
}

Or if we want the Parse function to be a bit more clever (and also count the lines that are successfully parsed):

    public virtual DocumentMetadata Parse(IEnumerable<string> input)
    {
        var parsers = this.GetParsers();
        var instance = new DocumentMetadata();

        var successCount = input.Sum( line => parsers.Count( parser => parser(line, instance) ));

        Console.WriteLine("{0} lines successfully parsed.", successCount);

        return instance;
    }

While the LINQ solution is more "clever" the nested loops may communicate the intent more clearly. Judgment call here. I like the LINQ version because I can count the lines that succeed, and possibly use that information to validate the document.

Implement the parsers

Now we have a basic framework for parsing documents. All we need is to implement GetParsers so that it returns a list of methods that do the work:

public class DocumentParser : BaseParser
{
    protected override List<Parser> GetParsers()
    {
        return new List<Parser>
        {
            ParseDateOfBirth,
            ParseGender,
            ParseRefNum
        };
    }

    private bool ParseDateOfBirth(string line, DocumentMetadata dto)
    {
       ///Implementation
    }

    private bool ParseGender(string line, DocumentMetadata dto)
    {
       ///Implementation        
    }

    private bool ParseRefNum(string line, DocumentMetadata dto)
    {
       ///Implementation
    }
}

Notice that only the document-specific logic is held in the final implementation. And all it does is supply the delegates via GetParsers(). All common logic is in the base class, where it can be reused.

Test

We can now parse a document with a couple lines of code:

var parser = new DocumentParser();
var doc = parse.Parse(input);

But we'd like to inject this thing, so let's write it properly:

public class Application
{
    protected readonly IDocumentParser _parser; // injected

    public Application(IDocumentParser parser)
    {
        _parser = parser;
    }

    public void Run()
    {
        var input = new string[]
        {
            "DOB:1/1/2018",
            "Sex:Male",
            "RefNum:1234"
        };

        var result = _parser.Parse(input);

        Console.WriteLine(result);
    }
}

public class Program
{
    public static void Main()
    {
        var application = new Application(new DocumentParser());
        application.Run();
    }
}

Output:

3 lines successfully parsed.
DocumentMetadata: DOB=2018-01-01, Gender=Male, RefNum=1234

Now we have all of the following:

  1. Generic logic for iterating over all the parsers
  2. An extensible object model allowing new parsers to be introduced
  3. An injectable interface
  4. Idemopotent, unit-testable methods that do the complicated stuff
  5. The ability to count successful parse operations (which could be used, for example, to ensure the document is valid)
  6. A class that encapsulates the resulting data

Working example on DotNetFiddle

John Wu
  • 26,955
1

Answer to your Question

A more C#-esque solution would be to take advantage of delegates:

delegate bool TryParseHandler(string line);

private readonly TryParseHandler[] _handlers = new[]
{
    TryParseDob,
    TryParseGender,
    TryParseRefNumber,
    //...
}
bool TryParseDob(string line)
{
    if(!IsDob(line)) return false;
    dob = ParseDob(line);
    return true;
}
//etc
void ProcessLine(string line)
{
    foreach(var handler in _handlers)
    {
        if(handler(line)) return;
    }
}

Answer to your Problem

The correct answer would be to ditch your "is" solution entirely. What does "line" look like? Is it regular? Does it contain keywords? For example, does it look like dob: 1/1/1990, gender: female, ref: 123456? If so, you want to pull out those keywords and use that for your lookup:

private readonly IReadOnlyDictionary<string, Action<string>> _setValueLookup = new Dictionary<string, Action<string>>
{
    ["dob"] = s => dob = DateTime.Parse(s),
    ["gender"] = s => gender = ParseGender(s),
    ["ref"] = s => refString = s,
};
void ProcessLine(string line)
{
    var pair = line.Split(new []{':'}, 2);
    if(pair.Length < 2) 
    {
        unknownLines.Add(line);
        return;
    }

    var key = pair[0];
    var value = pair[1];
    if(!_setValueLookup.TryGetValue(key, out Action<string> callback))
    {
        unknownLines.Add(line);
        return;
    }

    callback(value);
}
Kevin Fee
  • 2,847
-1

If you are happy using lambdas and linq, you can simplify all of that code as:

var lineHandlers = new List<(Func<LineType, bool> test, Action<LineType> action)>
{
    (IsDob, l => dob = ParseDob()),
    (IsGender, l => gender = ParseGender(l)),
    (IsRefNumber, l => refNumber = ParseRefNumber(l)),
    (_ => true, l => unknownLines.Add(l))
};

var (_, action) = lineHandlers.First(handler => handler.test(line));
action(line);

(Note, requires C# 7 in order to use tuples. Just replace that tuple with a simple type if you are using an old version of the compiler).

David Arno
  • 39,599
  • 9
  • 94
  • 129