13

I want a configuration file for a .NET program. This file is to configure pairs of regular expressions. The regular expressions belong within a hierarchy of sections.

Section1
    SubsectionA
        regular expression
        regular expression
    SubsectionB
        regular expression
        regular expression
Section2
    (etc.)

Or in Markdown format

# Section1

SubsectionA

regular expression
regular expression

Anyway I want a configuration file format in which the regular expression literals do not need to be escaped.

What configuration file format supports this? Even YAML requires escaping.

The two examples I showed above -- i.e. an indented text file, and Markdown -- are OK but non-standard.

Kilian Foth
ChrisW

8 Answers

16

CDATA sections in XML should do.

Here's a Stack Overflow post about it: https://stackoverflow.com/questions/2784183/what-does-cdata-in-xml-mean

I remember it took me a while to understand how to use them. A DOM parser has a dedicated instruction for creating a CDATA section, but there is no equivalent statement for reading one. Reading is transparent: you read the content of the element that contains the CDATA section and the literal text is returned.

Here's an example taken from the input data file of a code scrutinizer I once made. It allows the definition of the forms of problematic code fragments using regular expressions.

<IssueBuster type="Basic" name="Suspicious lambdas" skip="true">
    <Description>
        <!-- See https://stackoverflow.com/questions/2465040/using-lambda-expressions-for-event-handlers -->
        A lambda expression is used for event handlers which inhibits unsubscribing.
    </Description>
    <Regex><![CDATA[\+\=\s*\([^\s\,]+\,\s*[^\s\)]+\)\s*=>]]></Regex>
    <SkipFileNames>
        <!-- If any of these inner texts appears in a file path, this buster will ignore that file. -->
        <FileName>SMMdataComponent\DeltaPlusGenerator\TestForm.cs</FileName>
        <FileName>Toolchain\Validate-TranslationEnums</FileName>
        <FileName>Tools\JcSimulator</FileName>
        <FileName>Tools\AR3toGps</FileName>
        <FileName>Tools\XMLConverter</FileName>
        <FileName>GitManipulator.cs</FileName>
    </SkipFileNames>
</IssueBuster>

Note that CDATA takes this form:

<![CDATA[your_literal_text]]>

Whatever you put in between the inner square brackets will be returned verbatim.

To wrap this up: in the unlikely event you have to include a ]]> sequence in the content, you can split the content after the second ] and create two consecutive CDATA sections. This can easily be implemented recursively.
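The splitting trick can be sketched in a few lines. This is only an illustration in Python (the question is about .NET, where the same idea applies); the `to_cdata` helper and the sample pattern are made up for the example, and it uses a single string replace rather than recursion.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every ']]>' across two sections."""
    # Ending one section after the second ']' and starting a new one
    # before the '>' means the terminator never appears whole inside
    # a single section.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Hypothetical pattern that happens to contain the CDATA terminator.
pattern = r"\[\[a-z]]>\s*\d+"
document = "<Regex>" + to_cdata(pattern) + "</Regex>"

# Reading is transparent: the parser joins the sections back together.
recovered = ET.fromstring(document).text
print(recovered == pattern)  # prints True
```

Only the writer has to know about the split; a reader just asks for the element text.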

Martin Maat
13

This is indeed an interesting question, as commonly the requirements for config file formats are somewhat different, so it's understandable that available formats don't really support this requirement.

If there are no other configuration data in those files, having a non-standard but easily readable and editable format is ok (of course that's just my humble opinion, there's no absolute truth here.)

CDATA sections in XML as mentioned by Martin Maat are standard, but probably a little cumbersome and error-prone when editing. You also need to think up a proper XML schema for the XML tags, as just using <Section1> and <Section2> would be counter to XML conventions when Section1 and Section2 actually have the same structure. <section name="1">...</section> would be more appropriate but tedious to type.

YAML with the pipe format might actually work and is probably good enough:

Section1:
    SubsectionA: |
        regular expression
        regular expression
    SubsectionB: |
        regular expression
        regular expression
Section2:
    (etc.)

Your app will have to split the values of the subsections (which are simply strings with embedded newlines) into lines to retrieve the regular expressions. One thing that might be difficult would be expressions with leading or trailing blanks, but that applies to any format that allows unquoted values. An advantage of YAML here would be that you have sufficient quoting mechanisms to handle this.

10

Consider TOML

It supports two forms of literal (raw) strings, in which a backslash is just a backslash:

regex    = '<\i\c*\s*>'

OR

regex2 = '''I [dw]on't need \d{2} apples'''
JimmyJames supports Canada
6

NestedText is a configuration file format that makes a point of not requiring any escaping or quoting, which makes it very good for applications like this:

# regex examples from:
# https://support.google.com/a/answer/1371417
Section1:
    SubsectionA:
        - (\W|^)stock\stips(\W|$)
        - (\W|^)stock\s{0,3}tips(\W|$)
    SubsectionB:
        - 192\.168\.1\.
        - 192\.168\.1\.\d{1,3}

Unfortunately, I don't know of any NestedText implementations available for .NET at the moment. The reference implementation is for Python 3, so you could make it work, but it would probably involve launching an external Python 3 process. Even if this isn't a useful suggestion for OP, though, I think that it could be useful for someone else with the same question.

Disclosure: I was involved in designing NestedText.

4

Roll your own

Seems a simple enough format; just write your own custom parser to deserialize from a plain text file (perhaps just like your first example) into your object model. This would require maybe a couple dozen lines of code.

You have a simple, domain-specific problem to solve; why saddle yourself with a bunch of generalized constraints and requirements to conform to some standard format? What benefit does it give you? You've already spent more time looking for an existing library than it would have taken to just write the code.
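To give an idea of the size, here is a rough sketch of such a parser. It is written in Python for brevity (the C# version would look much the same) and assumes the layout of the first example in the question with exactly four spaces per level; the `parse_indented` name and the sample patterns are made up.

```python
import re


def parse_indented(text: str) -> dict[str, dict[str, list[str]]]:
    """Parse Section / Subsection / regex lines by indentation depth."""
    config: dict[str, dict[str, list[str]]] = {}
    section = subsection = None
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith(" " * 8):
            if subsection is None:
                raise ValueError(f"regex outside a subsection: {line!r}")
            # Everything after the 8-space indent is the regex, verbatim.
            config[section][subsection].append(line[8:])
        elif line.startswith(" " * 4):
            if section is None:
                raise ValueError(f"subsection outside a section: {line!r}")
            subsection = line[4:].strip()
            config[section][subsection] = []
        else:
            section, subsection = line.strip(), None
            config[section] = {}
    return config


sample = r"""
Section1
    SubsectionA
        (\W|^)stock\stips(\W|$)
        192\.168\.1\.\d{1,3}
    SubsectionB
        ^"quoted"\s+'text'$
"""

parsed = parse_indented(sample)
for subsections in parsed.values():
    for patterns in subsections.values():
        for pattern in patterns:
            re.compile(pattern)  # every line is usable as-is
```

Because the end of the line is the only delimiter, nothing in a regex needs escaping; the one thing this layout cannot express is a regex containing a newline.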

3

Tab-separated value format (.tsv) is a simple, easy to edit text format that allows any text within a field except TAB and newline characters.

There are no escaping rules in its IANA format spec.

TSV doesn't explicitly define a hierarchy of sections, but you can put section info in the first column with the regex in the second column. E.g. the first column could contain

Section1.SubsectionA

or to avoid repeating the section name,

Section1
.SubsectionA

So here's an example file:

section<TAB>regex1<TAB>regex2
Section1<TAB><TAB>
.SubsectionA<TAB>regular expression1<TAB>regular expression2
.SubsectionB<TAB>another regular expression1<TAB>another regular expression2

(<TAB> here stands in for a plain TAB character.)

NOTE: CSV reader libraries usually have configurable delimiters to support TSV format and other variations. Or just read text lines and split each line on '\t'.

I'm not familiar with .NET but see https://stackoverflow.com/questions/17838365/how-to-read-tsv-file-using-asp-net
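For what it's worth, the "read lines and split on TAB" approach is short in any language. Here is a sketch in Python; the `parse_tsv` name is made up, and it assumes the three-column layout from the example above, with a header row and a leading `.` meaning "subsection of the most recent section".

```python
def parse_tsv(text: str) -> dict[str, tuple[str, str]]:
    """Map 'Section.Subsection' names to their pair of regexes."""
    pairs: dict[str, tuple[str, str]] = {}
    section = ""
    for line in text.splitlines()[1:]:  # skip the header row
        if not line:
            continue
        name, regex1, regex2 = line.split("\t")
        if name.startswith("."):
            name = section + name  # ".SubsectionA" -> "Section1.SubsectionA"
        elif not (regex1 or regex2):
            section = name  # a row with empty regex columns opens a section
            continue
        pairs[name] = (regex1, regex2)
    return pairs


# Built with explicit "\t" joins so the TAB characters are visible here.
rows = [
    ["section", "regex1", "regex2"],
    ["Section1", "", ""],
    [".SubsectionA", r"(\W|^)stock\stips(\W|$)", r"192\.168\.1\.\d{1,3}"],
    [".SubsectionB", r"^\s+$", r"[a-z]+,\s*[a-z]+"],
]
sample = "\n".join("\t".join(row) for row in rows)

for name, (first, second) in parse_tsv(sample).items():
    print(name, first, second)
```

Note that the split is on TAB only, so spaces, commas and quotes inside a regex pass through untouched.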

Jerry101
0

If the regex syntax does not allow embedded whitespace, an INI file is your best bet, since a value can contain any non-whitespace character.

Every other format I can think of has some characters with special meaning, which therefore need to be escaped. For example, a CDATA section in XML is terminated by the ]]> character sequence, but that sequence could be valid (if uncommon) in a regex and therefore needs escaping.

If the regex syntax allows embedded whitespace, I don't know of any mainstream format which can support this unescaped. Indeed, I wonder if it is even theoretically possible, since basically any sequence of characters could occur in a regex.

JacquesB
-1

In configuration files it is, as far as I know, nearly always necessary to escape certain characters (\n, ', ", etc.), because otherwise the configuration file parser cannot tell where the regex ends. In any syntax, the first character the parser sees after the regex could actually still be part of the regex, and a regex can end with arbitrary characters, including characters that have a special meaning for the parser. In a file that also contains other information, you probably cannot avoid that.

If you really do not want to escape anything, write each regex in a dedicated file that contains nothing but the regex, and have the configuration file specify the path to that regex file. That is very unconventional, but if it is a requirement that you never have to escape anything, you can make this a design decision.

Example:

Section1
    SubsectionA
        regexpA1.txt
        regexpA2.txt
    SubsectionB
        regexpB1.txt
        regexpB2.txt

Inside these .txt files you do not have to escape anything, but the result is obviously not easy to read when someone opens your configuration file.

anion