
Background: I'm writing microcontroller C code to write an EBML file. EBML is like a binary XML with nested elements, but instead of start and end tags, each element has a start ID, a length, and then the data. I am writing this into external flash in a low-power application, so I'd like to keep the flash accesses to a minimum. Memory is also limited, because nothing is ever easy.

When I can keep the whole EBML element in memory, then generating it is easy because I can go back and fill in the length of each element after I know what that length is. The problem is what to do when I can't hold the whole element in memory. The options I see are:

  • Write what I know, then go back and add in the lengths (easiest, but adds more flash access than I want)
  • Calculate each element's length before I start writing it (relatively easy, but a lot of processor time)
  • Switch modes once my memory fills up, so that I then continue through the data, but only to calculate the lengths for elements already reserved in memory. Then write what I have in memory, and go back and continue processing the data from where I left off. (My favorite option so far)
  • Give elements a maximum or worst case length when they need to be written and their final length is not yet known. (Easier than above, but could backfire and waste space)
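For reference, the second option (computing lengths up front) could be sketched roughly as below. This is only an illustration: the `elem` struct and the function names are hypothetical, and it assumes the sizes of the leaf payloads are known or cheap to compute without holding the data itself in memory. The width rule follows the EBML variable-length size encoding, where an n-byte field carries 7n data bits and the all-ones pattern is reserved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for a minimal-width EBML size field holding `value`.
   A width-n field has 7*n data bits; all ones is reserved for
   "unknown size", so the largest usable value is 2^(7n) - 2. */
static unsigned size_field_width(uint64_t value)
{
    for (unsigned n = 1; n < 8; n++) {
        if (value <= (UINT64_C(1) << (7 * n)) - 2)
            return n;
    }
    assert(value <= (UINT64_C(1) << 56) - 2);
    return 8;
}

/* Hypothetical description of one element (not part of any EBML library). */
typedef struct elem {
    unsigned id_len;               /* bytes in the element ID            */
    uint64_t data_len;             /* leaf payload bytes; 0 for a master */
    const struct elem *children;   /* child elements, if any             */
    size_t n_children;
} elem;

static uint64_t encoded_size(const elem *e);

/* Payload = own data plus the full encoded size of every child. */
static uint64_t payload_size(const elem *e)
{
    uint64_t total = e->data_len;
    for (size_t i = 0; i < e->n_children; i++)
        total += encoded_size(&e->children[i]);
    return total;
}

/* Full size on flash = ID + size field + payload. */
static uint64_t encoded_size(const elem *e)
{
    uint64_t p = payload_size(e);
    return e->id_len + size_field_width(p) + p;
}
```

Calling `payload_size()` on an element gives the value to put in its size field before any of its children are written. The cost is the extra walk over the element tree, which is the processor time mentioned above.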

Question: It seems like this should be a relatively common issue that people have thought about. I know it can also happen when forming some data packets. Is there a better / more common / more accepted technique I'm missing here? Or just some terms for the issue that I can search for?

3 Answers


If you do not know how long your payload will be, that is rarely cause for worry even if you cannot remember the position and backfill the length later:

Just note down "unknown size".

That feature depends on the payload consisting of EBML elements, and on the element that follows not being a valid child element, though.
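As a rough illustration of what "unknown size" looks like on the wire: an EBML size field is a variable-length integer whose first byte has a marker bit indicating the width, and a field with all data bits set to one is reserved to mean "unknown". The helper names below are made up for this sketch, and the EBML specification remains the authority on which elements may use an unknown size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write `value` as a minimal-width EBML size field; returns bytes written.
   `out` must have room for up to 8 bytes. */
static size_t ebml_put_size(uint64_t value, uint8_t *out)
{
    unsigned n = 1;

    /* All-ones data is reserved, so width n holds at most 2^(7n) - 2. */
    while (n < 8 && value > (UINT64_C(1) << (7 * n)) - 2)
        n++;
    assert(value <= (UINT64_C(1) << 56) - 2);

    /* Set the width marker bit just above the 7*n data bits. */
    uint64_t v = value | (UINT64_C(1) << (7 * n));

    /* Emit big-endian. */
    for (unsigned i = 0; i < n; i++)
        out[i] = (uint8_t)(v >> (8 * (n - 1 - i)));
    return n;
}

/* Write the reserved "unknown size" pattern using `width` bytes (1..8). */
static size_t ebml_put_unknown_size(unsigned width, uint8_t *out)
{
    assert(width >= 1 && width <= 8);
    out[0] = (uint8_t)(0xFFu >> (width - 1));  /* marker bit + all-ones data */
    for (unsigned i = 1; i < width; i++)
        out[i] = 0xFF;
    return width;
}
```

With the one-byte form, the whole size field is the single byte `0xFF`, so opening an element of unknown length costs one byte and no later seek.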

If you want, you can later canonicalize the resulting EBML offline at your convenience any way you want, for example to "no unknown sizes, minimal size" or to "minimal size, avoid unknown sizes".


Refer to the EBML RFC Draft on matroska.org for the details.

Deduplicator

If a single element with a fixed number of subelements is too large, then perhaps you should try to split it up in the schema. I don't know this format, but most probably you can define a maximum length in it.

For sequences, you could define a maximum count of subelements and "stream" the remainder into the next file.

For elements that may exceed the maximum memory size, prepare a stack of pairs: the location of the reserved length field and a length counter. On pop, save the current counter at the current marker and add its value to the next counter.
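A minimal sketch of that stack, under a few assumptions of mine: the `flash` array is a RAM stand-in for the real device, every length field is reserved at a fixed width of 4 bytes (EBML allows a size to be stored in a wider field than strictly needed), and all names and limits here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_DEPTH 8   /* hypothetical nesting limit                  */
#define LEN_WIDTH 4   /* bytes reserved for every length field       */

static uint8_t flash[256];   /* RAM stand-in for the external flash  */
static size_t  flash_pos;    /* next write position                  */

typedef struct {
    size_t   len_pos;  /* where this element's length field was reserved */
    uint32_t count;    /* payload bytes counted so far                   */
} open_elem;

static open_elem stack[MAX_DEPTH];
static int depth;

/* Append bytes and count them toward the innermost open element. */
static void put_bytes(const void *src, size_t n)
{
    assert(flash_pos + n <= sizeof flash);
    memcpy(&flash[flash_pos], src, n);
    flash_pos += n;
    if (depth > 0)
        stack[depth - 1].count += (uint32_t)n;
}

/* Write the ID, reserve the length field, and push a new pair. */
static void open_element(const uint8_t *id, size_t id_len)
{
    assert(depth < MAX_DEPTH);
    put_bytes(id, id_len);                   /* ID counts toward the parent */
    assert(flash_pos + LEN_WIDTH <= sizeof flash);
    if (depth > 0)
        stack[depth - 1].count += LEN_WIDTH; /* so does the length field    */
    stack[depth].len_pos = flash_pos;
    stack[depth].count = 0;
    flash_pos += LEN_WIDTH;                  /* leave the field unwritten   */
    depth++;
}

/* Pop: backfill the length, then add it to the parent's counter. */
static void close_element(void)
{
    assert(depth > 0);
    open_elem e = stack[--depth];
    assert(e.count <= 0x0FFFFFFEu);          /* largest 4-byte size value   */
    uint32_t v = e.count | 0x10000000u;      /* 4-byte width marker bit     */
    flash[e.len_pos + 0] = (uint8_t)(v >> 24);
    flash[e.len_pos + 1] = (uint8_t)(v >> 16);
    flash[e.len_pos + 2] = (uint8_t)(v >> 8);
    flash[e.len_pos + 3] = (uint8_t)v;
    if (depth > 0)
        stack[depth - 1].count += e.count;
}
```

Memory use is one pair per nesting level, regardless of how much data an element contains. Each close still costs one extra flash write to backfill the length, so this limits RAM rather than flash accesses.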

In general, try to minimize the number of oversized elements.

Whoot

KISS and YAGNI.
Choose option #1, and revisit it only if it becomes a real problem.

At least for similar use cases with similar binary formats, when only a couple of values had to be filled in this way, this was the simplest and best solution. If you have to do this for each and every chunk of data, then it might be a flaw in the architecture.

Kromster