1

I need to design a code generator system that produces serialisation/deserialisation libraries for multiple languages running on multiple platforms.

Hard Constraints

I need this to capture a pre-existing binary serialisation format bit-for-bit. It cannot add frames, versions, etc. to the stream; therefore I believe solutions such as Google Protobuf are ruled out.

I can use open-source dependencies but the message definitions are closed-source.

I need to capture 20-some structure definitions that already exist but are currently non-portable.

I need to support C, C++, C# and Python targets.

I need to support structures that have the following C-like elements:

  • fixed-length arrays;
  • sub-structures;
  • C-like primitives for byte, int, char, float, double, enum

I need these library artefacts to be generated at compile time.

Whether in-library or out-of-library, I need to be able to serialise and deserialise these structures to pipe- or socket-like streams.

Soft Constraints

Especially for the C/C++ targets, there should be very little overhead during the serialisation/deserialisation phase. In the ideal case, the output would simply be structure definitions with explicit pack/align attributes. I can assume that each structure is guaranteed to fit in memory.
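As an illustration of the "explicit pack/align" goal (field names and layout invented here, since the real definitions are closed-source): the same idea of a fixed, padding-free layout can be expressed with Python's stdlib `struct` module, whose format strings pin down a packed little-endian layout bit-for-bit, much like a packed C struct would.

```python
import struct

# Hypothetical message layout -- the real definitions are closed-source.
# "<" means little-endian with no padding, i.e. the equivalent of a
# packed C struct:  int32 id; double position[3]; uint8 flags; char tag[4];
SAMPLE_FMT = "<i3dB4s"

def serialise(msg_id, position, flags, tag):
    """Produce exactly the wire bytes, no frames or version headers added."""
    return struct.pack(SAMPLE_FMT, msg_id, *position, flags, tag)

def deserialise(buf):
    msg_id, px, py, pz, flags, tag = struct.unpack(SAMPLE_FMT, buf)
    return msg_id, (px, py, pz), flags, tag

# The size is fully determined by the format string: 4 + 3*8 + 1 + 4 = 33.
assert struct.calcsize(SAMPLE_FMT) == 33
```

In C/C++ the analogous artefact would be a struct definition with explicit pack attributes, read and written directly; the point of the sketch is that the on-stream layout is captured declaratively in one place.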

It would be nice to use an off-the-shelf solution for this, but if no such solution exists then I am (reluctantly) open to hacking up my own.

The central protocol definitions should use some pre-existing, well-understood markup format like XML. Alternatively, if C# is the first-class platform, then I would be open to a system that annotates the C# structure versions and produces other language bindings from there.
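To make the XML-centred option concrete, here is a minimal sketch, assuming an invented schema (element and type names are hypothetical, not from any existing tool): a message definition in XML is parsed and lowered to a packed-layout description, which a generator could then render per target language.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical XML message definition -- schema invented for illustration.
XML_DEF = """
<message name="Telemetry">
  <field name="id" type="int32"/>
  <field name="position" type="float64" count="3"/>
  <field name="flags" type="uint8"/>
</message>
"""

# Mapping from definition types to struct-module format codes (assumed).
TYPE_CODES = {"int32": "i", "float64": "d", "uint8": "B"}

def format_for(xml_text):
    """Lower an XML message definition to a packed little-endian layout."""
    root = ET.fromstring(xml_text)
    fmt = "<"  # little-endian, no padding
    for field in root.findall("field"):
        count = int(field.get("count", "1"))
        code = TYPE_CODES[field.get("type")]
        fmt += (str(count) if count > 1 else "") + code
    return fmt

assert format_for(XML_DEF) == "<i3dB"
```

The same parsed model could just as well drive emission of C struct declarations or C# attribute-annotated classes; the XML stays the single source of truth.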

It would be nice if the structure definition system had a versioning scheme (though this cannot influence the on-stream format). If this feature does not exist I will just rely on the repository's existing version control.

My solution should be data-centric and should not really have any business logic.

Pre-existing art

Best practices / Design patterns for code generation has some hints but is rather too vague.

Writing an API for a hardware device for multiple platforms seems like it's more routine-focused than data-focused.

I cannot find any other questions that seem related enough.

Questions

What are the comparative design costs/risks to rolling this in-house versus using an off-the-shelf serialisation library; and do off-the-shelf frameworks for this kind of thing even exist?

avigt
  • 133

3 Answers

4

capture a pre-existing binary serialisation format bit-for-bit.

This means you would have to be very lucky to find a fully generic generator for this kind of serialization that suits your needs; at the very least it would need to be highly customizable. Even if such a thing exists, you need to check whether configuring it is really less complex than rolling your own hand-coded generator in a high-level language like C# or Python.*

*This answer mentions Kaitai, it might be worth a look.

I would be open to a system that annotates the C# structure versions and produces other language bindings

Well, that is what we did in the past for generating comparable code from a single source to different targets:

  • C# classes with annotated attributes/properties (but no methods) as input, compiled into an assembly

  • use reflection to iterate over the classes and their properties in the assembly, then generate whatever you want from it.

This saves you the hassle of writing a parser for the input data, and it will allow you to add any kind of annotations you need. It is a useful and effective approach when

  • the code for the generator as well as the input C# code is developed by the same team
  • annotated C# is descriptive enough for your requirements
  • the team is fine with using C# code as input

In fact, if I were in your shoes, I would give this approach a try and see where it gets you. If you end up with something between 1000 and 2000 lines of code for the generator (as we did), there is no huge saving potential left, especially once you factor in the extra effort to find, evaluate, choose, learn and test some template-based 3rd-party system that only partly supports what you need (and where you would still have to make custom adaptations so the output matches your requirements).
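The answer describes this with C# reflection; as an analogous sketch in Python (the message class, type mapping and emitter are all hypothetical), the same pattern is: annotate plain data classes, introspect their fields, and emit whatever target code you need.

```python
import dataclasses

# Hypothetical annotated message -- stands in for the C# input classes.
@dataclasses.dataclass
class Telemetry:
    id: int
    flags: int

# Assumed mapping from source types to C types; extend per primitive.
C_TYPES = {int: "int32_t"}

def emit_c_struct(cls):
    """Introspect a dataclass (the Python analogue of C# reflection)
    and emit a packed C struct declaration for it."""
    lines = ["typedef struct __attribute__((packed)) {"]
    for f in dataclasses.fields(cls):
        lines.append(f"    {C_TYPES[f.type]} {f.name};")
    lines.append(f"}} {cls.__name__};")
    return "\n".join(lines)

print(emit_c_struct(Telemetry))
```

As in the C# version, the win is that the language's own introspection replaces a hand-written parser for the definition files.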

Let me finally add that all of your platforms may require some structure-independent low-level code for reading and writing binary data. If that is the case, and you think this part alone is complex enough to justify a common library, I would write this library in C, since C libraries can be consumed by all the other platforms you mentioned.
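The answer suggests writing that shared layer in C; the core of it is the same in any language, namely that pipes and sockets may return short reads, so "read exactly n bytes" needs a loop. A minimal sketch in Python (the helper name is invented; in C the equivalent loop wraps `read(2)`/`write(2)`):

```python
import io

def read_exact(stream, n):
    """Read exactly n bytes from a pipe- or socket-like stream,
    looping over short reads; raise on premature end-of-stream."""
    buf = bytearray()
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise EOFError("stream closed mid-message")
        buf.extend(chunk)
    return bytes(buf)

# With fixed-size, unframed messages, the reader always knows n in advance.
assert read_exact(io.BytesIO(b"hello!"), 5) == b"hello"
```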

Doc Brown
  • 218,378
1

Kaitai Struct (https://kaitai.io/) does half of this, but it doesn't emit writer code. Maybe you could use it to solve half of your problem, so that you only have to write the other half? Serialization is usually the easier half, too.

Python Construct does writers as well, but only in Python.

There's a good summary of some tools in this space.

pjc50
  • 15,223
1

If you're starting from XML, the T4 support in Visual Studio may be useful if you want to roll your own generated code. As far as I know, the logic parts of T4 files have to be in a .NET language (VB and C# are both supported), but I don't think there's any requirement for the generated output to be in a particular language, since T4 is ultimately a text-templating system. You could keep the logic that builds a model of the source information in a common file, then have an individual template for each destination language consume that model.

I've found T4 to work very well for my purposes where I wanted to have a single source of truth used to generate boilerplate source code.
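The shape of that workflow (one model, one template per target) can be sketched outside T4 as well; here is a hedged illustration using Python's stdlib `string.Template`, with an invented one-field model:

```python
from string import Template

# A shared model built once from the source definitions (values invented).
model = {"name": "Telemetry", "c_fields": "int32_t id;", "py_fields": "id: int"}

# One template per destination language, all consuming the same model.
c_tmpl = Template("typedef struct { $c_fields } $name;")
py_tmpl = Template("class $name:\n    $py_fields")

assert c_tmpl.substitute(model) == "typedef struct { int32_t id; } Telemetry;"
```

In T4 the templates would be `.tt` files and the model-building logic would be .NET code, but the division of responsibilities is the same.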

Craig
  • 111