12

Occasionally there is that 1% of code which is computationally intensive enough that it needs the heaviest kind of low-level optimization. Examples are video processing, image processing, and all kinds of signal processing in general.

The goals are to document and to teach the optimization techniques, so that the code does not become unmaintainable and prone to removal by newer developers. (*)

(*) Notwithstanding the possibility that the particular optimization will be completely useless on some unforeseeable future CPU, such that the code gets deleted anyway.

Considering that software offerings (commercial or open-source) retain their competitive advantage by having the fastest code and making use of the newest CPU architectures, software writers often need to tweak their code to make it run faster while producing the same output for a given task, whilst tolerating a small amount of rounding error.

Typically, a software writer can keep many versions of a function as documentation of each optimization / algorithm rewrite that has taken place. How does one make these versions available for others to study their optimization techniques?

rwong
  • 17,140

4 Answers

10

Short answer

Keep optimisations local, make them obvious, document them well and make it easy to compare the optimised versions with each other and with the unoptimised version, both in terms of source code and run-time performance.

Full answer

If such optimisations really are that important to your product, then you need to know not only why the optimisations were useful before, but also provide enough information to help developers know whether they will be useful in the future.

Ideally, you need to enshrine performance testing into your build process, so you find out when new technologies invalidate old optimisations.

Remember:

The First Rule of Program Optimisation: Don't do it.

The Second Rule of Program Optimisation (for experts only!): Don't do it yet.

— Michael A. Jackson

Knowing whether now is the time requires benchmarking and testing.
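
For instance, a minimal sketch of a benchmark gate that could run as part of the test suite; the two function names, the input size and the pass/fail rule are invented for illustration, not part of any particular build system:

/* perf_check.c - toy benchmark gate, run from the test suite.
 * The two implementations are assumed to share one signature;
 * both names are hypothetical and must be linked in separately. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void transform_baseline(float *data, size_t n);   /* unoptimised reference */
void transform_optimised(float *data, size_t n);  /* hand-tuned version    */

static double time_one(void (*fn)(float *, size_t), float *data, size_t n)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn(data, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    enum { N = 1 << 20 };
    float *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
    for (size_t i = 0; i < N; i++) a[i] = b[i] = (float)i;

    double t_base = time_one(transform_baseline, a, N);
    double t_opt  = time_one(transform_optimised, b, N);
    printf("baseline %.3fs, optimised %.3fs\n", t_base, t_opt);

    /* Fail the build when the optimisation no longer pays for itself. */
    return (t_opt < t_base) ? EXIT_SUCCESS : EXIT_FAILURE;
}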

As you mention, the biggest problem with highly optimised code is that it is difficult to maintain so, as far as possible, you need to keep the optimised portions separate from the unoptimised portions. Whether you do this through compile time linking, runtime virtual function calls or something in between shouldn't matter. What should matter is that when you run your tests, you want to be able to test against all of the versions you are currently interested in.

I would be inclined to build the system in such a way that the basic, unoptimised version of the production code can always be used to understand the intent of the code, then build different optimised modules alongside it, each containing an optimised version and explicitly documenting wherever it differs from the baseline. When you run your tests (unit and integration), you run them on the unoptimised version and on all current optimised modules.

Example

For instance, let's say you have a Fast Fourier Transform function. Maybe you have a basic, algorithmic implementation in fft.c and tests in fft_tests.c.

Then along comes the Pentium and you decide to implement a fixed-point version in fft_mmx.c using MMX instructions. Later the Pentium III comes along and you decide to add a version which uses Streaming SIMD Extensions in fft_sse.c.

Now you want to add CUDA, so you add fft_cuda.c, but find that with the test dataset that you've been using for years, the CUDA version is slower than the SSE version! You do some analysis and end up adding a dataset that's 100 times bigger and you get the speed-up you expect, but now you know that the set-up time for using the CUDA version is significant and that with small datasets you should use an algorithm without that set-up cost.

In each of these cases you are implementing the same algorithm; all versions should behave in the same way, but will run with differing efficiency and speed on different architectures (if they run at all). From the code point of view, you can compare any pair of source files to find out why the same interface is implemented in different ways, and usually the easiest approach will be to refer back to the original unoptimised version.
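
As a sketch of how those versions might sit side by side behind one interface so that the test suite can iterate over all of them, here is a hypothetical layout; the header name, the cpx type, the HAVE_* macros and the implementation table are assumptions, not the layout of any real library:

/* fft.h - one public interface, several interchangeable implementations */
#include <stddef.h>

typedef struct { float re, im; } cpx;

void fft_basic(cpx *data, size_t n);   /* fft.c      - readable reference      */
void fft_mmx(cpx *data, size_t n);     /* fft_mmx.c  - MMX fixed-point version */
void fft_sse(cpx *data, size_t n);     /* fft_sse.c  - SSE version             */
void fft_cuda(cpx *data, size_t n);    /* fft_cuda.c - GPU version             */

/* fft_tests.c - the suite walks this table, so every version that was
 * compiled in is checked against fft_basic and timed on every run.    */
struct fft_impl { const char *name; void (*run)(cpx *, size_t); };

static const struct fft_impl fft_impls[] = {
    { "basic", fft_basic },
#ifdef HAVE_MMX
    { "mmx",   fft_mmx   },
#endif
#ifdef HAVE_SSE
    { "sse",   fft_sse   },
#endif
#ifdef HAVE_CUDA
    { "cuda",  fft_cuda  },
#endif
};

A test program can then loop over fft_impls, feeding each implementation the same input and comparing both its output (within a small tolerance) and its run time against the basic version.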

All of the same goes for an OOP implementation, where a base class implements the unoptimised algorithm and derived classes implement the different optimisations.

The important thing is to keep the parts that are the same identical, so that the differences are obvious.

Mark Booth
  • 14,352
7

Specifically, since you have taken the example of video and image processing, one can keep the code as part of the same version, but active or inactive depending on the context.

While you haven't mentioned a language, I am assuming C here.

The simplest way to do this in C code (and it also applies when trying to make things portable) is to keep the alternative versions behind preprocessor conditionals:

 
#ifdef OPTIMIZATION_XYZ_ENABLE
   // your optimized code here...
#else
   // your basic code here...
#endif

When you enable #define OPTIMIZATION_XYZ_ENABLE during compilation (e.g. by passing -DOPTIMIZATION_XYZ_ENABLE in the Makefile), everything works accordingly.

Usually, switching a few lines of code in the middle of functions becomes messy when too many functions are optimized. Hence, in this case one defines different function pointers to perform a specific operation.

The main code always executes through a function pointer, like


   codec->computed_idct(blocks); 

But the function pointers are assigned depending on the build or platform (e.g. here the IDCT function is optimized for different CPU architectures):



if (OPTIMIZE_X86) {
  codec->computed_idct = compute_idct_x86;   /* x86-specific version */
}
else if (OPTIMIZE_ARM) {
  codec->computed_idct = compute_idct_ARM;   /* ARM-specific version */
}
else {
  codec->computed_idct = compute_idct_C;     /* portable C fallback  */
}
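
A slightly fuller, self-contained sketch of the surrounding plumbing; the struct shape, the 64-coefficient block size and the selection macros are assumptions for illustration (the selection is shown at compile time here, but the same assignment can equally be made at run time after CPU detection):

/* codec.c - the codec object carries function pointers, chosen once at
 * start-up, so the hot loop never branches on the architecture. */
#include <stdint.h>

typedef struct codec {
    void (*computed_idct)(int16_t blocks[64]);
} codec_t;

void compute_idct_C(int16_t blocks[64]);     /* portable reference   */
void compute_idct_x86(int16_t blocks[64]);   /* x86-specific version */
void compute_idct_ARM(int16_t blocks[64]);   /* ARM-specific version */

void codec_init(codec_t *codec)
{
#if defined(OPTIMIZE_X86)
    codec->computed_idct = compute_idct_x86;
#elif defined(OPTIMIZE_ARM)
    codec->computed_idct = compute_idct_ARM;
#else
    codec->computed_idct = compute_idct_C;
#endif
}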

You should look at the libjpeg and libmpeg2 code, and maybe ffmpeg, for such techniques.

Dipan Mehta
  • 10,612
6

I believe this to be best solved through comprehensive commenting of the code, to the point where each significant block of code has explanatory commenting beforehand.

The comments should include citations to the specifications or hardware reference material.

Use industry-wide terminology and algorithm names where appropriate - e.g. 'architecture X generates CPU traps for unaligned reads, so this Duff's Device fills to the next alignment boundary'.

I would use in-your-face variable naming to ensure there is no misunderstanding of what is going on. Not Hungarian notation, but things like 'stride' to describe the distance in bytes between two vertically adjacent pixels.
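
For example, a sketch of the kind of naming and commenting meant here; the function, the padding scheme and the manual reference are invented for illustration:

#include <stddef.h>
#include <stdint.h>

/* Copy one rectangular region of an 8-bit image.
 * 'stride' is the distance in bytes between two vertically adjacent
 * pixels; it can be larger than 'width' because each row is padded to
 * the next 16-byte boundary, since architecture X generates CPU traps
 * for unaligned reads (see the CPU vendor's architecture manual). */
void copy_rect(uint8_t *dst, const uint8_t *src,
               size_t width, size_t height, size_t stride)
{
    for (size_t y = 0; y < height; y++) {
        const uint8_t *s = src + y * stride;  /* start of source row y      */
        uint8_t *d = dst + y * stride;        /* start of destination row y */
        for (size_t x = 0; x < width; x++)
            d[x] = s[x];
    }
}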

I would also supplement this with a short, human-readable document containing high-level diagrams and the block design.

JBRWilkinson
  • 6,769
6

As a researcher I end up writing quite a bit of the "bottleneck" code. However, once it is taken into production, the onus of integrating it into the product and providing subsequent support falls to the developers. As you can imagine, clearly communicating what the program is supposed to do and how it is supposed to operate is of the utmost importance.

I have found that there are three essential ingredients in completing this step successfully:

  1. The algorithm used must be absolutely clear.
  2. The purpose of every line of implementation must be clear.
  3. Deviations from expected results must be identified as soon as possible.

For the first step, I always write a short whitepaper that documents the algorithm. The aim here is to actually write it up so that another person can implement it from scratch using only the whitepaper. If it's a well-known, published algorithm it's enough to give the references and to repeat the key equations. If it's original work, you will need to be quite a bit more explicit. This will tell you what the code is supposed to do.

The actual implementation that is handed off to development must be documented in such a manner that all the subtleties are rendered explicit. If you acquire locks in a particular order to avoid deadlock, add a comment. If you iterate over the columns instead of over the rows of a matrix because of cache-locality issues, add a comment. If you do anything even slightly clever, comment it. If you can guarantee that the whitepaper and the code will never be separated (through a VCS or similar system), you can refer back to the whitepaper. The result can easily be over 50% comment. That's all right. This will tell you why the code does what it does.

Finally, you need to be able to guarantee correctness in the face of changes. Fortunately, we have a handy tool in automated testing and continuous integration platforms. These will tell you what the code is actually doing.
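
For instance, a minimal sketch of such a check, run on every commit; the filter names, the fixed seed and the tolerance are hypothetical, and the epsilon is where the question's tolerated rounding error becomes explicit:

/* regression_test.c - compare the shipped routine against the slow,
 * obviously correct reference that matches the whitepaper. */
#include <assert.h>
#include <math.h>
#include <stdlib.h>

void filter_reference(const float *in, float *out, size_t n);  /* matches the whitepaper */
void filter_optimised(const float *in, float *out, size_t n);  /* shipped implementation */

int main(void)
{
    enum { N = 4096 };
    float in[N], a[N], b[N];
    srand(12345);                               /* fixed seed: reproducible runs */
    for (size_t i = 0; i < N; i++) in[i] = (float)rand() / RAND_MAX;

    filter_reference(in, a, N);
    filter_optimised(in, b, N);

    for (size_t i = 0; i < N; i++)              /* allow small rounding differences */
        assert(fabsf(a[i] - b[i]) <= 1e-5f);

    return 0;
}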

My most hearty recommendation would be not to skimp on any of these steps. You will need them later ;)

drxzcl
  • 161