
I am a lecturer for a post-graduate module where I expect my students to write Python code that replicates examples from the textbook. The project has been running for a couple of years, and this year I want to introduce more automated testing. The problem is writing tests for functions whose primary output is a matplotlib figure. Note that we are only trying to replicate the figures approximately, so binary comparison against a target image won't be very useful.

The short question is therefore: what strategies can I use to test programs whose primary graphical output cannot be compared against a strict reference image?

Some of the problems with the existing code base that prevent automated testing in this case:

  • Producing the graph stops execution in many cases
  • It is not practical to generate a copy of the figure in the textbook that includes annotations and so on, so algorithmic image comparison is unlikely to be the answer.

To clarify my goals:

  1. I would like to be able to execute all the code in the codebase to check that it actually runs, even if that means throwing away the output. This would catch regressions introduced when a function is changed.
  2. Instead of investing heavily in fuzzy matching of the graphical output against the target, I believe a visual check of the generated image against a reference image is probably going to be the simplest approach, but this should be deferred to a single pass at the end of the run rather than interrupting the run itself (see the sketch after this list).
  3. Since this is a collaborative project, I don't have to assume that the students are going to be adversarial. Mistakes will be good-faith rather than perverse.
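
To make goals 1 and 2 concrete, the kind of harness I have in mind looks roughly like the sketch below. This is only an illustration, assuming pytest and matplotlib's non-interactive Agg backend; chapter3 and plot_example are placeholders for a student's module and function, and the figure_review directory name is arbitrary.

    # A minimal sketch: switch to the non-interactive Agg backend so that
    # showing a figure no longer blocks, run each plotting function as a
    # smoke test, and save whatever it drew for a single visual pass later.
    import matplotlib
    matplotlib.use("Agg")  # select the backend before pyplot is used

    import matplotlib.pyplot as plt
    import pytest
    from pathlib import Path

    OUTPUT_DIR = Path("figure_review")  # reviewed by eye once, after the run


    @pytest.fixture
    def save_figures(request):
        """After each test, save any open figures to disk and close them."""
        OUTPUT_DIR.mkdir(exist_ok=True)
        yield
        for num in plt.get_fignums():
            plt.figure(num).savefig(OUTPUT_DIR / f"{request.node.name}_{num}.png")
        plt.close("all")


    def test_plot_example_runs(save_figures):
        from chapter3 import plot_example  # hypothetical student code
        plot_example()                     # any exception fails the test
        assert plt.get_fignums(), "expected the function to create a figure"

(The fixture would normally live in conftest.py; it is inlined here to keep the sketch self-contained.)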
Doc Brown

2 Answers


This depends on what the context of such tests is.

In development, testing is usually a barrier against negligence, code rot, or misunderstandings. Software gets very complex, but normally it at least "fights fair": it doesn't deliberately try to subvert your intentions with deliverables that are useless but pass the test on a technicality.

When you deal with students programming for grades, this is a real possibility. Therefore, the balance of pressures shifts subtly against automated testing, particularly when the requirement is not some hard-and-fast mathematical property but something as vague as "replicate this figure approximately". There are, of course, various approximations of the goal criterion that may be useful: count the number of black and white pixels, detect edges and compare their structure, or perform a Fourier transform and judge similarity in frequency space rather than on the pixels themselves.
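
For illustration, a rough sketch of the frequency-space variant, assuming both images have already been loaded as grayscale NumPy arrays of the same shape; the acceptance threshold is something you would have to calibrate against a few known-good and known-bad submissions:

    import numpy as np

    def spectral_distance(reference, candidate):
        """Mean difference of the log-magnitude spectra of two images."""
        ref_spec = np.log1p(np.abs(np.fft.fft2(reference)))
        cand_spec = np.log1p(np.abs(np.fft.fft2(candidate)))
        return float(np.mean(np.abs(ref_spec - cand_spec)))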

However, it may be that looking at the results with human eyes and classifying them manually is actually more efficient. Think of it as exploiting the amazing hardware acceleration that the human brain provides for pattern recognition.

Kilian Foth

If the result cannot match the original perfectly, you'll have to do some fuzzy matching.

One very simple measure would be to use the original picture as the gold standard: subtract your own result from it pixel by pixel, take the absolute difference, and perhaps average over the whole picture. That should give you an idea of how well at least the shape of things fits the original.
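
A minimal sketch of that metric, assuming both pictures are already grayscale NumPy arrays of the same shape with values scaled to [0, 1]:

    import numpy as np

    def mean_abs_difference(golden, result):
        """0.0 means identical pictures, 1.0 means maximally different."""
        return float(np.mean(np.abs(golden.astype(float) - result.astype(float))))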

A tweak on this method would be to resize both images until their sizes match. Another would be to resize both to a much smaller size and compare those much more strictly, to capture just the overall structure of the original.
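
Something along those lines, using Pillow to shrink both pictures to a small common size before comparing; the size and threshold here are arbitrary placeholders you would have to tune:

    import numpy as np
    from PIL import Image

    def structural_match(golden_path, result_path, size=(64, 64), threshold=0.1):
        """Downscale both images to the same small size, then compare strictly."""
        def load_small(path):
            img = Image.open(path).convert("L").resize(size)  # grayscale, common size
            return np.asarray(img, dtype=float) / 255.0
        diff = np.mean(np.abs(load_small(golden_path) - load_small(result_path)))
        return diff <= threshold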

I'd also second @KilianFoth's idea of comparing in other spaces such as the frequency domain, but be very careful that you aren't comparing apples and oranges in such cases, for example when ignoring low-frequency ("noise") data.

As a general tip, you have to have something the results can be compared against if you want to be able to test them. Whatever that is - a textual description, a function, or an image - you have to use it as the goal for the testing.

l0b0