7

In a CI context, one commonly used measure for raising the quality of the integration branch is a mandatory set of pre-commit quality verifications (typically including building some artifacts, running unit tests and even some feature/integration tests).

Yet some regressions (build breakages, various test failures) are detected by the CI system verifications in exactly the areas which were supposed to be covered by these mandatory pre-commit verifications.

During the analysis of these regressions, an argument often heard is that the developer who committed the change identified as the root cause of the regression successfully passed all such verifications. Often the claim is supported by hard evidence indicating that:

  • after the final version of the change was reached, it was ported to a fresh workspace based on the tip of the branch
  • the required artifacts were built from scratch (so the build was totally fine, no cache-related issues, etc.)
  • all mandatory tests passed, including those covering the area in question, which should have detected the regression
  • no intermittent false positives affected the respective verifications
  • no file merges were detected when committing the change to the branch
  • none of the modified files was touched by any other change committed to the branch since the fresh workspace was pulled

Is it really possible for a software change to cause such a regression despite correctly following all the prescribed processes and practices? How?

030
Dan Cornilescu

4 Answers

5

There's one possibility I can think of: the developers work on their own workstations, sometimes with images baked for VirtualBox to run locally, while your CI infrastructure doesn't use the exact same image.

While developing a feature, a developer may need to add a JVM parameter or some other change to the middleware early in the work and then forget about it.

Before committing, all unit/integration tests run on their workstation pass, and since the baked image is shared, they pass on every other developer's system as well.

But when the change goes through CI, it fails because the change to the middleware wasn't applied there, either because the developer forgot to ask for it, or because the team in charge of updating the base images/provisioning system didn't have the time, or simply forgot, to update them.
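A minimal sketch of that failure mode (all names hypothetical, assuming a JVM-based stack since the answer talks about JVM parameters): the code only works where the extra setting is present.

    // Hypothetical example: the new feature relies on a middleware/JVM setting
    // that exists in the developers' baked image (e.g. set via CATALINA_OPTS)
    // but was never propagated to the CI image.
    public class PaymentGateway {

        // Reads a JVM system property such as -Dpayment.endpoint=https://pay.example
        static String endpoint() {
            String value = System.getProperty("payment.endpoint");
            if (value == null) {
                // Never reached on the dev image, where the property is baked in;
                // on the out-of-date CI image this throws and the "verified"
                // change breaks the pipeline.
                throw new IllegalStateException("payment.endpoint is not configured");
            }
            return value;
        }

        public static void main(String[] args) {
            System.out.println("Using endpoint: " + endpoint());
        }
    }

Locally, running with -Dpayment.endpoint=... always works because the shared image sets it; on a CI agent built from an older image the property is absent and the exact same code fails.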

It's a good thing that it breaks in CI, because it tells you early, before going into production, that the system won't work as expected, but it can sometimes be hell to track down the missing parameter.

This last point argues for not rejecting commits outright, but letting CI break on a feature branch instead: that way it won't block anyone else, and the developer can fix the problem early, while the change is still fresh, rather than letting it be forgotten in the flow.

FWIW, we did exactly this here: developers had full access to the development machines, and releases in QA were failing because a parameter change had been forgotten. We moved to Chef to handle the configuration of the middleware (Tomcat nowadays), so every needed infrastructure change has to be coded somewhere and is reproduced in all environments.

Tensibai
2

Sure it is. Production is always different. Real money. Real load. Real users. Real pain. This is why it is so important to put any significant change behind a feature flag. Your deployment should not change anything. Turning on a feature is the only thing that should make significant changes to your site.
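A minimal sketch of that idea, with hypothetical class and flag names: the deployment ships both code paths, and only flipping the flag at runtime changes what real users see.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical in-memory flag store; a real system would back this with a
    // config service or database so flags can be flipped without a redeploy.
    class FeatureFlags {
        private static final Map<String, Boolean> FLAGS = new ConcurrentHashMap<>();

        static boolean isEnabled(String name) {
            return FLAGS.getOrDefault(name, false); // new features are off by default
        }

        static void set(String name, boolean enabled) {
            FLAGS.put(name, enabled);
        }
    }

    class CheckoutService {
        String checkout() {
            // The deployment contains both paths; turning the flag on is the
            // only action that changes behaviour for users.
            if (FeatureFlags.isEnabled("new-checkout-flow")) {
                return "new checkout flow";
            }
            return "legacy checkout flow";
        }
    }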

2

The breakage is always theoretically possible, because the pre-commit verification performed by the developer is done in isolation and thus can't take into account other in-flight changes being verified in parallel. Such changes can interfere with each other and cause regressions without having any collisions detectable at the SCM level.

A simple example of such interfering changes:

Let's assume the code in the latest version of a project branch includes a certain function, defined in one file and invoked in a couple of other files. Two developers working in parallel on that project are preparing to make some changes to the code.

Developer A reworks that function, removing or adding a mandatory argument, and, of course, updates all invocations of the function in all the associated files to match the updated definition.

Developer B decides to add an invocation of said function in a file which didn't contain any such invocation before and is thus not touched by developer A's changes. Of course, developer B fills in the argument list to match the function's definition visible in the latest version of the branch, which is the old definition, as developer A's changes aren't committed yet.

Both developers correctly perform the pre-commit verifications, get a pass result, and proceed to commit their code changes. Since the two changesets do not touch the same files, no final merge happens, which would typically be an indication of potential problems warranting a closer look and maybe a re-execution of the pre-commit verification. Nothing whatsoever gives even a subtle hint that something may go wrong.

Yet the end result is catastrophic: the build is broken, as the function call added by developer B doesn't match the function definition updated by developer A.
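A concrete sketch of the two changesets (class and method names are hypothetical): each one compiles against the branch tip its author started from, yet their combination does not.

    // Branch tip before either change (Util.java):
    //     static int price(int amount) { return amount * 100; }

    // Developer A's changeset: touches Util.java plus the files that already
    // called price(), updating every call site to the new signature.
    class Util {
        // A mandatory 'currency' argument was added.
        static int price(int amount, String currency) {
            return "EUR".equals(currency) ? amount * 100 : amount * 110;
        }
    }

    // Developer B's changeset: touches only Invoice.java, a file with no
    // previous call to price(), so it never collides with A's files in the SCM.
    class Invoice {
        int total(int amount) {
            // Written against the one-argument definition B saw on the branch
            // tip; once both commits land, this call no longer compiles.
            return Util.price(amount);
        }
    }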

Dan Cornilescu
1

When you find this type of issue, you should write some new, extremely quick-running acceptance tests that can flush out these issues, and add them to the build verification tests that run prior to your integration tests. You should constantly be shifting left and trying to shorten the feedback loop to the developers committing changes. If you cannot find a way to do this, perhaps your architecture isn't as agile as it needs to be.
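For instance (a hypothetical JUnit 5 test, reusing the missing-middleware-setting scenario from another answer), a check like this runs in milliseconds and can live in the build verification stage instead of waiting for the slower integration suite:

    import static org.junit.jupiter.api.Assertions.assertNotNull;

    import org.junit.jupiter.api.Test;

    // Hypothetical "extremely quick" check promoted into build verification:
    // it fails fast when a required middleware setting is missing, rather
    // than letting a much slower integration test discover it later.
    class RequiredConfigurationSmokeTest {

        @Test
        void paymentEndpointIsConfigured() {
            assertNotNull(System.getProperty("payment.endpoint"),
                    "payment.endpoint must be set in this environment");
        }
    }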

@Dan Cornilescu - your scenario is valid for tightly coupled architectures, which is why loosely coupled architectures (microservices with versioned RESTful APIs) have emerged as the current best practice in high-performing organizations. That matrix of services brings other complexities to overcome, though.

Sometimes you need to refactor your entire architecture to overcome issues such as these. I believe both Google and eBay have completely rearchitected their platforms five times (over a span of something like 10 years) because of constraints their previous architectures imposed on them.

icewav