3

"In my company, we have this distributed system over Kubernetes. Some microservices are shared among all customers, and others are up to them to upgrade. This system has to interact with A LOT of customer's services (VPNs, private APIs, databases, log aggregators, etc.), and each customer can have wildly different environments.

We follow a very common software development lifecycle:

  • Our "engineering teams" work on features or bug fixes; once done
  • Our QA teams test the changes to ensure they work; and then
  • We have a release process where new container images are generated, and more thorough tests are done.

The reasoning behind this development -> QA -> release pipeline is sound. It allows the engineering team to focus on their tasks without being overwhelmed by an influx of customer requests. It also creates space for product evolution and ensures the quality one would expect from the thorough QA/release process.

However, when a customer requires a new feature or a bug fix, this software development lifecycle can backfire, because we are not sure our changes will address all aspects of the customer's environment. For example, we may reproduce an issue, compose a fix, etc., but when the change is deployed, the customer's database may be using a different encoding, or their VPN server may require a specific cipher, and so on.

Technically, it is not hard to test changes earlier. We can release separate images, use feature flags, etc., and customers always have development and UAT environments anyway.
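
For instance, here is a minimal feature-flag sketch in Python (the flag source, flag names, and customer IDs are hypothetical; a real setup might read flags from a config map or a flag service rather than an environment variable):

    import os

    def is_enabled(flag: str, customer_id: str) -> bool:
        """Enable a flag only for customers listed in FLAG_<NAME> (comma-separated)."""
        allowed = os.environ.get(f"FLAG_{flag.upper()}", "")
        return customer_id in {c.strip() for c in allowed.split(",") if c.strip()}

    def export_logs(customer_id: str) -> None:
        if is_enabled("new_log_exporter", customer_id):
            print(f"{customer_id}: using the new exporter (pilot path)")
        else:
            print(f"{customer_id}: using the released exporter")

    if __name__ == "__main__":
        # e.g. FLAG_NEW_LOG_EXPORTER="acme" enables the pilot path for acme only
        for customer in ("acme", "globex"):
            export_logs(customer)

This would let a single customer exercise the new code path against their real environment while everyone else stays on the released behaviour.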

The problem is the policy. I'd like to suggest we use a different workflow in situations where we need to test against the real environments.

So, is there some known process where a development team can test a developing change together with a customer, bypassing the default SDLC?

brandizzi

2 Answers

4

I deal with this exact problem constantly at my work. Here are three ways we have dealt with it for different clients:

Blue/Green

You keep two production environments. One is open to the public; the other is for pilot testing only. When testing passes, you swap the environments, making go-live a breeze.
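
In a Kubernetes setup like the one the question describes, the swap can be as small as repointing a Service's selector. A sketch using the official Python client (the Service name, namespace, and the "slot" label are assumptions, not a prescribed layout):

    from kubernetes import client, config

    SERVICE = "storefront"   # hypothetical public-facing Service
    NAMESPACE = "prod"

    def go_live(slot: str) -> None:
        """Point the public Service at the 'blue' or 'green' Deployments."""
        config.load_kube_config()
        body = {"spec": {"selector": {"app": SERVICE, "slot": slot}}}
        client.CoreV1Api().patch_namespaced_service(
            name=SERVICE, namespace=NAMESPACE, body=body
        )

    # After pilot testing passes on the idle environment:
    # go_live("green")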

The main sticking point with this methodology is the database. There can be only one database (which means db changes require special handling) and it can only contain production data (which means testing on it is a risk, e.g. your tests might accidentally send emails to real customers).

While this may seem like an expensive option (it doubles the hardware cost), many customers already have a second production environment for disaster recovery-- usually just sitting there gathering dust. This is a way to put it to work.

Outage window

Testing is done in lower environments, and the change is then promoted to production during an outage window. During the outage window, the firewall is manipulated so that only QA can access production.
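
If the "firewall" is Kubernetes network policy rather than a hardware appliance, opening and closing the window could look roughly like this (the namespace, policy name, and team=qa label are hypothetical, and it assumes the QA tooling runs in-cluster):

    from kubernetes import client, config

    NAMESPACE = "production"
    POLICY_NAME = "outage-window-qa-only"

    # Deny all ingress to the namespace except from pods labelled team=qa.
    qa_only_policy = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": POLICY_NAME},
        "spec": {
            "podSelector": {},              # applies to every pod in the namespace
            "policyTypes": ["Ingress"],
            "ingress": [
                {"from": [{"podSelector": {"matchLabels": {"team": "qa"}}}]}
            ],
        },
    }

    def open_outage_window() -> None:
        config.load_kube_config()
        client.NetworkingV1Api().create_namespaced_network_policy(
            namespace=NAMESPACE, body=qa_only_policy
        )

    def close_outage_window() -> None:
        config.load_kube_config()
        client.NetworkingV1Api().delete_namespaced_network_policy(
            name=POLICY_NAME, namespace=NAMESPACE
        )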

If bugs are found, the client can either fix forward (e.g. for a simple configuration issue) or roll back and reschedule. The latter is required for code defects.

When analyzing the defects, the emphasis is on reproducing the production issue in the lower environment first. If it cannot be reproduced, special attention is given to the differences between the lower environment and production-- if possible, the environments are brought into sync. If not possible, QA notes the delta in their test plans for future releases. This way you get continuous improvement.

Signoff on risk

We get the client to sign off on the risk that some additional defects may be found in production if test environments are not production-like. The signoff includes an agreement on who will pay for fixing such defects.

Once you have the signoff, production defects are not necessarily a bad thing; they are sales, in fact, with a captive customer. That being said, a good partner should try to identify the risks beforehand and mitigate them somehow, e.g. by using network test tools to prove connectivity to new network connections, or by incorporating feature flags to turn the new feature off while issues are resolved.
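
A "network test tool" does not have to be elaborate; even a plain TCP reachability check run from the deployment target can prove connectivity before go-live (the endpoints below are placeholders):

    import socket

    def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
        """Return True if a TCP connection to host:port succeeds within the timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    # Hypothetical endpoints the new feature depends on.
    checks = {
        "customer VPN endpoint": ("vpn.customer.example", 1194),
        "customer database": ("db.customer.example", 5432),
    }

    for name, (host, port) in checks.items():
        status = "OK" if can_reach(host, port) else "UNREACHABLE"
        print(f"{name} ({host}:{port}): {status}")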

John Wu
-1

Well, apparently this concept of testing software against a customer's environment before release does not have a name. From what I can see, that is because the practice is quite prevalent, so why name something that common?

But since I needed one, I called it "field testing" in my presentation.

brandizzi