
I am wondering what possible merits copy-on-write has. Naturally, I don't expect personal opinions, but real-world practical scenarios where it can be technically and practically beneficial in a tangible way. And by tangible I mean something more than saving you the typing of a & character.

To clarify, this question is in the context of datatypes where assignment or copy construction creates an implicit shallow copy, but modifying the copy creates an implicit deep copy and applies the changes to that copy instead of the original object.

The reason I am asking is that I don't seem to find any merit in having COW as a default implicit behavior. I use Qt, which has COW implemented for a lot of the datatypes, practically all of which have some underlying dynamically allocated storage. But how does it really benefit the user?

An example:

QString s("some text");
QString s1 = s; // now both s and s1 internally use the same resource

qDebug() << s1; // const operation, nothing changes
s1[0] = 'z'; // s1 "detaches" from s, allocates new storage and modifies first character
           // s is still "some text"

What do we win by using COW in this example?

If all we intend to do is use const operations, s1 is redundant, might as well use s.

If we intend to change the value, then COW only delays the resource copy until the first non-const operation, at the (albeit minimal) cost of incrementing the ref count for the implicit sharing and detaching from the shared storage. It does look like all the overhead involved in COW is pointless.

It is not much different in the context of parameter passing: if you don't intend to modify the value, pass by const reference; if you do want to modify it, either make an explicit deep copy (if you don't want to touch the original object) or pass by reference (if you do). Again, COW seems like needless overhead that doesn't achieve anything, and it only adds a limitation: you cannot modify the original value even if you want to, since any change will detach from the original object.

So depending on whether you know about COW or are oblivious to it, it may either result in code with obscure intent and needless overhead, or completely confusing behavior which doesn't match the expectations and leaves you scratching your head.

To me it seems that there are more efficient and more readable solutions, whether you want to avoid an unnecessary deep copy or you intend to make one. So where is the practical benefit of COW? I assume there must be some benefit, since it is used in such a popular and powerful framework.

Furthermore, from what I've read, COW is now explicitly forbidden for std::string in the C++ standard library. I don't know whether the cons I see in it have something to do with that, but either way, there must be a reason for it.

dtech

4 Answers


Copy on write is used in situations where you very often will create a copy of the object and not modify it. In those situations, it pays for itself.

As you mentioned, you can pass a const object, and in many cases that is sufficient. However, const only guarantees that the caller can't mutate it (unless they const_cast, of course). It does not handle multithreading cases and it does not handle cases where there are callbacks (which might mutate the original object). Passing a COW object by value puts the challenges of managing these details on the API developer, rather than the API user.

The new rules in C++11 forbid COW for std::string in particular. Iterators into a string must be invalidated if the backing buffer is detached; if an iterator is implemented as a plain char* (as opposed to a string* plus an index), those iterators become dangling. The C++ community had to decide how often iterators could be invalidated, and the decision was that operator[] should not be one of those cases. operator[] on a non-const std::string returns a char&, which may be written through. Thus, under COW, operator[] would need to detach the string, invalidating iterators. This was deemed a poor trade, and unlike functions like end() and cend(), there's no way to ask for the const version of operator[] short of const-casting the string. (related).

COW is still alive and well outside of the STL. In particular, I have found it very useful in cases where it is unreasonable for a user of my APIs to expect that there's some heavyweight object behind what appears to be a very lightweight object. I may wish to use COW in the background to ensure they never have to be concerned with such implementation details.

Cort Ammon

For strings and such, it seems like it would pessimize the most common use cases: strings are often small, and there the overhead of COW tends to far outweigh the cost of simply copying the small string. A small buffer optimization makes much more sense to me there, avoiding the heap allocation in such cases rather than the string copies.

If you have a heftier object, however, like an android, and you wanted to copy it and just replace its cybernetic arm, COW seems quite reasonable as a way to keep a mutable syntax while avoiding the need to deep copy the entire android just to give the copy a unique arm. Making it just immutable as a persistent data structure at that point might be superior, but a "partial COW" applied on individual android parts seems reasonable for these cases.

In such a case the two copies of the android would share/instance the same torso, legs, feet, head, neck, shoulders, pelvis, etc. The only data which would be different between them and not shared is the arm which was made unique for the second android on overwriting its arm.


The point is that COW has nearly zero cost when neither copy is changed (just a reference-count update), and little additional cost when one copy is changed.

Swift uses COW for strings, arrays and dictionaries. Each of these is implemented as a struct with a pointer to a data object, and passing one as a parameter or assigning it copies the tiny struct and increments the reference count of the data object. Then if either the original or the copy is modified, its data object is copied: the modified one gets a fresh data object with a reference count of 1, and the reference count on the unchanged data object is decremented.

In addition, there is a type for substrings which shares the data with the original, plus the bounds of the substring, so "substring starting at index 3" doesn't allocate new data. The [] subscript has separate accessors for reading and writing a character, so merely reading a character doesn't trigger copying.

And strings are often large. It’s not unusual to read a multi-megabyte file into a string.

gnasher729

Optimizing Away the Need To Perform Expensive Deep Copies

It can be very useful for multithreading but not in the type of example you provided. For small strings, it would be so much more efficient to avoid any ref counting/GC and just deep copy and ideally with a small buffer optimization as I'm sure you realize.

But consider a game example where you have systems that want to operate in a parallel pipeline like a physics system, AI system, rendering system, etc, all inputting a game scene and producing a modified output so that the physics system can be working on frame 2 while the rendering system is still rendering frame 1. Very few game engines avoid a serial pattern here across systems (they might multithread the work done within a system like with parallel loops, but not achieve a parallel pipeline across systems)[*].

  • Most gamedevs I've talked to on the game development section of Stack Exchange, including very high-profile ones, consider it more trouble than it's worth to even double-buffer game state to allow just two systems to run in parallel, not to mention triple-buffering to allow three, quadruple-buffering for four, and so forth. But they might not have considered copy-on-write data structures, which can trivialize the effort.

Massive, Shared, Mutable Data

Allowing these systems to operate in parallel means that they cannot share mutable data without locks. They can have their own thread-local copy of unshared mutable data, or shared immutable data, but they cannot share mutable data without locking and bottlenecking other threads on access. Avoiding either the sharing or the mutability (we only need one to avoid bottlenecking threads) may or may not be tricky depending on the game engine.

With basic game engines that deal mostly with unchanging scene data, we can probably easily and cleanly separate the immutable scene data that can be freely shared without locking from the small subset that is mutable which can be locally deep copied for each thread to avoid the sharing (and as John Carmack pointed out, this may only have to be some megabytes for many games, not hundreds of megabytes or gigabytes). The design constraints allow that for most orthodox game engines which don't even have the possibility of mutating much scene data per frame.

For example, most game engines don't even offer the ability to freely mutate hefty character models for bone deformations or facial animations in any arbitrary frame while the game is running (only their animation parameters, like bone matrices). Instead they generate the deformed version on the fly in a vertex shader and so the bulk of the hefty game data is immutable and easy to separate from the mutable given the heavy engine-imposed restrictions on what's allowed to change per frame.

Designs That Cannot Anticipate What Will Mutate

Yet consider a very innovative game, doing things very differently from orthodox AAA engines, that uses a freely destructible voxel environment with voxels much smaller than Minecraft's, maybe even close to pixel resolution or less at normal viewing distances. The sheer amount of environment destructibility means that almost all the hefty data of the scene is mutable and can be changed by user input at any given time. Here even generating the results in a shader would still require treating enormous amounts of data as mutable per frame, as the input parameters are no longer simple matrices or vectors or scalars affecting things at a whole-model level, but parameters affecting things at the individual-voxel level, with billions of voxels.

That's going to involve a scene that might easily require hundreds of megabytes to gigabytes of data that could be mutated at any given time (we cannot possibly anticipate what might change in a frame given such user freedom) per deep copy of the scene, even with a very efficient sparse voxel octree that can compress voxels down to less than a byte in size. What would normally have to be treated as inevitably shared and mutable data in this case is enormous. Eliminating the sharing of the mutable data via thread-local deep copies might require deep copying this enormous amount of data in close to its entirety per thread per frame, which might easily cost more time in copying alone than the thread needs to do its actual work, not to mention the explosive memory use.

Automating Away the Sharing of Mutable Data With COW

Copy-on-write comes to the rescue in this case, where we cannot possibly anticipate in advance what will be mutated in this enormous scene, since it automatically keeps the unmodified parts of the scene as a shared, immutable shallow copy. If one thread -- like the physics thread working on frame 4 while the AI system is working on frame 3 and the rendering system is working on frame 2 -- wants to modify a small section of the scene, only that small section is deep-copied on write. The other threads can keep churning away and doing their thing, while regular copying stays relatively dirt cheap and shallow (at least for data that spans hundreds of megabytes or more).

Writing/mutation becomes a bit more expensive as a result of the atomic ref counting or GC but very often at least in the types of scenarios I deal with, a thread might only need to modify 1 megabyte worth of data while the scene spans an entire gigabyte. It's more than a worthwhile exchange and instead a fantastic bargain to avoid having to deep copy the entirety of that scene data at the relatively trivial expense of some atomic operations and small, partial deep copies of the smallest subset of the scene to avoid what would otherwise be hundreds of megabytes to gigabytes of data deep copied per-frame per-thread.

Conclusion

So apologies for the long-windedness, but this is at least one use case where I think copy-on-write has to be both the most efficient and elegant solution: in cases where the shared mutable data is too massive to be deep copied left and right for each thread for every single frame but only a subset of it is actually modified per thread per frame (but in ways that are impossible for designers to anticipate in advance).

A Note

It's also worth noting that persistent data structures in functional languages rely on the same idea behind the scenes: structural sharing, where a "modification" copies only the nodes along the path being changed and shares everything else, which is copy-on-write at the implementation level. They may not expose the mutable interface to the user in the ways we might want in an imperative language like C++, but under the hood it works this way. So fans of languages like Haskell or Clojure are using COW all over the place at the implementation level, even if they're never exposed to it and only deal with conceptually read-only interfaces to immutable data structures.