13

I recall from my days of programming in C that when two strings are joined, the program must allocate memory for the joined string, copy all the string text over to the new area in memory, and then manually release the old memory. So if this is done multiple times, as in the case of joining a list, the program has to constantly allocate more and more memory, just to have it released after the next concatenation. A much better way to do this in C would be to determine the total size of the combined strings and allocate the necessary memory for the entire joined list of strings up front.

Now in modern programming languages (C# for instance), I commonly see the contents of collections being joined together by iterating through the collection and adding all the strings, one at a time, to a single string reference. Is this not inefficient, even with modern computing power?
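Something like this, to sketch the pattern I mean (C#; the lines collection here is just a stand-in for whatever is being joined):

    using System.Collections.Generic;

    // Naive concatenation: every += allocates a brand-new string and
    // copies everything accumulated so far into it.
    var lines = new List<string> { "a", "b", "c" };   // stand-in for the real collection
    string result = "";
    foreach (string line in lines)
    {
        result += line;
    }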

JSideris
  • 601

6 Answers

22

Your explanation of why it is inefficient is accurate, at least in the languages I am familiar with (C, Java, C#), though I would disagree that it is universally common to perform massive amounts of string concatenation. In the C# code I work on, there is copious use of StringBuilder, String.Format, etc., which are all memory-saving techniques that avoid over-reallocation.

So to get to the answer to your question, we must ask another question: if it's never really an issue to concatenate strings, why would classes like StringBuilder and StringBuffer exist? Why is the use of such classes included in even semi-beginner programming books and classes? Why would seemingly premature optimization advice be so prominent?

If developers who concatenate strings were to base their answer purely on experience, most would say it never makes a difference and would shun such tools in favor of the "more readable" for (int i=0; i<1000; i++) { strA += strB; }. But they never measured it.

The real answer to this question can be found in this SO answer, which reveals that in one instance, concatenating 50,000 strings (which, depending on your application, may be a common occurrence), even small ones, resulted in a 1000x performance hit.

If performance literally doesn't mean anything at all, by all means concatenate away. But I would disagree that using the alternatives (StringBuilder) is difficult or less readable, so using them is a reasonable programming practice that shouldn't invoke the "premature optimization" defense.
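For comparison, here is a rough sketch of the earlier loop rewritten with StringBuilder (the string being appended is made up; the names just mirror the snippet above):

    using System.Text;

    string strB = "some piece of text";   // stands in for whatever was being appended

    // StringBuilder appends into an internal, growable buffer, so only the
    // final ToString() call produces a full copy of the result.
    var sb = new StringBuilder();
    for (int i = 0; i < 1000; i++)
    {
        sb.Append(strB);
    }
    string strA = sb.ToString();

Hardly harder to read than the += version, which is the point.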

UPDATE:

I think what this comes down to is: know your platform and follow its best practices, which are sadly not universal. Two examples from two different "modern languages":

  1. In another SO answer, the exact opposite performance characteristics (array.join vs. +=) were sometimes found to hold in JavaScript. In some browsers, string concatenation appears to be optimized automatically, and in other cases it isn't. So the recommendation (at least in that SO question) is to just concatenate and not worry about it.
  2. In another case, a Java compiler can automatically replace concatenation with a more efficient construct such as StringBuilder. However, as others have pointed out, this is nondeterministic and not guaranteed, and using StringBuilder doesn't hurt readability. In this particular case, I would recommend against using concatenation for large collections or relying on nondeterministic Java compiler behavior. In .NET, by contrast, no optimization of this sort is performed at all.
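If you're not sure what your own compiler and runtime do, a crude timing sketch along these lines (not from any of the linked answers; the constants are arbitrary) will show the difference quickly:

    using System;
    using System.Diagnostics;
    using System.Text;

    const int N = 50000;
    const string piece = "abc";

    // Naive +=: each iteration copies everything accumulated so far,
    // so the total work grows roughly quadratically with N.
    var sw = Stopwatch.StartNew();
    string s = "";
    for (int i = 0; i < N; i++) s += piece;
    sw.Stop();
    Console.WriteLine($"+=            : {sw.ElapsedMilliseconds} ms ({s.Length} chars)");

    // StringBuilder: appends go into a growable buffer, with one final copy.
    sw = Stopwatch.StartNew();
    var sb = new StringBuilder();
    for (int i = 0; i < N; i++) sb.Append(piece);
    string t = sb.ToString();
    sw.Stop();
    Console.WriteLine($"StringBuilder : {sw.ElapsedMilliseconds} ms ({t.Length} chars)");

Run it on your target runtime rather than trusting any particular numbers.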

It's not exactly a cardinal sin to not know every nuance of every platform right away, but ignoring important platform issues like this would almost be like moving from Java to C++ and not caring about deallocating memory.

2

It is not efficient, roughly for the reasons you described. Strings in C# and Java are immutable. Operations on strings return a separate instance instead of modifying the original one, unlike in C. When concatenating multiple strings, a separate instance is created at each step. Allocating and later garbage collecting those unused instances can cause a performance hit; this time, though, the memory management is handled for you by the garbage collector.

Both C# and Java provide a StringBuilder class as a mutable string specifically for this type of task. An equivalent in C would be keeping a linked list of the strings to concatenate instead of repeatedly joining them into a single array. C# also offers a convenient String.Join method for joining a collection of strings.
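A quick sketch of both options in C# (the collection contents are made up for illustration):

    using System;
    using System.Collections.Generic;
    using System.Text;

    var words = new List<string> { "alpha", "beta", "gamma" };

    // String.Join handles the intermediate buffering internally.
    string joined = string.Join(", ", words);

    // StringBuilder appends into a growable buffer; ToString() makes one copy.
    var sb = new StringBuilder();
    foreach (string w in words)
        sb.Append(w);
    string combined = sb.ToString();

    Console.WriteLine(joined);    // alpha, beta, gamma
    Console.WriteLine(combined);  // alphabetagamma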

scrwtp
  • 4,542
1

Strictly speaking it is a less efficient use of CPU cycles, so you are correct. But what about developer time, maintenance costs, etc.? If you add the cost of time to the equation, it is almost always more efficient to do what's easiest, then, if needed, profile and optimize the slow bits.
"The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet."

mattnz
  • 21,490
1

It's very hard to say anything about performance without a practical test. Recently I was very surprised to find out that in JavaScript a naïve string concatenation was usually faster than the recommended "make list and join" solution (test here, compare t1 with t4). I'm still puzzled about why that happens.

A few questions you might ask when reasoning about performance (especially concerning memory usage) are: 1) how big is my input? 2) how smart is my compiler? 3) how does my runtime manage memory? This is not exhaustive, but it's a starting point.

  1. How big is my input?

    A complex solution will often have a fixed overhead, maybe in the form of extra operations to be performed, or maybe in extra memory needed. Since those solutions are designed to handle big cases, the implementors will usually have no problem introducing that extra cost, since the net gain is more important than micro-optimizing the code. So, if your input is sufficiently small, a naïve solution may well perform better than the complex one, if only by avoiding this overhead. (Determining what is "sufficiently small" is the hard part, though.)

  2. How smart is my compiler?

    Many compilers are smart enough to "optimize away" variables that are written to but never read. Likewise, a good compiler might be able to convert a naïve string concatenation into a use of a (core) library class and, if many concatenations are made without any reads in between, skip converting back to a string between those operations (even if your source code seems to do just that). I can't tell whether any compilers out there do that, or to what extent (AFAIK the Java compiler at least replaces several concatenations in the same expression with a sequence of StringBuffer operations), but it's a possibility. See the C# sketch after this list for one case you can check yourself.

  3. How does my runtime manage memory?

    In modern CPUs the bottleneck is usually not the processor but the cache; if your code accesses many "distant" memory addresses in a short time, the time it takes to move all that memory between the cache levels outweighs most optimizations in the instructions used. That's of particular importance in runtimes with generational garbage collectors, since the most recently created variables (inside the same function scope, for instance) will usually sit in contiguous memory addresses. Those runtimes also routinely move memory back and forth between method calls.

    One way this could affect string concatenation (disclaimer: this is a wild guess; I'm not knowledgeable enough to say for sure) would be if the memory for the naïve version were allocated close to the rest of the code that uses it (even if it allocates and releases it multiple times), while the memory for the library object were allocated far from it (so the many context changes while your code computes, the library consumes, your code computes more, etc., would generate many cache misses). For big inputs, on the other hand, the cache misses will happen anyway, so the problem of multiple allocations becomes more pronounced.
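As a concrete case for point 2 that you can check yourself: as far as I know, the C# compiler turns a chain of + inside a single expression into one String.Concat call, but it does not rewrite a loop of += into a StringBuilder (inspect the generated IL on your version to confirm; the variable names below are made up):

    // Single expression: the compiler can emit one String.Concat(...) call,
    // so no intermediate strings are built along the way.
    string firstName = "Ada";
    string lastName = "Lovelace";
    string full = firstName + " " + lastName + "!";

    // Loop: each iteration allocates a fresh string holding everything so far;
    // no StringBuilder is introduced on your behalf.
    var parts = new[] { "one", "two", "three" };
    string all = "";
    foreach (string part in parts)
        all += part;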

That said, I'm not advocating the use of this or that method, only that testing, profiling and benchmarking should precede any theoretical analysis about performance, since most systems nowadays are just too complex to fully understand without deep expertise in the subject.
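In that spirit, one rough way to look at just the allocation side on .NET (assuming a runtime that exposes GC.GetAllocatedBytesForCurrentThread, which recent versions do; the workload below is arbitrary) would be:

    using System;
    using System.Text;

    // Measures how many bytes the given action allocates on this thread.
    static long Allocated(Action action)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        action();
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    long naive = Allocated(() =>
    {
        string s = "";
        for (int i = 0; i < 10000; i++) s += "abc";
    });

    long builder = Allocated(() =>
    {
        var sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) sb.Append("abc");
        sb.ToString();
    });

    Console.WriteLine($"+=            allocated {naive:N0} bytes");
    Console.WriteLine($"StringBuilder allocated {builder:N0} bytes");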

mgibsonbr
  • 504
0

Joel wrote a great article on this subject a while back. As some others have pointed out, it is heavily dependent on the language. Because of the way strings are implemented in C (zero-terminated, with no length field), the standard strcat library routine is very inefficient. Joel presents an alternative, with just a minor change, that is much more efficient.

tcrosley
  • 9,621
-1

Is it inefficient to concatenate strings one at a time?

No.

Have you read 'The Sad Tragedy of Micro-Optimization Theater'?

Jim G.
  • 8,035