
Presuming I have written some sequential code that can be broken down into multiple isolated tasks, it seems it might be efficient to introduce concurrency.

For example:

print(expensive_calc_one() + expensive_calc_two())

Assuming expensive_calc_one and expensive_calc_two are pure functions, but also computationally quite expensive, it would seem sensible for a compiler to optimise the code by introducing concurrency and allowing the two functions to run in parallel.

I know that this also has its downsides (context switching adds overhead, and some computers still only have one logical core).

Are there any compilers which would introduce concurrency into previously non-concurrent code, and are there any general patterns for it (or reasons not to do it)?

Ben

4 Answers


Assuming expensive_calc_one and expensive_calc_two are pure functions

Unfortunately, determining whether a function is pure is equivalent to solving the Halting Problem in the general case. So, you cannot have an Ahead-of-Time compiler which can in the general case decide whether a function is pure or not.

You have to help the compiler by explicitly designing the language in such a way that it has a chance to decide purity: force the programmer to explicitly annotate such functions, for example, or do something like Haskell or Clean, where side effects are clearly isolated using the type system.

but also computationally quite expensive

Unfortunately, determining in an Ahead-of-Time compiler whether a function is "computationally quite expensive" is also equivalent to solving the Halting Problem. So, you would need to force the programmer to explicitly annotate computationally expensive functions for the compiler to parallelize.

Now, if you have to force the programmer to explicitly annotate pure and computationally expensive functions as candidates for parallelization, then is it really automatic parallelization? How is that different from simply annotating functions for parallelization?
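To make that point concrete, here is a minimal Python sketch of what such an annotation-driven scheme might look like. Everything here is hypothetical: parallel_candidate and auto_par_sum are invented names, the function bodies are placeholders, and the "compiler" is just a wrapper that refuses to parallelize anything the programmer has not vouched for.

from concurrent.futures import ProcessPoolExecutor

def parallel_candidate(fn):
    # Hypothetical annotation: the programmer promises fn is pure and
    # expensive, because the compiler cannot decide either property.
    fn._parallel_ok = True
    return fn

@parallel_candidate
def expensive_calc_one():
    return sum(i * i for i in range(10**7))

@parallel_candidate
def expensive_calc_two():
    return sum(i * i * i for i in range(10**7))

def auto_par_sum(*fns):
    # "Automatic" parallelization that only fires on annotated functions;
    # anything unannotated falls back to plain sequential evaluation.
    if all(getattr(f, "_parallel_ok", False) for f in fns):
        with ProcessPoolExecutor() as pool:
            futures = [pool.submit(f) for f in fns]
            return sum(fut.result() for fut in futures)
    return sum(f() for f in fns)

if __name__ == "__main__":  # guard required by process pools on some platforms
    print(auto_par_sum(expensive_calc_one, expensive_calc_two))

Note that all of the interesting information (purity, cost) came from the programmer, not from any analysis; the wrapper merely acts on it.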

Note that some of those problems could be addressed by performing the automatic parallelization at runtime. At runtime, you can simply benchmark a function and see how long it runs, for example. Then, the next time it is called, you evaluate it in parallel. (Of course, if the function performs memoization, then your guess will be wrong.)
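As a sketch of that runtime idea (again, all names are hypothetical): time each call, and only switch to parallel evaluation once both operands have been observed to be slow.

import time
from concurrent.futures import ProcessPoolExecutor

_runtimes = {}   # observed wall-clock time per function
THRESHOLD = 0.1  # hypothetical cut-off, in seconds

def timed_call(fn):
    start = time.perf_counter()
    result = fn()
    _runtimes[fn] = time.perf_counter() - start  # misleading if fn memoizes!
    return result

def adaptive_sum(f, g):
    # Parallelize only if both functions looked expensive on earlier calls.
    if _runtimes.get(f, 0) > THRESHOLD and _runtimes.get(g, 0) > THRESHOLD:
        with ProcessPoolExecutor() as pool:
            a, b = pool.submit(f), pool.submit(g)
            return a.result() + b.result()
    return timed_call(f) + timed_call(g)  # sequential, gathering timings

# usage: adaptive_sum(expensive_calc_one, expensive_calc_two) -- the first
# call runs sequentially while timings are collected; later calls may not.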

Are there any compilers which would introduce concurrency into previously non-concurrent code

Not really. Auto-parallelization has been (one of) the holy grail(s) of compiler research for over half a century, and is still as far away today as it was 50–70 years ago. Some compilers perform parallelization at a very small scale, by auto-vectorization, e.g. performing multiple arithmetic operations in parallel by compiling them to vector instructions (MMX/SSE on AMD64, for example). However, this is generally done on a scale of only a handful of instructions, not entire functions.

There are, however, languages where the language constructs themselves have been designed for parallelism. For example, in Fortress, a for loop executes all its iterations in parallel. That means, of course, that you are not allowed to write for loops where different iterations depend on each other. Another example is Go, which has the go keyword for spawning a goroutine.

However, in this case, you either have the programmer explicitly telling the compiler "execute this in parallel", or you have the language explicitly telling the programmer "this language construct will be executed in parallel". So, it's really the same as, say, Java, except it is much better integrated into the language.

But doing it fully automatically is near impossible, unless the language has been specifically designed with it in mind.

And even if the language is designed for it, you often have the opposite problem now: you have so much parallelism that the scheduling overhead completely dominates the execution time.

As an example: in Excel, (conceptually) all cells are evaluated in parallel. Or more precisely, they are evaluated based on their data dependencies. However, if you were to actually evaluate all formulae in parallel, you would have a massive amount of extremely simple parallel "codelets".
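A toy model of that dependency-driven evaluation (the cell representation here is invented for illustration): cells whose dependencies are all satisfied form a "wave" that could in principle run in parallel, and in a real spreadsheet most waves would consist of huge numbers of trivially cheap formulae.

# Each cell maps to (formula, dependencies).
cells = {
    "A1": (lambda v: 1, []),
    "A2": (lambda v: 2, []),
    "B1": (lambda v: v["A1"] + v["A2"], ["A1", "A2"]),
    "C1": (lambda v: v["B1"] * 10, ["B1"]),
}

values, remaining = {}, dict(cells)
while remaining:
    # All cells whose inputs are ready: an embarrassingly parallel "wave".
    wave = [c for c, (_, deps) in remaining.items()
            if all(d in values for d in deps)]
    for c in wave:  # a real engine could evaluate the whole wave concurrently
        values[c] = remaining.pop(c)[0](values)

print(values)  # {'A1': 1, 'A2': 2, 'B1': 3, 'C1': 30}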

There was, apparently, an experiment in having a Haskell implementation evaluate expressions in parallel. Even though the concurrent abstraction in Haskell (a "spark") is quite lightweight (it is just a pointer to a "thunk", which in turn is just a piece of un-evaluated code), this generated so many sparks that the overhead of managing the sparks overwhelmed the runtime.

When you do something like this, you essentially end up with the opposite problem compared to an imperative language: instead of having a hard time breaking up huge sequential code into smaller parallel bits, you have a hard time combining tiny parallel bits into reasonably-sized sequential bits. While this is semantically easier, because you cannot break code by serializing pure parallel functions, it is still quite hard to get the degree of parallelism and the size of the sequential bits right.

Jörg W Mittag

Compilers are not generally smart enough to do this, in particular because most languages don't have a sufficiently reliable concept of a “pure function”. There can be rather subtle interactions between threads: clearly, they should not access the same memory regions, but the program might also rely on invariants that could be broken by concurrent modifications. Introducing threads is also a fairly drastic change to the original program. Some other aspects of “automatic async/await” were discussed in a related question.

However, given a concurrent program, some compilers/runtimes can make that program run in parallel without substantial extra work. These languages generally have a concept of an executor that is responsible for completing pending tasks, where tasks can be I/O operations or async functions that are being awaited. There can often be different executor implementations, i.e. different event loops, and sometimes even multiple running executors within the same program.

With executors, all I have to say is that my functions are async and where they can be suspended. The runtime is then responsible for deciding how to schedule them, whether in a single-threaded event loop or on multiple threads in parallel. Having at most one thread per CPU and using an in-process scheduler avoids task-switching overhead. Typically, a language with async/await would express your code snippet like this:

# start expensive computations
task_one = expensive_calc_one()
task_two = expensive_calc_two()

# wait for both to complete and print the result

print(await task_one + await task_two)

However, languages differ in how exactly async functions are executed. When an async function is invoked, some languages execute it directly up to the first suspension point, which cannot lead to parallelism. Others create a task without executing any part of the function. CPU-bound tasks that should execute in the “background” might still require extra annotation.
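For instance, in Python's asyncio (one concrete executor implementation), the pseudocode above might look like the sketch below. asyncio.create_task schedules a coroutine without running any of it until the event loop regains control; the sleeps stand in for awaitable work, since purely CPU-bound Python code would not actually run in parallel (see the next paragraph).

import asyncio

async def expensive_calc_one():
    await asyncio.sleep(1)   # stand-in for awaitable (e.g. I/O-bound) work
    return 1

async def expensive_calc_two():
    await asyncio.sleep(1)
    return 2

async def main():
    # create_task schedules both coroutines; neither body has run yet
    task_one = asyncio.create_task(expensive_calc_one())
    task_two = asyncio.create_task(expensive_calc_two())
    # both make progress concurrently; total wait is ~1s rather than ~2s
    print(await task_one + await task_two)

asyncio.run(main())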

Languages that follow this general strategy are C#, Python, Rust, and many others. Python doesn't really support parallelism, though, even when using multiple threads. In Rust, the type system ensures that data can be sent to a different thread, or shared across threads – and prevents the creation of such tasks if it can't. In Go, an await can be simulated with single-element channels, whereas tasks are created with the go keyword. JavaScript can start multiple workers, but the event loop is single-threaded. Executors are also available in Java and Swift, albeit without real await syntax.

Haskell is probably the only language that could feasibly do automatic multi-threading, because every function (that doesn't involve the IO monad) is pure by definition. But I don't think Haskell/GHC actually starts multiple threads implicitly.

amon

Introducing things that run in parallel (not the same as concurrency, as noted in the comments) is actually done quite a bit at runtime.

It sounds like what you're most interested in is instruction-level parallelism, where two instructions run simultaneously, finishing in half(ish) the time it would take to run them sequentially. Specifically, you're talking about out-of-order execution, where independent steps can be run in any order. I think other answers have done a good job of explaining why this is difficult, but I wanted to give a few examples of where the general idea of automatic parallelization is actually being used successfully.

Specifically, if you do this during runtime you can use heuristic data based on what the program is actually doing, rather than (much more complicated) trying to analyze the program statically.

Speculative execution/branch prediction: we have a normal order of (calculate which branch to take) -> (execute that branch). But we can sometimes parallelize this so that we (calculate which branch to take), (execute branch a), and (execute branch b) simultaneously, and then only apply the results of whichever branch it turned out we needed. Sometimes this can be optimized further by recognizing that certain branches are much more likely than others (maybe a loop repeats hundreds of times before choosing the exit-loop branch); by analyzing the program at runtime, the computer can choose which branch to run based on how likely each branch is.
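As a software caricature of that idea (real speculation happens inside the CPU, not in a thread pool, and the names below are invented): start both branches before the condition is known and discard the one that turns out to be wrong. This is only safe when the branches are side-effect-free.

from concurrent.futures import ThreadPoolExecutor

def speculate(condition, branch_a, branch_b):
    # Start all three computations at once and keep only the branch
    # selected by the condition; one branch's work is always wasted.
    with ThreadPoolExecutor() as pool:
        cond = pool.submit(condition)
        a = pool.submit(branch_a)  # speculative work
        b = pool.submit(branch_b)  # speculative work
        return a.result() if cond.result() else b.result()

print(speculate(lambda: 3 > 2, lambda: "took branch a", lambda: "took branch b"))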

Beyond parallelizing your actual program, we can parallelize the work around running your program. Just-in-time compilation: Java used to be S-L-O-W but is now (reasonably) fast, by running the program and compiling it in parallel. The JIT compiler even performs runtime analysis so that it spends more time optimizing the areas of the program where most of the execution time goes, and skips areas that are rarely or never executed.

Some of this can be done at the hardware level too, such as instruction pipelining: if different parts of a processor are used for reading, executing, and writing, then the processor may be reading the next (predicted) instruction while executing the current one and writing the results of the previous one.


No. Yes, sort of, but not the way you think.

For procedural languages like C/C++, Pascal, or Java, it is generally not possible to simply run functions in parallel, because functions can have state, which can end up affecting the result depending on the order of invocation.
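A tiny (contrived) Python example of the problem: both functions read and write shared state, so the result depends on which runs first, and no compiler could safely reorder or parallelize them.

counter = 0  # shared mutable state

def f():
    global counter
    counter += 1
    return counter * 10

def g():
    global counter
    counter += 2
    return counter * 100

print(f() + g())  # 10 + 300 = 310, but g() + f() would give 200 + 30 = 230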

Languages like Go allow you to mark functions for parallel execution as goroutines (the keyword is simply go, echoing the name of the language). However, goroutines can't really be used in the same way as regular functions, since they are executed asynchronously. Unlike multithreading in languages like C, not every goroutine call runs in a separate thread: Go automatically decides how many threads are needed to execute the goroutines, which makes them a lot easier to use than raw threads.

However, there are languages which can sort of do what you want. But it's not because the compiler can do it; it's because the language allows the compiler to make such assumptions. These are pure functional languages: Lisp, Haskell, Erlang, etc.

The reason pure functional languages can do this is that they don't have variables: they only have constants (so much so that most of these languages call their constants "variables"). Having no variables means there are no state changes, which means the compiler can assume it may call any function in any order and the result will be the same, because results depend only on the arguments to functions.

Google's use of Lisp's map and reduce to search through their early database of websites is a good example of automatic parallelisation. Because each invocation of the map callback can be executed independently, what Google originally did to scale was to run their original algorithm on a version of Lisp with a parallel map. They didn't have to change their algorithm; they just had to use a different interpreter/compiler.
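The same swap can be sketched in Python, with multiprocessing.Pool standing in for the parallel Lisp described above (the score function and the data are invented for illustration): because each per-item call is pure, replacing the sequential map with a parallel one leaves the algorithm, and its result, unchanged.

from multiprocessing import Pool

def score(page):
    # Pure per-item work: safe to run in any order, on any worker.
    return sum(ord(ch) for ch in page)

pages = ["some page", "another page", "yet another page"]

if __name__ == "__main__":
    sequential = list(map(score, pages))   # original algorithm
    with Pool() as pool:
        parallel = pool.map(score, pages)  # same algorithm, parallel map
    assert parallel == sequential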

slebetman