9

Might be kind of an odd question.

A guy writing a C++ compiler (or whatever non-VM language): Does he need to be able to read/write raw machine language? How does that work?

EDIT: I am specifically referring to compilers that compile to machine code, not to some other programming language.

Aviv Cohn
  • 21,538

5 Answers

14

No, not at all. It is perfectly possible (and often even preferable) for your compiler to emit assembly code instead. The assembler then takes care of creating the actual machine code.

By the way, the distinction you draw between non-VM and VM implementations is not as useful as it might seem.

  • For starters, using a VM or compiling ahead of time to machine code are just different ways to implement a language; in most cases a language can be implemented with either strategy. I actually had to use a C++ interpreter once.

  • Also, many VMs such as the JVM have both a binary machine code and an assembly language, just like an ordinary architecture.

The LLVM (which is used by the Clang compilers) deserves special mention here: it defines a VM whose instructions can be represented as byte code, as textual assembly, or as a data structure that is very easy to emit from a compiler. So although it would be useful for debugging (and for understanding what you are doing), you wouldn't even have to know the assembly language, only the LLVM API.

The nice thing about the LLVM is that its VM is just an abstraction, and that the byte code isn't usually interpreted, but transparently JITted instead. So it's entirely possible to write a language that's effectively compiled, without ever having to know about your CPU's instruction set.

amon
  • 135,795
9

No. The key point of your question is that "compilation" is an extremely broad term: compilation can happen from any language to any language, and assembly/machine code is only one of many possible targets. For example, Java and the .NET languages (C#, F#, VB.NET) all compile to an intermediate code instead of machine-specific code; the fact that the result then runs on a VM doesn't make the language any less compiled. There is also the option of compiling to some other high-level language, such as C, which is actually quite a popular compilation target that many tools use. And finally, you can use a tool or library to do the hard work of producing machine code for you. There is, for example, LLVM, which can greatly reduce the effort needed to create a standalone compiler.

Also, your edit doesn't make much sense. It is like asking "Does every engineer need to understand how an engine works? And I'm asking about engineers working on engines." If you are working on a program or library that emits machine code, then of course you have to understand it. The point is that you don't have to do such a thing when writing a compiler. Many people have done it before you, so you need a serious reason to do it again.

Euphoric
  • 38,149
3

Classically a compiler has three parts: lexical analysis, parsing, and code generation. Lexical analysis breaks the text of the program up into language keywords, names, and values. Parsing figures out how the tokens that come from lexical analysis combine into syntactically correct statements of the language. Code generation takes the data structures produced by the parser and translates them into machine code or some other representation. Nowadays lexical analysis and parsing may be combined into a single step.

Clearly the person writing the code generator has to understand the target machine code at a very deep level, including instruction sets, processor pipelines and cache behavior. Otherwise the programs produced by the compiler would be slow and inefficient. They very well might be able to read and write machine code as represented by octal or hexadecimal numbers, but they'll generally write functions to generate the machine code, referring internally to tables of machine instructions. Theoretically the folks writing the lexer and the parser might not know anything about the generation of the machine code. In fact, some modern compilers let you plug in your own code generation routines which might emit machine code for some CPU the lexer and parser writers have never heard of.

However, in practice compiler writers at each step know a lot about different processor architectures, and that helps them design the data structures the code generation step will need.

2

A long time ago I wrote a compiler that converted between two different shell scripting languages. It went nowhere near machine code.

A compiler writer has to understand their output, but that output is often not machine code.

Most programmers will never write a compiler that outputs machine code or assembly code, but custom compilers can be very useful on lots of projects to produce other outputs.

YACC is one such compiler that does not output machine code; it emits C source instead.

Ian
  • 4,623
0

You don't need to start with a detailed knowledge of the semantics of your input and output languages, but you had better finish with an exquisitely detailed knowledge of both; otherwise your compiler will be unusably buggy. So if your input is C++ and your output is some specific machine language, you will eventually need to know the semantics of both.

Here are some of the subtleties in compiling C++ to machine code: (just off the top of my head, I'm sure there are more I'm forgetting.)

  1. What size will int be? The "correct" choice here is an art, based on the natural pointer size of the machine, the performance of the ALU for various sizes of arithmetic operations, and the choices made by existing compilers for the machine. Does the machine even have 64-bit arithmetic? If not, then addition of 32-bit integers should translate to a single instruction, while addition of 64-bit integers should translate to a call to a helper function that performs the 64-bit add. Does the machine have 8-bit and 16-bit add operations, or do you have to simulate those with 32-bit ops and masking (e.g. the DEC Alpha 21064)?

  2. What is the calling convention used by other compilers, libraries and languages on the machine? Do the parameters get pushed on the stack right-to-left or left-to-right? Do some parameters go in registers while others go on the stack? Are ints and floats in different register spaces? Do the register allocated parameters need to get treated specially on varargs calls? Which registers are caller-saved and which are callee-saved? Can you perform leaf-call optimizations?

  3. What does each of the machine's shift instructions do? If you ask to shift a 64-bit integer by 65 bits, what is the result? (On many machines the result is the same as shifting by 1 bit; on others the result is 0.)

  4. What are the memory consistency semantics of the machine? C++11 has a well-defined memory model that places restrictions on some optimizations in some cases but permits optimizations in others. If you are compiling a language that does not have well-defined memory semantics (like every version of C/C++ before C++11, and many other imperative languages), then you will have to invent the memory semantics as you go along, and usually you will want to invent semantics that best match those of your machine.