4

How is Bytecode "parsed"?

It is my understanding that bytecode is a binary, intermediate representation of the syntax of a given programming language. Certain programming languages convert their source text into bytecode, which is then written to a file. How do the virtual machines of those languages "parse" their bytecode?

To narrow down this question, take Python's Bytecode for instance. When the Python Virtual Machine begins reading Bytecode from a *.pyc file, how does the Virtual Machine translate the stream of bytes it is reading, into specific instructions?

When the Virtual Machine reads bytecode from a file, it is my understanding that the bytecode is one long stream of bytes. How, then, is the bytecode broken into useful chunks? How is it transformed into an opcode and the opcode's arguments?

For example, say the Virtual Machine was reading in the bytecode to add two numbers. The Virtual Machine sees the instruction 0x05, which would mean "add two numbers".

Each number could be represented by a different number of bytes, so how would the Virtual Machine know how many bytes it would need to read ahead to gather the arguments for the op 0x05?

Chris

3 Answers

11

I think your confusion comes from thinking of bytecode as a language that is being interpreted by the virtual machine. While this is technically a correct way to describe it, it is leading you to some incorrect assumptions.

The first thing to understand is that bytecode is a type of machine code. The only thing that makes it different from the machine code that your CPU understands is that the machine in this case is virtual (although hardware that executes bytecode directly is possible). This might seem like a big distinction, but if you consider what emulators do, whether the target machine is virtual or not isn't of much importance in the context of the machine language.

Machine code is easy for computers to parse because it is expressly built to make that easy. The main distinction between machine languages and the higher-level languages most people are familiar with is that the latter are generally built to be easy for humans to use.

This 1997 article on Java bytecode might help. Let's walk through an example from that text:

84 00 01

The first byte (called the opcode) is 84 (hexadecimal). We can look up what that opcode means and find that it is iinc (increment local variable #index by signed byte const) and that the two following bytes give the index of the variable and the amount, respectively. The JVM then takes that instruction and translates it (following the language specification) into machine instructions that correspond to the bytecode instruction.
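
To make the lookup step concrete, here is a minimal sketch of a table-driven decoder. It is not how any real JVM is implemented; the table `JVM_OPCODES` and the function `decode` are names made up for this illustration, and the table covers only four real JVM opcodes. The point is that the opcode alone tells the decoder how many operand bytes follow.

```python
# Illustrative subset of the JVM instruction set:
# opcode byte -> (mnemonic, number of operand bytes that follow)
JVM_OPCODES = {
    0x10: ("bipush", 1),   # push a one-byte constant
    0x1A: ("iload_0", 0),  # load local variable 0
    0x60: ("iadd", 0),     # add the two ints on top of the stack
    0x84: ("iinc", 2),     # operands: local variable index, signed constant
}


def decode(code: bytes):
    """Split a byte stream into (offset, mnemonic, operand bytes) tuples."""
    pc = 0
    instructions = []
    while pc < len(code):
        mnemonic, operand_count = JVM_OPCODES[code[pc]]
        operands = list(code[pc + 1 : pc + 1 + operand_count])
        instructions.append((pc, mnemonic, operands))
        # Skip the opcode byte plus however many operand bytes it declared.
        pc += 1 + operand_count
    return instructions


# The example above (84 00 01) followed by three more instructions.
decoded = decode(bytes.fromhex("8400011a100560"))
for offset, mnemonic, operands in decoded:
    print(offset, mnemonic, operands)
```

Note that iadd has no operand bytes at all: on a stack machine the two numbers to add are already on the stack, so the "add" opcode never needs to read ahead.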

JimmyJames supports Canada
4

Bytecodes are decoded. They are designed like a processor instruction set. Because the instructions are variable-length, you cannot tell where one starts just by looking at an arbitrary position in the stream; to decode them, you have to start from a known beginning (usually the start of a method).

When you reach a branch instruction (especially a conditional one), you might choose to follow the branch target or the fall-through path (the next instruction). If you were interpreting, you'd do the former, and when JITing, you'd likely do the latter.

Each encoded byte says something about the instruction to execute as well as its length. Simple, common operations are encoded within a single byte. Other operations use additional bytes. The decoder looks at the values of the bytes read so far and can then determine, at a minimum, whether the instruction is complete or takes one more byte. Some encodings may indicate multiple additional bytes.
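
As a rough sketch of that idea, consider a hypothetical encoding (invented here, not taken from any real VM) in which the top two bits of the first byte give the number of operand bytes and the low six bits give the operation. The function name `decode_stream` is likewise made up. The second call shows why decoding has to start at an instruction boundary: starting two bytes in yields a different, wrong instruction sequence.

```python
def decode_stream(code: bytes):
    """Decode a made-up variable-length encoding.

    Hypothetical layout: bits 7-6 of the first byte = operand byte count,
    bits 5-0 = operation number.
    """
    pc = 0
    instructions = []
    while pc < len(code):
        first = code[pc]
        operand_count = first >> 6   # length information lives in the opcode byte
        operation = first & 0x3F
        operands = code[pc + 1 : pc + 1 + operand_count]
        if len(operands) != operand_count:
            raise ValueError(f"truncated instruction at offset {pc}")
        instructions.append((operation, bytes(operands)))
        pc += 1 + operand_count
    return instructions


stream = bytes([0x05, 0x41, 0x2A, 0x82, 0x01, 0x00])

from_start = decode_stream(stream)       # decoded from the real beginning
mid_stream = decode_stream(stream[2:])   # starts inside an instruction
print(from_start)
print(mid_stream)
```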


Have a look at the Java bytecode class file format, and also at the VAX instruction set architecture, which is variable-length and highly regular. Java bytecode uses a stack architecture and is fairly high level (as it is bytecode), while the VAX is a register machine and low level. (You could also look at x86, but that is less regular and thus more complicated, IMHO.)

Erik Eidt
3

The file will have a small header with information about the version, where the executable bytecode is located (plus maybe information about the functions contained in it) and where the constant data (like strings) is located. On Stack Overflow, the question about Python's bytecode has already been asked.
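
For CPython specifically, the layout is simpler than the general description above. The following is a sketch assuming CPython 3.7 or later, where a .pyc file is a 16-byte header (a 4-byte magic number identifying the bytecode version, then flags and source metadata) followed by a marshalled code object. The helper `load_pyc` and the file name `demo.py` are made up for this example.

```python
import importlib.util
import marshal
import os
import py_compile
import tempfile
import types


def load_pyc(path):
    """Return (magic number, code object) from a .pyc file.

    Assumes the 16-byte header used by CPython 3.7 and later.
    """
    with open(path, "rb") as f:
        data = f.read()
    magic = data[:4]
    code = marshal.loads(data[16:])  # everything after the header
    return magic, code


with tempfile.TemporaryDirectory() as tmp:
    source_path = os.path.join(tmp, "demo.py")
    with open(source_path, "w") as f:
        f.write("x = 1 + 2\n")
    pyc_path = py_compile.compile(
        source_path, cfile=os.path.join(tmp, "demo.pyc"), doraise=True
    )
    magic, code = load_pyc(pyc_path)

print(magic == importlib.util.MAGIC_NUMBER)
print(code.co_consts)  # constants live on the code object, next to the bytecode
```

The code object carries the raw instruction bytes in `co_code` and the constants in `co_consts`, so the VM never has to search the file for them.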

The bytecode itself very often has a very simple syntax, where the first few bytes indicate what operation has to be performed and what operands are needed. The bytecode will be designed so that, when reading byte by byte, there is an unambiguous interpretation of the instructions.
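
CPython is an easy case to inspect. As a sketch, assuming CPython 3.6 or later: there, the bytecode is a sequence of two-byte code units, one opcode byte followed by one argument byte, so the interpreter always knows where the next unit starts. (Arguments that do not fit in one byte are built up with `EXTENDED_ARG` prefix instructions.) The function `add` below is just a sample to disassemble.

```python
import dis


def add(a, b):
    return a + b


raw = add.__code__.co_code  # the raw instruction bytes

# Split the stream into fixed-size (opcode, argument) units.
units = [(raw[i], raw[i + 1]) for i in range(0, len(raw), 2)]
names = [dis.opname[opcode] for opcode, _ in units]

for (opcode, arg), name in zip(units, names):
    print(f"{opcode:3d} {arg:3d}  {name}")
```

The exact opcodes printed differ between CPython versions, which is one reason the .pyc header records the bytecode version.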

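To give an example that makes the length of each operation very explicit, there is SPIR-V. The first 4-byte word of each instruction combines a 16-bit word count (the length of the whole instruction in 4-byte words, including this first word) in the high half with a 16-bit opcode in the low half.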
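
A small sketch of walking such a stream follows. The function name `walk` is made up, and the input is just the instruction words: a real SPIR-V module starts with a five-word header that this example skips. The two instructions used are OpCapability (opcode 17) and OpMemoryModel (opcode 14).

```python
def walk(words):
    """Split a list of 32-bit words into (opcode, operand words) pairs."""
    position = 0
    instructions = []
    while position < len(words):
        first = words[position]
        word_count = first >> 16   # high 16 bits: total words in this instruction
        opcode = first & 0xFFFF    # low 16 bits: which operation this is
        if word_count == 0:
            raise ValueError(f"invalid word count at word {position}")
        operands = words[position + 1 : position + word_count]
        instructions.append((opcode, operands))
        position += word_count     # jump straight to the next instruction
    return instructions


# OpCapability Shader, then OpMemoryModel Logical GLSL450
instructions = walk([0x00020011, 1, 0x0003000E, 0, 1])
print(instructions)
```

Because the length is stored up front, a tool can skip over instructions it does not understand without knowing anything about their operands.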

ratchet freak