34

I was reading a bit about garbage collectors and I am wondering if the garbage collector of a program scans the entire heap memory or what is allocated to it? If it reads the entire system memory, does it mean it is reading memory locations that are used by other applications? I understand that this does not make much sense security or performance wise.

If garbage collector only reads the memory that is allocated to it, how does it mark those areas?

Sorry for the rookie question. I am not a software engineer, and this is purely out of curiosity.

PoJam

7 Answers

60

I was reading a bit about garbage collectors and I am wondering if the garbage collector of a program scans the entire heap memory or what is allocated to it?

That depends on the garbage collector. There are many different kinds of garbage collectors.

For example, Reference Counting Garbage Collectors don't "scan" anything at all! In a Reference Counting Garbage Collector, the system counts references to objects, something like this:

SomeObject foo = new SomeObject();

Let's say, this new object was allocated at memory address 42. The GC records "there is 1 reference to the object at address 42".

SomeObject bar = foo;

Now, the GC records "there are 2 references to the object at address 42".

foo = null;

Now, the GC records "there is 1 reference to the object at address 42".

bar = null;

Now, the GC says "there are 0 references to the object at address 42, therefore, I can collect it".

At no point did the GC "scan" anything.
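The bookkeeping above can be sketched in a few lines. This is a minimal illustration, not how any particular runtime implements it: the `RefCountedHeap` class and its `allocate`, `add_ref`, and `drop_ref` methods are hypothetical names, and the address 42 is simply the one used in the walkthrough.

```python
class RefCountedHeap:
    """Toy heap that tracks a reference count per address (hypothetical API)."""

    def __init__(self):
        self.counts = {}      # address -> number of references
        self.collected = []   # addresses reclaimed so far

    def allocate(self, address):
        # A freshly allocated object starts with the one reference that holds it.
        self.counts[address] = 1
        return address

    def add_ref(self, address):
        self.counts[address] += 1

    def drop_ref(self, address):
        self.counts[address] -= 1
        if self.counts[address] == 0:
            # No scan needed: the count reaching zero is the signal.
            del self.counts[address]
            self.collected.append(address)


heap = RefCountedHeap()
foo = heap.allocate(42)   # SomeObject foo = new SomeObject();  count is 1
bar = foo
heap.add_ref(bar)         # SomeObject bar = foo;               count is 2
heap.drop_ref(foo)        # foo = null;                         count is 1
count_after_foo_cleared = heap.counts[42]
heap.drop_ref(bar)        # bar = null;                         count is 0, collected
```

All the work happens at the moment a reference is created or dropped, which is why nothing ever has to be scanned.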

What you are probably thinking about is an extremely simplistic implementation of a so-called "Tracing Garbage Collector", namely the Mark-and-Sweep GC.

Any Tracing GC starts off with a set of objects that it knows are always reachable. This is called the root set. The root set typically includes all global variables, the local variables, CPU registers, the stack, and some other stuff. For all of these objects, the GC looks at the instance variables and checks the objects that the instance variables point to. Then it checks those objects' instance variables, and so on and so forth.

This way, the GC "sees" all "live" objects, i.e. the objects that are reachable from the root set. What the GC does with those "live" objects depends on the kind of GC.

As I mentioned above, what you are thinking of is the most simplistic kind of Tracing GC, which is the Mark-and-Sweep GC. During the tracing phase I described above, the GC will "mark" all live objects by either setting a flag directly in the object header itself, or in a separate data structure.

Then, the GC will indeed "scan" the entire memory and find all objects and do one of two things:

  • If the object is marked, remove the mark.
  • If the object is unmarked, de-allocate the object.

After this "sweep" phase, you end up with all unreachable objects destroyed and all live objects unmarked, ready for the next "mark" phase.
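The two phases can be sketched on a toy heap as follows. This is only an illustration under simplifying assumptions: the `Obj` class, its `refs` list standing in for instance variables, and the `mark` and `sweep` functions are hypothetical, and the "heap" is just a Python list of objects.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.refs = []       # outgoing references ("instance variables")
        self.marked = False  # mark flag kept in the object header


def mark(roots):
    # Trace from the root set, marking everything reachable.
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if not obj.marked:
            obj.marked = True
            stack.extend(obj.refs)


def sweep(heap):
    # Visit every object in the heap: unmark survivors, drop the rest.
    survivors = []
    for obj in heap:
        if obj.marked:
            obj.marked = False
            survivors.append(obj)
    return survivors


a, b, c, d = Obj("a"), Obj("b"), Obj("c"), Obj("d")
a.refs.append(b)   # a -> b
c.refs.append(d)   # c -> d, but nothing reachable points at c
d.refs.append(c)   # a cycle of garbage
heap = [a, b, c, d]

mark(roots=[a])
heap = sweep(heap)
live_names = [obj.name for obj in heap]
```

Note that `c` and `d` reference each other and are still reclaimed, because neither is reachable from the root set. The `sweep` loop is the part that touches every object on the heap.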

But, as I mentioned, this is only one of many different kinds of Tracing GCs, and is a very simple one with many disadvantages. The two major disadvantages are that scanning the entire memory is expensive and leaving the live objects where they are and only collecting the dead objects in between leads to memory fragmentation.

Another very simple but much faster Tracing GC is Henry Baker's Semi-Space GC. The Semi-Space GC "wastes" half of the available memory, but gains a lot of performance for it. The way the Semi-Space GC works is that it divides the available memory into two halves, let's call them A and B. Only one of the two halves is active at any one time, meaning new objects only get allocated in one of the two halves.

We start out with half A: The GC "traces" the "live" objects just as described above, but instead of "marking" them, it copies them to half B. That way, once the "tracing" phase is done, there is no need to scan the entire memory anymore. We know that all live objects are in half B, so we can simply forget about half A completely. From now on, all new objects are allocated in half B. Until the next garbage collection cycle, when all live objects are copied to half A, and we forget about half B.
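A rough sketch of the copying step, under the assumption that each half is modelled as a plain Python list: the `Obj` class, the `copy_collect` function, and the `forwarding` table are hypothetical names for illustration. A real collector copies raw memory and leaves forwarding pointers in the old objects instead.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.refs = []


def copy_collect(roots, to_space):
    """Copy everything reachable from roots into to_space; return the new roots."""
    forwarding = {}  # id(old object) -> its copy in to_space

    def copy(obj):
        if id(obj) not in forwarding:
            clone = Obj(obj.name)
            forwarding[id(obj)] = clone   # record first, so cycles terminate
            to_space.append(clone)
            clone.refs = [copy(ref) for ref in obj.refs]
        return forwarding[id(obj)]

    return [copy(root) for root in roots]


a, b, c = Obj("a"), Obj("b"), Obj("c")
a.refs.append(b)
b.refs.append(a)          # cycle among live objects
half_a = [a, b, c]        # c is unreachable garbage
half_b = []

roots = copy_collect([a], half_b)
half_a.clear()            # "forget" half A wholesale; no sweep over it
names_in_b = [obj.name for obj in half_b]
```

The garbage object `c` is never visited at all; it disappears when half A is discarded.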

These are just two examples of Tracing GCs, and only one of those two scans the entire memory.

If it reads the entire system memory, does it mean it is reading memory locations that are used by other applications? I understand that this does not make much sense security or performance wise.

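This is simply impossible under normal circumstances. No modern Operating System allows a process to read another process's memory unless that access has been set up explicitly, for example through shared memory or a debugging interface. (And when I say "modern", I mean "since the 1960s or so".)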

But even if it were possible, it would not make sense. If the memory belongs to another process, then the GC has no idea what the objects that are in this memory even look like. But it needs to know what the objects look like in order to find all the instance variables and to know how to interpret those references. If it uses an internal marker flag inside the object itself, it also needs to know how to find that marker flag and how to set it. And that is assuming that the marker flag is even there at all! What happens if the application that owns that memory doesn't use marker flags?

Or, worse: what happens if the application that owns that memory does use a GC which uses marker flags? Now one GC is overwriting the other GC's markers!

If garbage collector only reads the memory that is allocated to it, how does it mark those areas?

There are two popular approaches.

The first approach is that there is a flag in the object header of each object reserved for marking. During the "mark" phase, the GC sets this flag. The major advantage of this approach is that there is no separate bookkeeping involved and it is thus very simple: the mark is right there on the object itself. The major disadvantage is that objects are scattered all through the memory, and thus during the marking phase, the GC writes all over the entire memory. This means that there are "dirty" pages all over memory. In a multiprocessor system (which almost all systems are nowadays), it also means that we have to notify the other CPU cores that we have modified some memory, that we have polluted the cache with tons of writes that we will never need again, and so on.

The alternative is to keep a separate data structure where we keep a table of all marked objects. This has the disadvantage of more bookkeeping (we need to keep a relationship between the mark table and the objects) but it has the major advantage that we are only writing to one place in memory, which means we can keep this one piece of data in the cache all the time.
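One common shape for such a separate structure is a bitmap with one bit per heap slot. The sketch below is an assumption-laden illustration: the `MarkBitmap` class, the fixed `slot_size`, and the idea that every object starts on a slot boundary are all hypothetical simplifications.

```python
class MarkBitmap:
    """Side table with one mark bit per fixed-size heap slot (hypothetical layout)."""

    def __init__(self, heap_size, slot_size):
        self.slot_size = slot_size
        n_slots = heap_size // slot_size
        self.bits = bytearray((n_slots + 7) // 8)

    def mark(self, address):
        slot = address // self.slot_size
        self.bits[slot // 8] |= 1 << (slot % 8)

    def is_marked(self, address):
        slot = address // self.slot_size
        return bool(self.bits[slot // 8] & (1 << (slot % 8)))

    def clear(self):
        # Resetting all marks touches only the bitmap, never the objects.
        for i in range(len(self.bits)):
            self.bits[i] = 0


bitmap = MarkBitmap(heap_size=1024, slot_size=16)
bitmap.mark(0)
bitmap.mark(160)
table_bytes = len(bitmap.bits)
```

With these assumed sizes, a 1024-byte heap needs only 8 bytes of mark bits, which is why the whole table can stay in cache while marking.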

But again, not all GCs even have a concept of "marking" at all.

Jörg W Mittag
14

There are a lot of details in garbage collection. Lots lots lots lots lots. Each garbage collector behaves differently, so it's hard to cover everything.

Garbage collectors almost universally scan the memory within a single process. This is for two natural reasons. One is that garbage collection is always closely tied to how the memory is laid out, and thus language-specific. The second is that the OS won't let you read someone else's memory anyway (without special steps).

The most naïve solutions for garbage collection do indeed look at everything in the process's memory, but that's obviously slow. Faster solutions do exist. The most common is a "generational" garbage collector. It has been observed that most objects have a very short lifespan, so putting young objects in a "nursery" lets you quickly reclaim memory as efficiently as possible. With a generational garbage collector, one only scans all of memory when one has to (when doing the shorter cheaper scan isn't sufficient).
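A minor (nursery-only) collection might look roughly like the sketch below. Everything here is a hypothetical simplification: the `Obj` class, the `minor_collect` function, and in particular the `remembered` list, which is assumed to contain every old object that references a young one.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.refs = []


def minor_collect(roots, remembered, nursery, old_generation):
    """Promote nursery objects that are still reachable; drop the rest."""
    young = {id(o) for o in nursery}
    seen = set()
    stack = list(roots)
    for old in remembered:
        # Old objects known to point into the nursery act as extra roots.
        stack.extend(old.refs)
    while stack:
        obj = stack.pop()
        # Only young objects are traced; the old generation is not scanned.
        if id(obj) in young and id(obj) not in seen:
            seen.add(id(obj))
            stack.extend(obj.refs)
    survivors = [o for o in nursery if id(o) in seen]
    old_generation.extend(survivors)
    nursery.clear()
    return survivors


old_gen = [Obj("config")]
tmp1, tmp2, keeper, cached = Obj("tmp1"), Obj("tmp2"), Obj("keeper"), Obj("cached")
old_gen[0].refs.append(cached)      # an old -> young reference
nursery = [tmp1, tmp2, keeper, cached]

promoted = minor_collect([keeper], [old_gen[0]], nursery, old_gen)
promoted_names = [o.name for o in promoted]
```

The short-lived `tmp1` and `tmp2` are reclaimed without the collector ever walking the old generation.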

There are also lots of clever tricks to avoid scanning extra memory. One of them is "card marking." The heap is divided into fixed-size chunks called "cards," and the runtime keeps a "card table," which is just an array in memory with one entry per card. Whenever a reference is updated, the runtime sets the entry for the card that contains the updated location. During a garbage collection, only cards that have been marked could possibly have changed what they were referring to, so you can avoid scanning all of the data managed by the unmarked cards.
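A minimal sketch of the idea, assuming a card covers 512 bytes of heap and that the runtime calls a write barrier on every reference store. The `CardTable` class, `write_barrier`, `dirty_ranges`, and the `CARD_SIZE` value are hypothetical choices for illustration.

```python
CARD_SIZE = 512  # bytes of heap covered by one card (assumed value)


class CardTable:
    def __init__(self, heap_size):
        # One flag per card, rounded up to cover the whole heap.
        self.dirty = bytearray((heap_size + CARD_SIZE - 1) // CARD_SIZE)

    def write_barrier(self, address):
        # Called on every reference store: flag the card covering the written slot.
        self.dirty[address // CARD_SIZE] = 1

    def dirty_ranges(self):
        # Only these address ranges need to be rescanned at the next collection.
        return [(i * CARD_SIZE, (i + 1) * CARD_SIZE)
                for i, flag in enumerate(self.dirty) if flag]


cards = CardTable(heap_size=4096)
cards.write_barrier(100)
cards.write_barrier(130)    # same card as address 100
cards.write_barrier(2050)
ranges = cards.dirty_ranges()
```

Here three writes dirty only two of the eight cards, so the collector can skip the other six.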

But don't think those are the only approaches. We've come up with countless different ways to do garbage collection, each with their own pros and cons.

Cort Ammon
10

No garbage collector scans the entire memory, because when reading some random memory content, it is not possible to guess whether it is in use or not. So garbage collection goes through objects or lists of objects that were previously allocated and created, to see if they are still used.

Most of the GC algorithms belong to the more general family of marking algorithms: very naively, objects referenced in variables that are still alive are marked as used. The marking is then propagated to all the related objects, and so on until there is no new object to be marked. The unmarked objects can then be discarded. The marking usually takes the form of a flag or a counter in some technical data that describes each object. The marking and cleaning can be done at once, or incrementally, or at every referencing/dereferencing.

But this is very simplified, and a book would not be sufficient to describe all the existing object management strategies and garbage collection algorithms (not to speak of the more than 700 patents on the topic, e.g. this one). But here is a compact overview already.

Christophe
9

Top-level, GC is specific to the application. It doesn't really make much sense for a Python application to be scanning a Java application's memory space, even if it was allowed to do so. Also, not all memory in a system is 'heap'.

Broadly, while there are many different approaches and strategies to garbage collection, we can generalize. Without getting into the weeds of GC algorithms, the idea that it's scanning even the entire memory space that it's allocated is a little off. In a mark-sweep-compact, the approach is to start from 'roots' that are inherent to the application and look at everything that they point to and then everything those point to, and everything those point to, etc. until all 'live' objects are found. That's the 'marking' phase. Everything that's not marked is considered garbage. The live objects are then moved around in a way similar to defragmenting a disk drive and the remaining heap is then considered free space.
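The "defragmenting" step can be sketched as sliding live cells toward the start of the heap. This is a toy model under stated assumptions: the heap is a Python list with one object per cell, the set of live indices is taken as already computed by the marking phase, and `compact` is a hypothetical name.

```python
def compact(heap, live):
    """Slide live cells to the front of heap; return (forwarding table, free index)."""
    forwarding = {}   # old index -> new index, needed to fix up references
    free = 0
    for old_index, cell in enumerate(heap):
        if old_index in live:
            forwarding[old_index] = free
            heap[free] = cell     # safe in place: free never exceeds old_index
            free += 1
    for i in range(free, len(heap)):
        heap[i] = None  # everything past `free` is one contiguous free block
    return forwarding, free


heap = ["A", "garbage", "B", "garbage", "C"]
forwarding, free = compact(heap, live={0, 2, 4})
```

A real collector would also use the forwarding table to rewrite every reference that pointed at a moved object; that step is omitted here.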

In a way, I think the term garbage-collection is a bit of a misnomer. The point of GC is really about keeping track of what is still in use, not really finding garbage, or 'collecting' it. A good example of this is the copy-collector approach. In that design, you keep 2 (or more) spaces where new objects (structures, whatever) are allocated. When one of the spaces fills up, the live objects are identified (see paragraph above) and copied to the other 'side', which becomes the new area for allocation. Nothing specific is done with the 'garbage'; it is simply forgotten and the memory that it occupied is used for new allocations. There are some minor caveats to that, but at a high level, there's really no 'collecting' of unused memory; it's just being tracked as available for use.

JimmyJames supports Canada
0

Your program, including the garbage collector, will not be able to access memory belonging to other processes unless you have a totally ancient OS. Plus the garbage collector has no reason at all to look at the memory of other processes.

In a primitive garbage collector, every chunk of memory is identified as “free”, “examined” or “used”. The garbage collector marks everything that isn’t “free” as “examined”, then systematically turns memory that can be accessed from “examined” to “used”, and when there is nothing left to mark “used”, everything “examined” is marked “free”.

Now a garbage collector is used a lot, for example by any Java program, so it is a prime target for optimisation, and implementers will use any trick in the book to make it run faster. Not visiting all memory if avoidable would help, so a good garbage collector will try to avoid it.

gnasher729
0

The garbage collector needs to know which parts of memory are pointers (and how they are encoded), and what spans of the address space are effectively allocated. That information is known only to the application or its virtual machine.

Therefore, a GC can't scan the whole memory. If the GC is VM-based, it could effectively span multiple applications, provided they share a common address space. For security reasons, this is rarely the case.

As for marking, there are multiple ways to achieve that (bitmap, position-size table, etc). It depends on the GC implementation and the memory-manager of the language or VM.

0

A garbage collector only needs to scan the memory which has been actively allocated by the program. It does not need to scan unused memory or memory used by other processes on the system.

When heap memory is allocated, e.g. by malloc in C or new Foo() in Java, the language runtime keeps track of which memory areas have been allocated. In its simplest form, it just keeps a list of the memory regions with start and end addresses. (Ideally there is only a single region which grows when new memory is allocated, but since memory can be freed in the middle of a region, you end up with multiple regions over time.)
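In its simplest form, that list of regions could look something like the sketch below. The `RegionTable` class and its methods are hypothetical, and the caller supplies addresses directly, which a real allocator would choose itself.

```python
class RegionTable:
    """Tracks allocated regions as sorted, half-open (start, end) address pairs."""

    def __init__(self):
        self.regions = []

    def allocate(self, start, size):
        self.regions.append((start, start + size))
        self.regions.sort()

    def free(self, start):
        # Drop the region that begins at this address.
        self.regions = [r for r in self.regions if r[0] != start]

    def contains(self, address):
        # A collector could use this to tell whether a value might point into the heap.
        return any(start <= address < end for start, end in self.regions)


table = RegionTable()
table.allocate(0, 64)
table.allocate(64, 32)
table.allocate(96, 128)
table.free(64)            # freeing in the middle leaves two separate regions
remaining = table.regions
```

Anything outside the recorded regions is simply never looked at.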

A mark-and-sweep garbage collector scans the stack for pointers and then recursively follows these pointers and marks all objects it meets. Then it scans all objects in the allocated memory, and if some allocated objects have not been marked, the destructor is invoked and the memory is freed.

So even a naive mark-and-sweep garbage collector only needs to scan the memory which has been actively allocated by the program at some point.

JacquesB