# Static analysis to optimize GC I'm trying to flesh out a vague idea that keeps coming to mind. The basic motivation is to eliminate some or all of GC related overhead that's incurred even on regions of code that meticulously ensure not to use any GC allocation. Part of Zisp's design will try to offer various memory management strategies to the programmer, defaulting to GC only when code is written in a "don't care" style. This brings us to the issue of "what if other parts of the code base, like a library we're using, allocates things on the GC heap." To my understanding, there are two intertwined categories of overhead incurred on code "globally" even if only some parts of the codebase use GC allocations: 1. The need to insert safepoints. This typically means an additional instruction or two which gets inserted into the start of every function body or call, and the start of every loop body or back jump. (Fun fact: I believe under full TCO and CPS these are one and the same thing.) If you use OS signals to stop threads, then they don't need these safepoint instructions, but it introduces other issues, like the possibility that a GC pointer has landed within some part of a SIMD register or something. 2. The need to actually stop a thread and scan it, when it was busy doing some super high intensity hot loop stuff. (This doesn't just apply to a design idea like that of Zisp where you can take control over memory management; it also applies to codebases where some functions do allocation-free but computationally heavy work, and thus would ideally be free of GC overhead.) The reason that even a function which *itself* doesn't deal with any GC pointers needs safepoints to be inserted, is simple; consider the following trivial Scheme-like example: ```scheme (define (main argv) (print-prime (vector-ref argv 0))) (define (print-prime n) (define label (sprintf "calculating prime %d" n)) (printf "[START] %s\n" label) (printf "%d\n" (calculate-prime n)) (printf "[DONE] %s\n" label)) (define (calculate-prime n) (left as an exercise to the reader) (but there are no heap allocations here)) ``` The `label` string will be allocated on the heap, and although some escape analysis and such shenanigans could in theory transform this into a non-GC object given that its lifetime is obviously limited to one function, assume for this demonstration that there's some reason such optimizations cannot be applied, so `label` is truly on the GC heap throughout the duration of the call to `calculate-prime`. (Maybe this is part of a big codebase, with its own thread metadata and monitoring system, and a pointer to `label` ends up in a larger metadata object, saved in a global "current tasks" table. Maybe the pointer is mutable too, so the thread can dynamically swap the label to indicate the current sub-task it's busy with.) This means that, in the prime calculation loop, the code needs to periodically check if the GC is waiting for the thread to yield, so frames can safely be scanned for GC pointers. The GC would then find the pointer to `label` in the stack frame of the main function, and the stack frame of `calculate-prime` would be scanned needlessly, although that can be fixed with auxiliary stack frame metadata if I understand correctly. (Created at compile-time and stored in DWARF data or something like that.) Let's try to fix this. We want the prime calculation to be free of safepoints, without causing the GC a massive headache. Typical cases of headache induced on the GC, that we need to avoid, include: * Causing the GC to stall until the no-GC function (prime calculation in this case) is finished, i.e., enormous latency on other threads. * Needing to use OS signals to forcibly stop the thread because it doesn't use safepoint "polling" instructions (called async preempt in Golang lingo I believe) which could also mean needing to be conservative in scanning the currently active stack frame of the thread, since the no-GC function may be in the midst of some SIMD register gymnastics or whatever. ## The idea The strategy I'm about to describe could be applied to arbitrarily small sub-regions of code, I think, but to prevent myself from going insane I will treat functions as indivisible "code region" units. This means we can treat "may have local pointers to GC heap" as part of the static type signature of a function. That is, whether the stack or registers, during execution of the function, may contain pointers into the GC heap. Detecting this on a per-function basis should not be difficult; just be reasonably conservative and check whether there are any operations within the function body whose result type is a type with some kind of "GC heap" trait. I need to think more about this, but I'm pretty sure it's a solved issue since the invention of precise GC. Now what our static analysis needs to do, first of all, is to map out the call graph of a thread as a DAG. Each node in the DAG is marked as "GC" or "no GC". There are a number of things that can happen now. First, if the entire graph is no-GC then the entire thread can be exempted from GC work. The GC won't ever bother to pause it, be it via OS signals or by setting the flag that triggers safepoint polls (assuming the functions still have those). Second, analysis of an entire program could reveal that certain no-GC functions are only ever called from this thread, which would mean that no safepoint instructions need to be compiled into these functions! If a function is used both by a pure-no-GC thread and a GC thread, it may even be treated as "polymorphic" with respect to GC and it could be duplicated ("monomorphized") into a GC and no-GC version. Probably overkill, but an idea worth keeping in mind. Where things get most exciting, though, is making use of information about partial GC vs. no-GC *regions* of the call graph. Especially when it's the tail that's no-GC; wait for it. ### No-GC head of call graph When the head of the call graph is no-GC, it means that the thread's registration with the GC can be deferred to the point at which it first enters a GC region of the call graph. If it happens to be that the thread first does some heavy computation free of allocations, then enters a phase in which it reports its results or something (which may include string allocations, for example, or some heap object to wrap the result) then this could be a bountiful optimization. There may however be the risk of the thread repeatedly exiting and re-entering the first GC region of the call graph, in which case there would be some repeated overhead related to checking if it already registered itself with the GC. I think it can be a good thing for a programming language to expose some implementation details and allow the programmer to take control over certain decisions which one would traditionally expect to be automatic. Here, it could make sense for the programmer to be able determine whether this optimization is applied to a thread or not. ### Deregistering a thread from GC You may think that functions which only appear in the initial no-GC heads of call graphs would be eligible for safepoint elimination (SE), but this doesn't necessarily work. Once a thread is registered, the GC may at any point in the future set the flag that tells the thread to stop for GC, and then there needs to be safepoint polling even if the thread returned back to the no-GC head of its call graph. To solve that, threads would need to be able to not only register but also deregister themselves dynamically, when they return to a no-GC region of the call graph. This once again imposes the risk of rapid repeated registration and deregistration if the thread happens to constantly move between those regions of the call graph, so again it may be a good idea to give programmers some decision power here. This brings us to the following insight... ### Coloring the call graph for GC registration state Whether a function is eligible for total safepoint removal is not a matter of what regions of a mixed-GC call graph it appears in (be it head, tail, or in between) but rather a question of whether the function may ever be called at a point in time in which a thread is currently registered for GC. Assuming we do have some mechanism to deregister threads, and this is an operation that can be interjected into node transitions in the call graph (i.e., calls and returns), we can effectively map out a second type of "sectioning" of the call graph into regions; namely, regions in which the thread is or isn't registered with GC. In the most naive and extreme implementation, these registration and deregistration operations could be inserted into every single point where the thread transitions between a GC and no-GC call graph region, which would make the two sectioning types equivalent. That is, the "coloring" of the call graph telling you which nodes are functions that may hold GC pointers in their local variables becomes equivalent to the coloring of the call graph telling you which nodes represent a possible point in the thread's execution during which it's registered with the GC. Through either heuristics or programmer control, it would make sense to try and create a "GC registration coloring" of the call graph that will cause relatively few register/deregister operations at run time, while reaping significant efficiency gains from having the thread be exempt from GC during certain points in execution. Oh and: A function now becomes eligible for SE optimization if it only appears in GC-unregistered portions of thread call graphs; this is a generalization of the previous idea of performing SE on functions that are only called from pure-no-GC threads. ### But deregistration can be very expensive OK, there's one issue: If you deregister a thread while there's GC frames within the stack, you obviously can't just let the GC ignore those by not checking this thread's stack. Instead, deregistration would need to run a thread-local mark phase of the GC and save all pointers it found elsewhere. This *may* still be worth it in some cases, but means that it becomes all the more important to minimize this type of deregistration that happens in deeper parts of the call graph, spanning over GC regions. This brings us to the final part. ### Barrier-free concurrent GC during no-GC stack tails This is the idea I'm most proud of because it's the most crazy: Imagine there's a tail of the call graph that's no-GC colored. (This is with respect to whether they can hold GC pointers in their locals; not the other coloring that's about registration state.) Further, imagine that all *other* uses of the functions, that make up this region of the call graph, make them eligible for SE. To ensure they remain eligible for SE, the compiler would normally need to insert a deregistration point of the very expensive kind with a thread-local mark phase. But what if... Instead of adding a deregistration point, we do a special thing: 1. The thread informs the GC that it's entering a no-GC tail region. 2. A special type of safepoint is added to the *return point* of the no-GC region; not one that checks for the GC flag saying "hey I'm waiting for threads to stop" but rather a special one saying "the GC is **currently scanning your stack**". So long as the thread is within the no-GC tail of the call graph, the GC is free to simply start scanning its stack, up to the point where the no-GC region begins. If the no-GC region returns while this is happening, it hits the special type of safepoint mentioned above, which simply makes it pause itself, after clearing the flag saying "I'm in a no-GC region" so that the GC will know what happened and resume it once it's done. This retains the SE eligibility of the functions and doesn't require the expensive deregistration. Instead, it allows a limited form of barrier-free concurrent mark phase under special conditions, ensuring that any potential hot loops in tail regions of a call graph can run completely uninhibited and without safe-points. Did I invent something new?? How useful would this actually be?? Or did I goof up, forgot about some thorny detailed, and described something that wouldn't even work in practice?