From 4b9213d9addd6315de73747a0fdd7e996ba2826a Mon Sep 17 00:00:00 2001 From: Taylan Kammer Date: Sun, 24 May 2026 16:22:41 +0200 Subject: Add a note. --- notes/260524-gc-opt.md | 258 +++++++++++++++++++++++++++++++++++++++++++++++++ notes/index.md | 1 + 2 files changed, 259 insertions(+) create mode 100644 notes/260524-gc-opt.md (limited to 'notes') diff --git a/notes/260524-gc-opt.md b/notes/260524-gc-opt.md new file mode 100644 index 0000000..6687520 --- /dev/null +++ b/notes/260524-gc-opt.md @@ -0,0 +1,258 @@ +# Static analysis to optimize GC + +I'm trying to flesh out a vague idea that keeps coming to mind. The +basic motivation is to eliminate some or all of GC related overhead +that's incurred even on regions of code that meticulously ensure not +to use any GC allocation. Part of Zisp's design will try to offer +various memory management strategies to the programmer, defaulting to +GC only when code is written in a "don't care" style. This brings us +to the issue of "what if other parts of the code base, like a library +we're using, allocates things on the GC heap." + +To my understanding, there are two intertwined categories of overhead +incurred on code "globally" even if only some parts of the codebase +use GC allocations: + +1. The need to insert safepoints. This typically means an additional + instruction or two which gets inserted into the start of every + function body or call, and the start of every loop body or back + jump. (Fun fact: I believe under full TCO and CPS these are one + and the same thing.) If you use OS signals to stop threads, then + they don't need these safepoint instructions, but it introduces + other issues, like the possibility that a GC pointer has landed + within some part of a SIMD register or something. + +2. The need to actually stop a thread and scan it, when it was busy + doing some super high intensity hot loop stuff. + +(This doesn't just apply to a design idea like that of Zisp where you +can take control over memory management; it also applies to codebases +where some functions do allocation-free but computationally heavy +work, and thus would ideally be free of GC overhead.) + +The reason that even a function which *itself* doesn't deal with any +GC pointers needs safepoints to be inserted, is simple; consider the +following trivial Scheme-like example: + +```scheme +(define (main argv) + (print-prime (vector-ref argv 0))) + +(define (print-prime n) + (define label (sprintf "calculating prime %d" n)) + (printf "[START] %s\n" label) + (printf "%d\n" (calculate-prime n)) + (printf "[DONE] %s\n" label)) + +(define (calculate-prime n) + (left as an exercise to the reader) + (but there are no heap allocations here)) +``` + +The `label` string will be allocated on the heap, and although some +escape analysis and such shenanigans could in theory transform this +into a non-GC object given that its lifetime is obviously limited to +one function, assume for this demonstration that there's some reason +such optimizations cannot be applied, so `label` is truly on the GC +heap throughout the duration of the call to `calculate-prime`. + +(Maybe this is part of a big codebase, with its own thread metadata +and monitoring system, and a pointer to `label` ends up in a larger +metadata object, saved in a global "current tasks" table. Maybe the +pointer is mutable too, so the thread can dynamically swap the label +to indicate the current sub-task it's busy with.) + +This means that, in the prime calculation loop, the code needs to +periodically check if the GC is waiting for the thread to yield, so +frames can safely be scanned for GC pointers. The GC would then find +the pointer to `label` in the stack frame of the main function, and +the stack frame of `calculate-prime` would be scanned needlessly, +although that can be fixed with auxiliary stack frame metadata if I +understand correctly. (Created at compile-time and stored in DWARF +data or something like that.) + +Let's try to fix this. We want the prime calculation to be free of +safepoints, without causing the GC a massive headache. Typical cases +of headache induced on the GC, that we need to avoid, include: + +* Causing the GC to stall until the no-GC function (prime calculation + in this case) is finished, i.e., enormous latency on other threads. + +* Needing to use OS signals to forcibly stop the thread because it + doesn't use safepoint "polling" instructions (called async preempt + in Golang lingo I believe) which could also mean needing to be + conservative in scanning the currently active stack frame of the + thread, since the no-GC function may be in the midst of some SIMD + register gymnastics or whatever. + +## The idea + +The strategy I'm about to describe could be applied to arbitrarily +small sub-regions of code, I think, but to prevent myself from going +insane I will treat functions as indivisible "code region" units. + +This means we can treat "may have local pointers to GC heap" as part +of the static type signature of a function. That is, whether the +stack or registers, during execution of the function, may contain +pointers into the GC heap. + +Detecting this on a per-function basis should not be difficult; just +be reasonably conservative and check whether there are any operations +within the function body whose result type is a type with some kind of +"GC heap" trait. I need to think more about this, but I'm pretty sure +it's a solved issue since the invention of precise GC. + +Now what our static analysis needs to do, first of all, is to map out +the call graph of a thread as a DAG. Each node in the DAG is marked +as "GC" or "no GC". + +There are a number of things that can happen now. + +First, if the entire graph is no-GC then the entire thread can be +exempted from GC work. The GC won't ever bother to pause it, be it +via OS signals or by setting the flag that triggers safepoint polls +(assuming the functions still have those). + +Second, analysis of an entire program could reveal that certain no-GC +functions are only ever called from this thread, which would mean that +no safepoint instructions need to be compiled into these functions! + +If a function is used both by a pure-no-GC thread and a GC thread, it +may even be treated as "polymorphic" with respect to GC and it could +be duplicated ("monomorphized") into a GC and no-GC version. Probably +overkill, but an idea worth keeping in mind. + +Where things get most exciting, though, is making use of information +about partial GC vs. no-GC *regions* of the call graph. Especially +when it's the tail that's no-GC; wait for it. + +### No-GC head of call graph + +When the head of the call graph is no-GC, it means that the thread's +registration with the GC can be deferred to the point at which it +first enters a GC region of the call graph. If it happens to be that +the thread first does some heavy computation free of allocations, then +enters a phase in which it reports its results or something (which may +include string allocations, for example, or some heap object to wrap +the result) then this could be a bountiful optimization. There may +however be the risk of the thread repeatedly exiting and re-entering +the first GC region of the call graph, in which case there would be +some repeated overhead related to checking if it already registered +itself with the GC. + +I think it can be a good thing for a programming language to expose +some implementation details and allow the programmer to take control +over certain decisions which one would traditionally expect to be +automatic. Here, it could make sense for the programmer to be able +determine whether this optimization is applied to a thread or not. + +### Deregistering a thread from GC + +You may think that functions which only appear in the initial no-GC +heads of call graphs would be eligible for safepoint elimination (SE), +but this doesn't necessarily work. Once a thread is registered, the +GC may at any point in the future set the flag that tells the thread +to stop for GC, and then there needs to be safepoint polling even if +the thread returned back to the no-GC head of its call graph. + +To solve that, threads would need to be able to not only register but +also deregister themselves dynamically, when they return to a no-GC +region of the call graph. This once again imposes the risk of rapid +repeated registration and deregistration if the thread happens to +constantly move between those regions of the call graph, so again it +may be a good idea to give programmers some decision power here. + +This brings us to the following insight... + +### Coloring the call graph for GC registration state + +Whether a function is eligible for total safepoint removal is not a +matter of what regions of a mixed-GC call graph it appears in (be it +head, tail, or in between) but rather a question of whether the +function may ever be called at a point in time in which a thread is +currently registered for GC. Assuming we do have some mechanism to +deregister threads, and this is an operation that can be interjected +into node transitions in the call graph (i.e., calls and returns), we +can effectively map out a second type of "sectioning" of the call +graph into regions; namely, regions in which the thread is or isn't +registered with GC. + +In the most naive and extreme implementation, these registration and +deregistration operations could be inserted into every single point +where the thread transitions between a GC and no-GC call graph region, +which would make the two sectioning types equivalent. That is, the +"coloring" of the call graph telling you which nodes are functions +that may hold GC pointers in their local variables becomes equivalent +to the coloring of the call graph telling you which nodes represent a +possible point in the thread's execution during which it's registered +with the GC. + +Through either heuristics or programmer control, it would make sense +to try and create a "GC registration coloring" of the call graph that +will cause relatively few register/deregister operations at run time, +while reaping significant efficiency gains from having the thread be +exempt from GC during certain points in execution. + +Oh and: A function now becomes eligible for SE optimization if it only +appears in GC-unregistered portions of thread call graphs; this is a +generalization of the previous idea of performing SE on functions that +are only called from pure-no-GC threads. + +### But deregistration can be very expensive + +OK, there's one issue: If you deregister a thread while there's GC +frames within the stack, you obviously can't just let the GC ignore +those by not checking this thread's stack. Instead, deregistration +would need to run a thread-local mark phase of the GC and save all +pointers it found elsewhere. + +This *may* still be worth it in some cases, but means that it becomes +all the more important to minimize this type of deregistration that +happens in deeper parts of the call graph, spanning over GC regions. + +This brings us to the final part. + +### Barrier-free concurrent GC during no-GC stack tails + +This is the idea I'm most proud of because it's the most crazy: + +Imagine there's a tail of the call graph that's no-GC colored. (This +is with respect to whether they can hold GC pointers in their locals; +not the other coloring that's about registration state.) + +Further, imagine that all *other* uses of the functions, that make up +this region of the call graph, make them eligible for SE. + +To ensure they remain eligible for SE, the compiler would normally +need to insert a deregistration point of the very expensive kind with +a thread-local mark phase. + +But what if... + +Instead of adding a deregistration point, we do a special thing: + +1. The thread informs the GC that it's entering a no-GC tail region. + +2. A special type of safepoint is added to the *return point* of the + no-GC region; not one that checks for the GC flag saying "hey I'm + waiting for threads to stop" but rather a special one saying "the + GC is **currently scanning your stack**". + +So long as the thread is within the no-GC tail of the call graph, the +GC is free to simply start scanning its stack, up to the point where +the no-GC region begins. If the no-GC region returns while this is +happening, it hits the special type of safepoint mentioned above, +which simply makes it pause itself, after clearing the flag saying +"I'm in a no-GC region" so that the GC will know what happened and +resume it once it's done. + +This retains the SE eligibility of the functions and doesn't require +the expensive deregistration. Instead, it allows a limited form of +barrier-free concurrent mark phase under special conditions, ensuring +that any potential hot loops in tail regions of a call graph can run +completely uninhibited and without safe-points. + +Did I invent something new?? How useful would this actually be?? + +Or did I goof up, forgot about some thorny detailed, and described +something that wouldn't even work in practice? diff --git a/notes/index.md b/notes/index.md index afc627d..f9e4f19 100644 --- a/notes/index.md +++ b/notes/index.md @@ -29,3 +29,4 @@ * [What not to intern?](260109-uninterned.html) * [JOKE RANT](260516-joke-rant.html) * [Interpreter](260522-interpreter.html) +* [Static analysis to optimize GC](260524-gc-opt.html) -- cgit v1.2.3