Add a note.

author: Taylan Kammer <taylan.kammer@gmail.com> 2026-05-24 16:22:41 +0200
committer: Taylan Kammer <taylan.kammer@gmail.com> 2026-05-24 16:22:41 +0200
commit: 4b9213d9addd6315de73747a0fdd7e996ba2826a (patch)
tree: 50a9d85fc43be14dc40cc5e551b62a97bc6739d6 /notes/260524-gc-opt.md
parent: 378f8598a5a57b731948241e41f584f5172dc2a2 (diff)
1 files changed, 258 insertions, 0 deletions
diff --git a/notes/260524-gc-opt.md b/notes/260524-gc-opt.md
new file mode 100644
index 0000000..6687520
--- /dev/null
+++ b/notes/260524-gc-opt.md
@@ -0,0 +1,258 @@
+# Static analysis to optimize GC
+
+I'm trying to flesh out a vague idea that keeps coming to mind.  The
+basic motivation is to eliminate some or all of GC related overhead
+that's incurred even on regions of code that meticulously ensure not
+to use any GC allocation.  Part of Zisp's design will try to offer
+various memory management strategies to the programmer, defaulting to
+GC only when code is written in a "don't care" style.  This brings us
+to the issue of "what if other parts of the code base, like a library
+we're using, allocates things on the GC heap."
+
+To my understanding, there are two intertwined categories of overhead
+incurred on code "globally" even if only some parts of the codebase
+use GC allocations:
+
+1. The need to insert safepoints.  This typically means an additional
+   instruction or two which gets inserted into the start of every
+   function body or call, and the start of every loop body or back
+   jump.  (Fun fact: I believe under full TCO and CPS these are one
+   and the same thing.)  If you use OS signals to stop threads, then
+   they don't need these safepoint instructions, but it introduces
+   other issues, like the possibility that a GC pointer has landed
+   within some part of a SIMD register or something.
+
+2. The need to actually stop a thread and scan it, when it was busy
+   doing some super high intensity hot loop stuff.
+
+(This doesn't just apply to a design idea like that of Zisp where you
+can take control over memory management; it also applies to codebases
+where some functions do allocation-free but computationally heavy
+work, and thus would ideally be free of GC overhead.)
+
+The reason that even a function which *itself* doesn't deal with any
+GC pointers needs safepoints to be inserted, is simple; consider the
+following trivial Scheme-like example:
+
+```scheme
+(define (main argv)
+  (print-prime (vector-ref argv 0)))
+
+(define (print-prime n)
+  (define label (sprintf "calculating prime %d" n))
+  (printf "[START] %s\n" label)
+  (printf "%d\n" (calculate-prime n))
+  (printf "[DONE] %s\n" label))
+
+(define (calculate-prime n)
+  (left as an exercise to the reader)
+  (but there are no heap allocations here))
+```
+
+The `label` string will be allocated on the heap, and although some
+escape analysis and such shenanigans could in theory transform this
+into a non-GC object given that its lifetime is obviously limited to
+one function, assume for this demonstration that there's some reason
+such optimizations cannot be applied, so `label` is truly on the GC
+heap throughout the duration of the call to `calculate-prime`.
+
+(Maybe this is part of a big codebase, with its own thread metadata
+and monitoring system, and a pointer to `label` ends up in a larger
+metadata object, saved in a global "current tasks" table.  Maybe the
+pointer is mutable too, so the thread can dynamically swap the label
+to indicate the current sub-task it's busy with.)
+
+This means that, in the prime calculation loop, the code needs to
+periodically check if the GC is waiting for the thread to yield, so
+frames can safely be scanned for GC pointers.  The GC would then find
+the pointer to `label` in the stack frame of the main function, and
+the stack frame of `calculate-prime` would be scanned needlessly,
+although that can be fixed with auxiliary stack frame metadata if I
+understand correctly.  (Created at compile-time and stored in DWARF
+data or something like that.)
+
+Let's try to fix this.  We want the prime calculation to be free of
+safepoints, without causing the GC a massive headache.  Typical cases
+of headache induced on the GC, that we need to avoid, include:
+
+* Causing the GC to stall until the no-GC function (prime calculation
+  in this case) is finished, i.e., enormous latency on other threads.
+
+* Needing to use OS signals to forcibly stop the thread because it
+  doesn't use safepoint "polling" instructions (called async preempt
+  in Golang lingo I believe) which could also mean needing to be
+  conservative in scanning the currently active stack frame of the
+  thread, since the no-GC function may be in the midst of some SIMD
+  register gymnastics or whatever.
+
+## The idea
+
+The strategy I'm about to describe could be applied to arbitrarily
+small sub-regions of code, I think, but to prevent myself from going
+insane I will treat functions as indivisible "code region" units.
+
+This means we can treat "may have local pointers to GC heap" as part
+of the static type signature of a function.  That is, whether the
+stack or registers, during execution of the function, may contain
+pointers into the GC heap.
+
+Detecting this on a per-function basis should not be difficult; just
+be reasonably conservative and check whether there are any operations
+within the function body whose result type is a type with some kind of
+"GC heap" trait.  I need to think more about this, but I'm pretty sure
+it's a solved issue since the invention of precise GC.
+
+Now what our static analysis needs to do, first of all, is to map out
+the call graph of a thread as a DAG.  Each node in the DAG is marked
+as "GC" or "no GC".
+
+There are a number of things that can happen now.
+
+First, if the entire graph is no-GC then the entire thread can be
+exempted from GC work.  The GC won't ever bother to pause it, be it
+via OS signals or by setting the flag that triggers safepoint polls
+(assuming the functions still have those).
+
+Second, analysis of an entire program could reveal that certain no-GC
+functions are only ever called from this thread, which would mean that
+no safepoint instructions need to be compiled into these functions!
+
+If a function is used both by a pure-no-GC thread and a GC thread, it
+may even be treated as "polymorphic" with respect to GC and it could
+be duplicated ("monomorphized") into a GC and no-GC version.  Probably
+overkill, but an idea worth keeping in mind.
+
+Where things get most exciting, though, is making use of information
+about partial GC vs. no-GC *regions* of the call graph.  Especially
+when it's the tail that's no-GC; wait for it.
+
+### No-GC head of call graph
+
+When the head of the call graph is no-GC, it means that the thread's
+registration with the GC can be deferred to the point at which it
+first enters a GC region of the call graph.  If it happens to be that
+the thread first does some heavy computation free of allocations, then
+enters a phase in which it reports its results or something (which may
+include string allocations, for example, or some heap object to wrap
+the result) then this could be a bountiful optimization.  There may
+however be the risk of the thread repeatedly exiting and re-entering
+the first GC region of the call graph, in which case there would be
+some repeated overhead related to checking if it already registered
+itself with the GC.
+
+I think it can be a good thing for a programming language to expose
+some implementation details and allow the programmer to take control
+over certain decisions which one would traditionally expect to be
+automatic.  Here, it could make sense for the programmer to be able
+determine whether this optimization is applied to a thread or not.
+
+### Deregistering a thread from GC
+
+You may think that functions which only appear in the initial no-GC
+heads of call graphs would be eligible for safepoint elimination (SE),
+but this doesn't necessarily work.  Once a thread is registered, the
+GC may at any point in the future set the flag that tells the thread
+to stop for GC, and then there needs to be safepoint polling even if
+the thread returned back to the no-GC head of its call graph.
+
+To solve that, threads would need to be able to not only register but
+also deregister themselves dynamically, when they return to a no-GC
+region of the call graph.  This once again imposes the risk of rapid
+repeated registration and deregistration if the thread happens to
+constantly move between those regions of the call graph, so again it
+may be a good idea to give programmers some decision power here.
+
+This brings us to the following insight...
+
+### Coloring the call graph for GC registration state
+
+Whether a function is eligible for total safepoint removal is not a
+matter of what regions of a mixed-GC call graph it appears in (be it
+head, tail, or in between) but rather a question of whether the
+function may ever be called at a point in time in which a thread is
+currently registered for GC.  Assuming we do have some mechanism to
+deregister threads, and this is an operation that can be interjected
+into node transitions in the call graph (i.e., calls and returns), we
+can effectively map out a second type of "sectioning" of the call
+graph into regions; namely, regions in which the thread is or isn't
+registered with GC.
+
+In the most naive and extreme implementation, these registration and
+deregistration operations could be inserted into every single point
+where the thread transitions between a GC and no-GC call graph region,
+which would make the two sectioning types equivalent.  That is, the
+"coloring" of the call graph telling you which nodes are functions
+that may hold GC pointers in their local variables becomes equivalent
+to the coloring of the call graph telling you which nodes represent a
+possible point in the thread's execution during which it's registered
+with the GC.
+
+Through either heuristics or programmer control, it would make sense
+to try and create a "GC registration coloring" of the call graph that
+will cause relatively few register/deregister operations at run time,
+while reaping significant efficiency gains from having the thread be
+exempt from GC during certain points in execution.
+
+Oh and: A function now becomes eligible for SE optimization if it only
+appears in GC-unregistered portions of thread call graphs; this is a
+generalization of the previous idea of performing SE on functions that
+are only called from pure-no-GC threads.
+
+### But deregistration can be very expensive
+
+OK, there's one issue: If you deregister a thread while there's GC
+frames within the stack, you obviously can't just let the GC ignore
+those by not checking this thread's stack.  Instead, deregistration
+would need to run a thread-local mark phase of the GC and save all
+pointers it found elsewhere.
+
+This *may* still be worth it in some cases, but means that it becomes
+all the more important to minimize this type of deregistration that
+happens in deeper parts of the call graph, spanning over GC regions.
+
+This brings us to the final part.
+
+### Barrier-free concurrent GC during no-GC stack tails
+
+This is the idea I'm most proud of because it's the most crazy:
+
+Imagine there's a tail of the call graph that's no-GC colored.  (This
+is with respect to whether they can hold GC pointers in their locals;
+not the other coloring that's about registration state.)
+
+Further, imagine that all *other* uses of the functions, that make up
+this region of the call graph, make them eligible for SE.
+
+To ensure they remain eligible for SE, the compiler would normally
+need to insert a deregistration point of the very expensive kind with
+a thread-local mark phase.
+
+But what if...
+
+Instead of adding a deregistration point, we do a special thing:
+
+1. The thread informs the GC that it's entering a no-GC tail region.
+
+2. A special type of safepoint is added to the *return point* of the
+   no-GC region; not one that checks for the GC flag saying "hey I'm
+   waiting for threads to stop" but rather a special one saying "the
+   GC is **currently scanning your stack**".
+
+So long as the thread is within the no-GC tail of the call graph, the
+GC is free to simply start scanning its stack, up to the point where
+the no-GC region begins.  If the no-GC region returns while this is
+happening, it hits the special type of safepoint mentioned above,
+which simply makes it pause itself, after clearing the flag saying
+"I'm in a no-GC region" so that the GC will know what happened and
+resume it once it's done.
+
+This retains the SE eligibility of the functions and doesn't require
+the expensive deregistration.  Instead, it allows a limited form of
+barrier-free concurrent mark phase under special conditions, ensuring
+that any potential hot loops in tail regions of a call graph can run
+completely uninhibited and without safe-points.
+
+Did I invent something new??  How useful would this actually be??
+
+Or did I goof up, forgot about some thorny detailed, and described
+something that wouldn't even work in practice?
author	Taylan Kammer <taylan.kammer@gmail.com>	2026-05-24 16:22:41 +0200
committer	Taylan Kammer <taylan.kammer@gmail.com>	2026-05-24 16:22:41 +0200
commit	4b9213d9addd6315de73747a0fdd7e996ba2826a (patch)
tree	50a9d85fc43be14dc40cc5e551b62a97bc6739d6 /notes/260524-gc-opt.md
parent	378f8598a5a57b731948241e41f584f5172dc2a2 (diff)