notes/260524-gc-opt.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258

# Static analysis to optimize GC

I'm trying to flesh out a vague idea that keeps coming to mind.  The
basic motivation is to eliminate some or all of GC related overhead
that's incurred even on regions of code that meticulously ensure not
to use any GC allocation.  Part of Zisp's design will try to offer
various memory management strategies to the programmer, defaulting to
GC only when code is written in a "don't care" style.  This brings us
to the issue of "what if other parts of the code base, like a library
we're using, allocates things on the GC heap."

To my understanding, there are two intertwined categories of overhead
incurred on code "globally" even if only some parts of the codebase
use GC allocations:

1. The need to insert safepoints.  This typically means an additional
   instruction or two which gets inserted into the start of every
   function body or call, and the start of every loop body or back
   jump.  (Fun fact: I believe under full TCO and CPS these are one
   and the same thing.)  If you use OS signals to stop threads, then
   they don't need these safepoint instructions, but it introduces
   other issues, like the possibility that a GC pointer has landed
   within some part of a SIMD register or something.

2. The need to actually stop a thread and scan it, when it was busy
   doing some super high intensity hot loop stuff.

(This doesn't just apply to a design idea like that of Zisp where you
can take control over memory management; it also applies to codebases
where some functions do allocation-free but computationally heavy
work, and thus would ideally be free of GC overhead.)

The reason that even a function which *itself* doesn't deal with any
GC pointers needs safepoints to be inserted, is simple; consider the
following trivial Scheme-like example:

```scheme
(define (main argv)
  (print-prime (vector-ref argv 0)))

(define (print-prime n)
  (define label (sprintf "calculating prime %d" n))
  (printf "[START] %s\n" label)
  (printf "%d\n" (calculate-prime n))
  (printf "[DONE] %s\n" label))

(define (calculate-prime n)
  (left as an exercise to the reader)
  (but there are no heap allocations here))
```

The `label` string will be allocated on the heap, and although some
escape analysis and such shenanigans could in theory transform this
into a non-GC object given that its lifetime is obviously limited to
one function, assume for this demonstration that there's some reason
such optimizations cannot be applied, so `label` is truly on the GC
heap throughout the duration of the call to `calculate-prime`.

(Maybe this is part of a big codebase, with its own thread metadata
and monitoring system, and a pointer to `label` ends up in a larger
metadata object, saved in a global "current tasks" table.  Maybe the
pointer is mutable too, so the thread can dynamically swap the label
to indicate the current sub-task it's busy with.)

This means that, in the prime calculation loop, the code needs to
periodically check if the GC is waiting for the thread to yield, so
frames can safely be scanned for GC pointers.  The GC would then find
the pointer to `label` in the stack frame of the main function, and
the stack frame of `calculate-prime` would be scanned needlessly,
although that can be fixed with auxiliary stack frame metadata if I
understand correctly.  (Created at compile-time and stored in DWARF
data or something like that.)

Let's try to fix this.  We want the prime calculation to be free of
safepoints, without causing the GC a massive headache.  Typical cases
of headache induced on the GC, that we need to avoid, include:

* Causing the GC to stall until the no-GC function (prime calculation
  in this case) is finished, i.e., enormous latency on other threads.

* Needing to use OS signals to forcibly stop the thread because it
  doesn't use safepoint "polling" instructions (called async preempt
  in Golang lingo I believe) which could also mean needing to be
  conservative in scanning the currently active stack frame of the
  thread, since the no-GC function may be in the midst of some SIMD
  register gymnastics or whatever.

## The idea

The strategy I'm about to describe could be applied to arbitrarily
small sub-regions of code, I think, but to prevent myself from going
insane I will treat functions as indivisible "code region" units.

This means we can treat "may have local pointers to GC heap" as part
of the static type signature of a function.  That is, whether the
stack or registers, during execution of the function, may contain
pointers into the GC heap.

Detecting this on a per-function basis should not be difficult; just
be reasonably conservative and check whether there are any operations
within the function body whose result type is a type with some kind of
"GC heap" trait.  I need to think more about this, but I'm pretty sure
it's a solved issue since the invention of precise GC.

Now what our static analysis needs to do, first of all, is to map out
the call graph of a thread as a DAG.  Each node in the DAG is marked
as "GC" or "no GC".

There are a number of things that can happen now.

First, if the entire graph is no-GC then the entire thread can be
exempted from GC work.  The GC won't ever bother to pause it, be it
via OS signals or by setting the flag that triggers safepoint polls
(assuming the functions still have those).

Second, analysis of an entire program could reveal that certain no-GC
functions are only ever called from this thread, which would mean that
no safepoint instructions need to be compiled into these functions!

If a function is used both by a pure-no-GC thread and a GC thread, it
may even be treated as "polymorphic" with respect to GC and it could
be duplicated ("monomorphized") into a GC and no-GC version.  Probably
overkill, but an idea worth keeping in mind.

Where things get most exciting, though, is making use of information
about partial GC vs. no-GC *regions* of the call graph.  Especially
when it's the tail that's no-GC; wait for it.

### No-GC head of call graph

When the head of the call graph is no-GC, it means that the thread's
registration with the GC can be deferred to the point at which it
first enters a GC region of the call graph.  If it happens to be that
the thread first does some heavy computation free of allocations, then
enters a phase in which it reports its results or something (which may
include string allocations, for example, or some heap object to wrap
the result) then this could be a bountiful optimization.  There may
however be the risk of the thread repeatedly exiting and re-entering
the first GC region of the call graph, in which case there would be
some repeated overhead related to checking if it already registered
itself with the GC.

I think it can be a good thing for a programming language to expose
some implementation details and allow the programmer to take control
over certain decisions which one would traditionally expect to be
automatic.  Here, it could make sense for the programmer to be able
determine whether this optimization is applied to a thread or not.

### Deregistering a thread from GC

You may think that functions which only appear in the initial no-GC
heads of call graphs would be eligible for safepoint elimination (SE),
but this doesn't necessarily work.  Once a thread is registered, the
GC may at any point in the future set the flag that tells the thread
to stop for GC, and then there needs to be safepoint polling even if
the thread returned back to the no-GC head of its call graph.

To solve that, threads would need to be able to not only register but
also deregister themselves dynamically, when they return to a no-GC
region of the call graph.  This once again imposes the risk of rapid
repeated registration and deregistration if the thread happens to
constantly move between those regions of the call graph, so again it
may be a good idea to give programmers some decision power here.

This brings us to the following insight...

### Coloring the call graph for GC registration state

Whether a function is eligible for total safepoint removal is not a
matter of what regions of a mixed-GC call graph it appears in (be it
head, tail, or in between) but rather a question of whether the
function may ever be called at a point in time in which a thread is
currently registered for GC.  Assuming we do have some mechanism to
deregister threads, and this is an operation that can be interjected
into node transitions in the call graph (i.e., calls and returns), we
can effectively map out a second type of "sectioning" of the call
graph into regions; namely, regions in which the thread is or isn't
registered with GC.

In the most naive and extreme implementation, these registration and
deregistration operations could be inserted into every single point
where the thread transitions between a GC and no-GC call graph region,
which would make the two sectioning types equivalent.  That is, the
"coloring" of the call graph telling you which nodes are functions
that may hold GC pointers in their local variables becomes equivalent
to the coloring of the call graph telling you which nodes represent a
possible point in the thread's execution during which it's registered
with the GC.

Through either heuristics or programmer control, it would make sense
to try and create a "GC registration coloring" of the call graph that
will cause relatively few register/deregister operations at run time,
while reaping significant efficiency gains from having the thread be
exempt from GC during certain points in execution.

Oh and: A function now becomes eligible for SE optimization if it only
appears in GC-unregistered portions of thread call graphs; this is a
generalization of the previous idea of performing SE on functions that
are only called from pure-no-GC threads.

### But deregistration can be very expensive

OK, there's one issue: If you deregister a thread while there's GC
frames within the stack, you obviously can't just let the GC ignore
those by not checking this thread's stack.  Instead, deregistration
would need to run a thread-local mark phase of the GC and save all
pointers it found elsewhere.

This *may* still be worth it in some cases, but means that it becomes
all the more important to minimize this type of deregistration that
happens in deeper parts of the call graph, spanning over GC regions.

This brings us to the final part.

### Barrier-free concurrent GC during no-GC stack tails

This is the idea I'm most proud of because it's the most crazy:

Imagine there's a tail of the call graph that's no-GC colored.  (This
is with respect to whether they can hold GC pointers in their locals;
not the other coloring that's about registration state.)

Further, imagine that all *other* uses of the functions, that make up
this region of the call graph, make them eligible for SE.

To ensure they remain eligible for SE, the compiler would normally
need to insert a deregistration point of the very expensive kind with
a thread-local mark phase.

But what if...

Instead of adding a deregistration point, we do a special thing:

1. The thread informs the GC that it's entering a no-GC tail region.

2. A special type of safepoint is added to the *return point* of the
   no-GC region; not one that checks for the GC flag saying "hey I'm
   waiting for threads to stop" but rather a special one saying "the
   GC is **currently scanning your stack**".

So long as the thread is within the no-GC tail of the call graph, the
GC is free to simply start scanning its stack, up to the point where
the no-GC region begins.  If the no-GC region returns while this is
happening, it hits the special type of safepoint mentioned above,
which simply makes it pause itself, after clearing the flag saying
"I'm in a no-GC region" so that the GC will know what happened and
resume it once it's done.

This retains the SE eligibility of the functions and doesn't require
the expensive deregistration.  Instead, it allows a limited form of
barrier-free concurrent mark phase under special conditions, ensuring
that any potential hot loops in tail regions of a call graph can run
completely uninhibited and without safe-points.

Did I invent something new??  How useful would this actually be??

Or did I goof up, forgot about some thorny detailed, and described
something that wouldn't even work in practice?