notes/equal.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255

# A novel approach to object equivalence

## Story time

In my past 5 years of developing a warehouse management application,
I've frequently found myself implementing equality the following way.

(Code massively reduced to get to the point.  Implementations of the
`hashCode` method, which must be compatible with `equals`, have been
left out for brevity.)

```java

class Warehouse {
  String code;
  String name;
  // ... more fields, many mutable

  public boolean equals(Object obj) {
    if (obj == this) {
      return true;
    }
    if (obj instanceof Warehouse w) {
      return code.equals(w.code);
    }
    return false;
  }
}

class Product {
  int id;
  LocalizedString name;
  LocalizedString description;
  // ... more fields, many mutable

  public boolean equals(Object obj) {
    if (obj == this) {
      return true;
    }
    if (obj instanceof Product p) {
      return id == p.id;
    }
    return false;
  }
}

```

And so on.  A type may have a code that is a String, an id that is an
int, or some other field that uniquely identifies members of it within
our application domain.

I'm speaking of types and their members, not classes and instances,
because I mean the application domain.  The user of our program has a
number of warehouses and a number of products, in reality, which we
represent via these classes, but there may be multiple instances of a
class representing the same real entity.

Oftentimes, there will be a database that contains unique entries for
these.  For example, an SQL table for products where the id is the
primary key, and an SQL table for warehouses where the code is the
primary key.

This way of implementing `equals()` and `hashCode()` may seem wrong to
some developers.  Since the classes have mutable fields, we may end up
with two `Product` instances which are not equal in their state, yet
`equals()` returns true for them since they represent the same real
life entity, just with different state in our program.  Perhaps one
has just been fetched from the database, whereas the other has been
modified to represent changes that are yet to be committed.

I've never found this strategy to lead to any problems.  Never have I
had a need to know whether two `Product` instances have the same state
right now.  What I care about is which product is being represented.
This is useful in various ways.

### Map keys, sets, and so on

First of all, it allows us to create maps where the keys are instances
of Product or Warehouse.  For example, the application may want to
create a map of which products each warehouse contains:

```java

Map<Warehouse, List<Product>> productsByWarehouse;

```

Given the `equals()` implementation of the Warehouse class, this map
can now be given any instance of Warehouse as a key, and we need not
ensure that we have one canonical instance used as the key.

There are of course many alternatives to this.  One may not want this
map to keep Warehouse instances alive, in which case it would be more
sensible to use the codes (Strings) for the keys.  One could also add
a field to Warehouse such as `List<Product> products;`.  However, in
some circumstances, a map such as the above may be most natural.

Other data structures and operations depend on equality checks as
well.  For example: finding duplicates in a collection, checking
whether a collection already contains a given item, managing a set
data type, and so on.  The way we implement equality is also useful
for such purposes.

### Elegance and encapsulation

Secondly, we may have subsystems in the application that communicate
by passing Product or Warehouse instances to each other.  For example,
the user may be looking at a user interface displaying the products in
a warehouse with their current quantities.  The user may then click on
an "update quantities" button which opens a sub-interface where it's
possible to add an arbitrary number of entries (products) with the
desired quantity change for each.  This sub-interface may then return
a list of Product instances with the associated quantity change.  The
code of the warehouse overview interface can now use `equals()` to
determine which instance of Product it received corresponds to which
Product instance it's already holding on to.

Again, there are of course many alternative strategies which don't
require such a use of `equals()`.  One may explicitly compare the id
fields, or the sub-interface may return ids associated with quantity
changes.  However, the `equals()` based implementation may offer the
cleanest possible code:

```java

public void receiveQuantityChanges(Map<Product, Integer> changes) {
  for (Row<Product> row : this.productRows) {
    Integer change = changes.get(row.product);
    row.quantity += change;
  }
}

```

As far as readability and elegance is concerned, this can hardly be
improved on, as far as one is familiar with Java.  Although using a
`Map<Integer, Integer>` would hardly make the code any more verbose,
it immediately causes it to lose out on self-documentation:

```java

// Quantity changes of what?  Should we rename this to the awfully
// long name receiveProductQuantityChanges to make it clear?
public void receiveQuantityChanges(Map<Integer, Integer> changes) {
  for (Row<Product> row : this.productRows) {
    Integer change = changes.get(row.product.id);
    row.quantity += change;
  }
}

```

This is a minor change as far as readability is concerned, but it also
decreases encapsulation.  The code needs to be aware that products are
represented uniquely by an id that is an integer, and changing this
implementation detail of the class may have far-reaching consequences
throughout the code base.

The self-documenting property may not apply to a Scheme-based language
without static types, but it makes the encapsulation aspect all the
more significant, since we won't have a compiler or static analyzer
which immediately tells us about a change in the type of `id`.  We
must hope that the field `id` has been removed and replaced with a
different one, not reusing the `id` identifier.  (For example, this
would mean a procedure like `product-id` stops existing, so a static
analyzer or compiler can immediately warn us about its absence.  Had
we reused the field but changed its type, we would have a very hard
time finding all the places in the code we now need to fix.)

## Get to the point!

I've explained all this to arrive at one simple conclusion:

Perhaps it would be a good idea for a language to have a mechanism in
which compound types like records or classes can simply declare that
one of the defined fields is the "unique identifier" of the type.

Imitating SRFI 9:

```scheme

(define-record-type <product>
  (make-product id name description ...)
  product?
  (identity: id)   ; <- see here
  (id product-id)
  (name product-name)
  (description product-description)
  ...)

(define-record-type <warehouse>
  (make-warehouse code name ...)
  warehouse?
  (identity: code)  ; <- see here
  (code warehouse-code)
  (name warehouse-name)
  ...)

```

Now the million dollar question is whether it should be `equal?` that
makes use of this information, or `eqv?`, or both.

Although `eqv?` is intended to approximate operational equivalence,
and should therefore not consider separate mutable objects to be the
same, it's not clear to me how useful this is on user-defined types.

The rationale for defining `eqv?` in terms of operational equivalence
is that this allows implementing memoization.  If a memoized function
depends on one of the mutable fields of a record, yet `eqv?` returns
true on separate instances with different state, then the function
will return an incorrect result.  But I'm skeptical as to whether a
function like that would see much practical use.  It seems to me like
a memoized function that is used to compute some property of an entity
in our application domain should probably depend only on immutable
properties of that object.

Another way to look at `eqv?` is that it implements "mostly constant
time" equality.  (Strictly speaking, it's a linear time operation on
numbers, assuming the implementation offers arbitrary-size numbers.
This rarely matters, as few programs make use of numbers larger than
what fits in a few dozen bytes.)  Conversely, `equal?` is responsible
for deep equality testing, which is inherently linear time, though
record types were overlooked in its definition in R7RS-small.

## "Why not neither?"

(I'm coining this as a new meme, in contrast to "why not both?")

I think I want Zisp to support the following:

* `eq?`: As in Scheme, but maybe allowed to fail on procedures.
* `eqv?`: As in Scheme.
* `equiv?`: An even closer approximation of operational equivalence,
  by diving into immutable compound data types that `eqv?` doesn't
  handle due to trying to be constant-time.  I believe this is the
  same as `equal-always?` in Racket.
* `equal?`: As in Scheme, but also dives into records because why not?

The reason `eq?` is allowed to fail on procedures is an optimization
strategy that duplicates/inlines procedures.  (citation needed)

And which one should make use of the `identity` field of user types?
Only `equiv?`, because the identity object may be a string or other
immutable but compound object.  (Thinking back to my Java programs,
I've had an extremely common type that was identified by a pair of
integers, for example: document and line number.  One can also
conceive of N-dimensional coordinates and other such examples.)

We want `eqv?` to remain "pretty much constant time" with the only
exception being numbers, which seem highly unlikely to reach such a
large size that it would matter.  Recursive data structures, and
uninterned strings, on the other hand, could become quite large and
costly to compare.