Commit cb1005c

arendjr, chansuke, and ematipico authored
docs: type architecture (#5978)
Co-authored-by: Yusuke Abe <[email protected]> Co-authored-by: Emanuele Stoppa <[email protected]>
1 parent 905c760 commit cb1005c

1 file changed: 306 additions & 0 deletions
@@ -0,0 +1,306 @@
# Biome Type Architecture

In order to contribute to Biome's type inference, it's good to understand our
type architecture.

## Architecture Constraints

The main thing to understand about Biome is that we put our **User Experience**
front and center. Whether it's our
[Rule Pillars](https://biomejs.dev/linter/#rule-pillars), our Batteries-Included
approach, the
[`biome migrate`](https://biomejs.dev/guides/migrate-eslint-prettier/) command
for users coming from other tools, or our focus on IDE support, we know that
without users we are nowhere.

And it's precisely this last point, our IDE support, that's so important here.
IDE support was already an important consideration in our
[approach to multi-file support](https://github.com/biomejs/biome/discussions/4653),
and this seeps through into our type inference architecture.

For many tools, such as bundlers, it is sufficient to optimise the performance
for CLI usage. Development servers may have an interest in optimising hot-reload
performance as well, but they tend to do so by pushing responsibility to the
client instead of rebuilding their bundles faster.

For Biome, priorities are different: If a user changes file A, they want the
diagnostics for file B to update in their IDE, regardless of whether it has
dependencies on file A. Updates need to happen near-instantaneously, and
the IDE is not a client we can offload responsibility to.

## Module Graph

Biome's [module graph](../biome_module_graph/) is central to our multi-file
support and is designed with these considerations in mind. And our type
architecture is built upon this module graph. The module graph is effectively
just a [fancy hash map](https://github.com/ibraheemdev/papaya/) that contains
entries for every module (every JS/TS file in a repository), including metadata
such as which other modules that module depends on, which symbols it exports,
and yes, also which types it contains.

The key constraint the module graph operates under is this: No module may copy
or clone data from another module, not even if that data is behind an
[`Arc`](https://doc.rust-lang.org/std/sync/struct.Arc.html).
The reason for this is simple: Because of our focus on IDE support, we maintain
the idea that any module in the module graph may be updated at any point in time
due to a user action. Whenever that happens, we shouldn't have trouble figuring
out which other modules need their data to be invalidated, which might happen if
modules were to copy each other's data.

Some other tools use complex systems to track dependencies between modules, both
explicit dependencies as well as implicit ones, so they can do very granular
cache invalidation. With Biome we're trying radical simplicity instead: just
make sure we don't have such dependencies between entries in our module graph.
So far, that appears to be working well enough, but naturally, it comes with its
own challenges.

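To make the "no copies between entries" constraint concrete, here is a minimal
sketch. It assumes a plain `std::collections::HashMap` rather than the
concurrent map Biome uses, and `ModuleGraph`, `ModuleInfo`, `update()` and
`has_export()` are hypothetical names for illustration only, not Biome's API.

```rs
use std::collections::HashMap;

/// Hypothetical, heavily simplified module graph entry.
#[derive(Debug, Clone, PartialEq)]
struct ModuleInfo {
    /// Paths of the modules this module depends on.
    dependencies: Vec<String>,
    /// Names of the symbols this module exports.
    exports: Vec<String>,
}

/// Hypothetical module graph: one entry per module path.
#[derive(Debug, Default)]
struct ModuleGraph {
    modules: HashMap<String, ModuleInfo>,
}

impl ModuleGraph {
    /// Replaces a single entry. Because no other entry holds a copy of this
    /// module's data, nothing else needs to be invalidated.
    fn update(&mut self, path: &str, info: ModuleInfo) {
        self.modules.insert(path.to_string(), info);
    }

    /// Answers a question about another module by going through the graph at
    /// the moment the answer is needed, instead of keeping a copy around.
    fn has_export(&self, path: &str, name: &str) -> bool {
        self.modules
            .get(path)
            .is_some_and(|module| module.exports.iter().any(|export| export.as_str() == name))
    }
}
```

In this sketch, replacing the entry for `b.ts` is a single `update()` call, and
any later `has_export()` lookup sees the new data right away.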
## Type Data Structures

In Biome, the most basic data structure for type information is a giant `enum`,
called `TypeData`, defined in [type_info.rs](src/type_info.rs).

This enum has many different variants in order to cover all the different kinds
of types that TypeScript supports. But a few are specifically
interesting to mention here:

* `TypeData::Unknown` is important because our implementation of type inference
  is only a partial implementation. Whenever something is not implemented, we
  default to `Unknown` to indicate that, well, the type is unknown. This is
  practically identical to the `unknown` keyword that exists in TypeScript, but
  we do have a separate `TypeData::UnknownKeyword` variant for that so that we
  can distinguish between situations where our inference falls short versus
  situations where we _can't_ infer because the user explicitly used `unknown`.
  They're semantically identical, so the difference is only for measuring the
  effectiveness of our inference.
* Complex types such as `TypeData::Function` and `TypeData::Object` carry extra
  information, such as definitions of function parameters and object properties.
  Because function parameters and object properties themselves also have a type,
  we can recognise that `TypeData` is potentially a circular data structure.
* Rather than allowing the data structure itself to become circular/recursive,
  we use `TypeReference` to refer to other types. And because we try to avoid
  duplicating types if we can, we have `TypeData::Reference` to indicate a type
  is nothing but a reference to another type.

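The points above can be sketched in a few lines. This is an illustrative
reduction, not the definition from `type_info.rs`: the `Index` variant, the
field names and `count_inference_gaps()` are assumptions made for this example.

```rs
/// Simplified stand-in for a type reference (the `Index` variant is
/// hypothetical: an index into a slice of `TypeData`).
#[derive(Debug, Clone, PartialEq)]
enum TypeReference {
    Index(usize),
    Unknown,
}

/// Simplified stand-in for `TypeData` with only a handful of variants.
#[derive(Debug, Clone, PartialEq)]
enum TypeData {
    /// Inference fell short.
    Unknown,
    /// The user explicitly wrote `unknown`.
    UnknownKeyword,
    Number,
    /// Parameters and return type are references, not nested `TypeData`,
    /// so the structure itself never becomes recursive.
    Function {
        parameters: Vec<TypeReference>,
        return_type: TypeReference,
    },
    /// A type that is nothing but a reference to another type.
    Reference(TypeReference),
}

/// Counts the types where inference fell short, ignoring explicit `unknown`.
fn count_inference_gaps(types: &[TypeData]) -> usize {
    types
        .iter()
        .filter(|ty| matches!(ty, TypeData::Unknown))
        .count()
}
```

Note how a function that takes itself as a parameter can be expressed as
`Function { parameters: vec![TypeReference::Index(0)], .. }` stored at index 0:
the type is circular, yet the data structure is not.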
## Why Use Type References?

Theoretically, we _could_ use `Arc` and let types reference each other directly.
But remember that module graph mentioned above? If a type from module A were to
reference a type from module B, and we'd store the type from module B behind an
`Arc`, then what would happen if module B were replaced in our module graph?

The result would be that the module graph would have an updated version of
module B, but the types in module A would hang on to old versions of those
types, because the `Arc` would keep those old versions alive. Of course we could
try to mitigate that, but solutions tend to become either very complex or very
slow, and possibly both.

We wanted simplicity, so we opted to sidestep this problem using
`TypeReference`s instead.

But even though the constraints of our module graph were our primary reason for
choosing to use type references, they have other advantages too:

* By not putting the type data behind `Arc`s, we can store data for multiple
  types in a linear vector. This improves data locality, and with it,
  performance.
* Storing type data in a vector also makes it more convenient to see which types
  have been registered, which in turn helps with debugging and test snapshots.
* Not having to deal with recursive data structures made some of our algorithms
  easier to reason about as well. If we want to perform some action on every
  type, we just run it on the vector instead of traversing a graph
  while tracking which parts of the graph have already been visited.

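A minimal sketch of the vector-based storage described above. `TypeStore`,
`register()`, `get()` and `count_where()` are hypothetical names chosen for
this example, and the linear-scan deduplication is an assumption, not a
description of how Biome deduplicates types.

```rs
/// Index into the vector of a `TypeStore`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TypeId(u32);

/// Simplified stand-in for `TypeData`.
#[derive(Debug, Clone, PartialEq)]
enum TypeData {
    Number,
    String,
    /// An array whose element type lives elsewhere in the same vector.
    Array(TypeId),
}

/// Hypothetical store that keeps all types in one linear vector.
#[derive(Debug, Default)]
struct TypeStore {
    types: Vec<TypeData>,
}

impl TypeStore {
    /// Registers a type, reusing the existing entry if an equal one exists.
    fn register(&mut self, data: TypeData) -> TypeId {
        if let Some(index) = self.types.iter().position(|ty| *ty == data) {
            return TypeId(index as u32);
        }
        self.types.push(data);
        TypeId((self.types.len() - 1) as u32)
    }

    /// Follows an ID to the data it refers to. Panics on an invalid ID.
    fn get(&self, id: TypeId) -> &TypeData {
        &self.types[id.0 as usize]
    }

    /// Performs an action on every type with a plain linear scan; no graph
    /// traversal and no "already visited" bookkeeping is needed.
    fn count_where(&self, predicate: impl Fn(&TypeData) -> bool) -> usize {
        self.types.iter().filter(|&ty| predicate(ty)).count()
    }
}
```

Because every type lives in `types`, printing the vector with `{:#?}` is
already a usable debugging aid in this sketch.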
## Type Resolution Phases

Type references come in multiple variants:

```rs
enum TypeReference {
    Qualifier(TypeReferenceQualifier),
    Resolved(ResolvedTypeId),
    Import(TypeImportQualifier),
    Unknown,
}
```

The reason for these variants is that _type resolution_, the process of
resolving type references, works in multiple phases.

Biome recognises three levels of type inference, and has different resolution
phases to support them:

### Local Inference

_Local inference_ is when we look at an expression and derive a type definition.
For example, consider this seemingly trivial example:

```js
a + b
```

It looks like this should be easy, but because local inference doesn't have any
context such as definitions from surrounding scopes, it will never be able to
understand what `a` or `b` refers to.

Therefore, local inference cannot resolve this to a _concrete_ type. But with
the help of type references, we can rewrite the expression into something
useful:

```rs
TypeData::TypeofExpression(TypeofExpression::Addition {
    left: TypeReference::from(TypeReferenceQualifier::from_name("a")),
    right: TypeReference::from(TypeReferenceQualifier::from_name("b"))
})
```

Local inference doesn't do any type resolution yet; it only creates type
references. So in most cases we won't know a concrete type yet, but it still
provides a useful starting point for later inference.

Local inference is implemented in [local_inference.rs](src/local_inference.rs).

162+
### Module-Level ("Thin") Inference
163+
164+
_Module-level inference_, sometimes called: "thin inference", allows us to put
165+
those types from the local inference phase into context. This is where we look
166+
at a module as a whole, take its import and export definitions, look at the
167+
scopes that are created, as well as the types derived using local inference, and
168+
apply another round of inference to it.
169+
170+
Within the scope of a module, we do our first round of type resolution: We take
171+
all the references of the variant `TypeReference::Qualifier` (the only ones
172+
created thus far), and attempt to look them up in the relevant scopes. If a
173+
local scope declaration is found, we consider the type _resolved_ and convert
174+
the reference into a `TypeReference::Resolved` variant with an associated
175+
`ResolvedTypeId` structure, which looks like this:
176+
177+
```rs
178+
struct ResolvedTypeId(ResolverId, TypeId)
179+
```
180+
181+
Both `ResolverId` and `TypeId` are a `u32` internally, so this is a really
182+
compact representation for referencing another type, not bigger than a regular
183+
64-bit pointer. The `TypeId` is a literal index into a vector where types are
184+
stored, while the `ResolverId` is a slightly more complex identifier that allows
185+
us to determine _which_ vector we need to look in, because every module will
186+
have its own vector (and there are a few more places to look besides).
187+
188+
Another possibility is that the qualifier references a binding from an
189+
_import statement_, such as `import { a } from "./a.ts"`. In this case, we
190+
cannot fully resolve the type yet, because thin inference cannot look beyond the
191+
boundaries of its own module. But we can mark this case as an explicit import
192+
reference. This is what the `TypeReference::Import` variant is for.
193+
194+
And if the qualifier exists neither as a local declaration, nor as an imported
195+
binding, then we know it must come from the global scope, where we can find
196+
predefined bindings such as `Array` and `Promise`, or the `window` object. If a
197+
global reference is found, it also gets converted to a `TypeReference::Resolved`
198+
variant, where the `ResolverId` can be used to indicate this type can be looked
199+
up from a vector of predefined types.
200+
201+
But ultimately, if not even a global declaration was found, then we're at a loss
202+
and fall back to `TypeReference::Unknown`.
203+
204+
Thin inference is implemented in
205+
[js_module_info/collector.rs](../biome_module_graph/src/js_module_info/collector.rs).
206+
207+
## Full Inference
208+
209+
_Full inference_ is where we can tie all the loose ends together. It's where we
210+
have the entire module graph at our disposal, so that whenever we run into an
211+
unresolved `TypeReference::Import` variant, we can resolve it on the spot, at
212+
which point it becomes a `TypeReference::Resolved` variant again.
213+
214+
Today, results from our full inference cannot be cached for the same reason
215+
we've seen before: Such a cache would get stale the moment a module is replaced,
216+
and we don't want to have complex cache invalidation schemes.
217+
218+
Full inference is implemented in
219+
[scoped_resolver.rs](../biome_module_graph/src/js_module_info/scoped_resolver.rs).
220+
221+
## Type Resolvers
222+
223+
The thing about having all these type references all over the place is that you
224+
need to perform explicit type resolution to follow these references. That's why
225+
we have _type resolvers_. There's a `TypeResolver` trait, defined in
226+
[resolver.rs](src/resolver.rs). As of today, we have 6 implementations of it:
227+
228+
* **`HardcodedSymbolResolver`**. This one is purely for test purposes.
229+
* **`GlobalsResolver`**. This is the one that is responsible for resolving
230+
globals such as `Promise` and `Array`. The way we do this is still rather
231+
primitive with hardcoded, predefined symbols. At some point we probably should
232+
be able to use TypeScript's own global `.d.ts` files, such as
233+
[es2023.array.d.ts](https://github.com/microsoft/TypeScript/blob/main/src/lib/es2023.array.d.ts),
234+
directly.
235+
* **`JsModuleInfoCollector`**. This one is responsible for collecting
236+
information about a module, and for performing our module-level inference.
237+
* **`JsModuleInfo`**. Once the `JsModuleInfoCollector` has done its job, a
238+
`JsModuleInfo` instance is created, which is stored as an entry in our module
239+
graph. But this data structure also implements `TypeResolver` so that our full
240+
inference can access the module's types too.
241+
* **`ScopedResolver`**. This is the one that is responsible for our actual full
242+
inference. It's named as it is because it is the only resolver that can really
243+
resolve things in any arbitrary scope. Compare this to the
244+
`JsModuleInfoCollector` which only cares about the global scope of a module,
245+
because at least so far that's all we need to determine types of exports
246+
(we don't determine the return type of functions without annotations yet, and
247+
it's not yet decided when or if we'll do this).
248+
* **`ScopeRestrictedRegistrationResolver`** may sound impressive, but is but a
249+
helper for `ScopedResolver` to conveniently set the correct scope ID on
250+
certain references, so that when the time comes for the `ScopedResolver` to
251+
resolve it, it will still know which scope should be used for resolving it.
252+
253+
I've mentioned before that types are stored in vectors. Those type vectors are
254+
stored inside the structures that implement `TypeResolver`, and with the
255+
exception of `ScopeRestrictedRegistrationResolver`, they all have their own
256+
internal storage for types.
257+
258+
## Flattening
259+
260+
Apart from type resolution, there's one other, last important piece to type
261+
inference: _type flattening_.
262+
263+
Let's look at the `a + b` expression again. After local inference, it was
264+
interpreted as this:
265+
266+
```rs
267+
TypeData::TypeofExpression(TypeofExpression::Addition {
268+
left: TypeReference::from(TypeReferenceQualifier::from_name("a")),
269+
right: TypeReference::from(TypeReferenceQualifier::from_name("b"))
270+
})
271+
```
272+
273+
But at some point, supposedly one of the resolvers is going to be able to
274+
resolve `a` and `b`, and the expression becomes something such as:
275+
276+
```rs
277+
TypeData::TypeofExpression(TypeofExpression::Addition {
278+
left: TypeReference::from(ResolvedTypeId(/* resolver ID and type ID */)),
279+
right: TypeReference::from(ResolvedTypeId(/* resolver ID and type ID */))
280+
})
281+
```
282+
283+
At this point we know the actual types we are dealing with. If the types for
284+
both `left` and `right` resolve to `TypeData::Number`, the entire expression can
285+
be _flattened_ to `TypeData::Number`, because that's the result of adding two
286+
numbers. And in most other cases it will become `TypeData::String` instead.
287+
288+
Flattening is implemented in [flattening.rs](src/flattening.rs).
289+
290+
## `ResolvedTypeData`
291+
292+
One more important data structure to be aware of is `ResolvedTypeData`. Whenever
293+
we request type data from a resolver, we don't receive a `&TypeData` reference,
294+
but `ResolvedTypeData`.
295+
296+
The reason for this structure is that it tracks the `ResolverId` so we remember
297+
where this type data was found. This is important if you want to resolve
298+
`TypeReference`s that are part of the type data and you need to make subsequent
299+
calls to the resolver.
300+
301+
`ResolvedTypeData` has an `as_raw_data()` method that returns the raw
302+
`&TypeData` reference. This is often used for matching against the variants of
303+
the `TypeData` enum. But keep in mind that any data that you retrieve this way
304+
cannot be used with a resolver unless you explicitly and manually apply the
305+
right `ResolverId` to it! Unfortunately we cannot enforce this through the type
306+
system, and **mistakes can lead to panics**.
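
The following sketch shows the idea of pairing type data with the `ResolverId`
it was found under. Only the name `as_raw_data()` comes from the text above;
the struct layout, the `Array` variant with a bare `TypeId`, and the
`apply_resolver_id()` helper are assumptions made for this example.

```rs
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ResolverId(u32);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TypeId(u32);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ResolvedTypeId(ResolverId, TypeId);

/// Simplified stand-in for `TypeData`. The element type is stored as a bare
/// index that is only meaningful within the vector it came from.
#[derive(Debug, Clone, PartialEq)]
enum TypeData {
    Number,
    Array { element: TypeId },
}

/// Hypothetical layout: the raw data plus the resolver it was found in.
#[derive(Debug, Clone, Copy)]
struct ResolvedTypeData<'a> {
    resolver_id: ResolverId,
    data: &'a TypeData,
}

impl<'a> ResolvedTypeData<'a> {
    /// Returns the raw data, for example to match on its variants.
    fn as_raw_data(&self) -> &'a TypeData {
        self.data
    }

    /// Hypothetical helper: re-attaches the remembered resolver ID to an
    /// index taken from the raw data, so it can be resolved again.
    fn apply_resolver_id(&self, id: TypeId) -> ResolvedTypeId {
        ResolvedTypeId(self.resolver_id, id)
    }
}
```

Pairing an index from the raw data with any other `ResolverId` would point into
the wrong vector, which is the kind of mistake the warning above is about.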
