| 1 | +# Biome Type Architecture |
| 2 | + |
| 3 | +In order to contribute to Biome's type inference, it's good to understand our |
| 4 | +type architecture. |
| 5 | + |
| 6 | +## Architecture Constraints |
| 7 | + |
| 8 | +The main thing to understand about Biome is that we put our **User Experience** |
| 9 | +front and center. Whether it's our |
| 10 | +[Rule Pillars](https://biomejs.dev/linter/#rule-pillars), our Batteries-Included |
| 11 | +approach, the |
| 12 | +[`biome migrate`](https://biomejs.dev/guides/migrate-eslint-prettier/) command |
| 13 | +for users coming from other tools, or our focus on IDE support, we know that |
| 14 | +without users we are nowhere. |
| 15 | + |
| 16 | +And it's precisely this last point, our IDE support, that's so important here. |
| 17 | +IDE support was already an important consideration in our |
| 18 | +[approach to multi-file support](https://github.com/biomejs/biome/discussions/4653), |
| 19 | +and this seeps through into our type inference architecture. |
| 20 | + |
| 21 | +For many tools, such as bundlers, it is sufficient to optimise the performance |
| 22 | +for CLI usage. Development servers may have an interest in optimising hot-reload |
| 23 | +performance as well, but they tend to do so by pushing responsibility to the |
| 24 | +client instead of rebuilding their bundles faster. |
| 25 | + |
| 26 | +For Biome, priorities are different: If a user changes file A, they want the |
| 27 | +diagnostics for file B to update in their IDE, regardless of whether it has |
| 28 | +dependencies on file A. Updates need to happen near-instantaneously, and |
| 29 | +the IDE is not a client we can offload responsibility to. |
| 30 | + |
| 31 | +## Module Graph |
| 32 | + |
| 33 | +Biome's [module graph](../biome_module_graph/) is central to our multi-file |
| 34 | +support and is designed with these considerations in mind. And our type |
| 35 | +architecture is built upon this module graph. The module graph is effectively |
| 36 | +just a [fancy hash map](https://github.com/ibraheemdev/papaya/) that contains |
| 37 | +entries for every module (every JS/TS file in a repository), including metadata |
| 38 | +such as which other modules that module depends on, which symbols it exports, |
| 39 | +and yes, also which types it contains. |
| 40 | + |
| 41 | +The key constraint the module graph operates under is this: No module may copy |
| 42 | +or clone data from another module, not even if that data is behind an |
| 43 | +[`Arc`](https://doc.rust-lang.org/std/sync/struct.Arc.html). |
| 44 | +The reason for this is simple: Because of our focus on IDE support, we maintain |
| 45 | +the idea that any module in the module graph may be updated at any point in time |
| 46 | +due to a user action. Whenever that happens, we shouldn't have trouble figuring |
| 47 | +out which other modules need their data to be invalidated, which might happen if |
| 48 | +modules were to copy each other's data. |
| 49 | + |
| 50 | +Some other tools use complex systems to track dependencies between modules, both |
| 51 | +explicit dependencies as well as implicit ones, so they can do very granular |
| 52 | +cache invalidation. With Biome we're trying radical simplicity instead: just |
| 53 | +make sure we don't have such dependencies between entries in our module graph. |
| 54 | +So far, that appears to be working well enough, but naturally, it comes with its |
| 55 | +own challenges. |
| 56 | + |
| 57 | +## Type Data Structures |
| 58 | + |
| 59 | +In Biome, the most basic data structure for type information is a giant `enum`, |
| 60 | +called `TypeData`, defined in [type_info.rs](src/type_info.rs). |
| 61 | + |
| 62 | +This enum has many different variants in order to cover all the different kinds |
| 63 | +of types that TypeScript supports. But a few are specifically |
| 64 | +interesting to mention here: |
| 65 | + |
| 66 | +* `TypeData::Unknown` is important because our implementation of type inference |
| 67 | + is only a partial implementation. Whenever something is not implemented, we |
| 68 | + default to `Unknown` to indicate that, well, the type is unknown. This is |
| 69 | + practically identical to the `unknown` keyword that exists in TypeScript, but |
| 70 | + we do have a separate `TypeData::UnknownKeyword` variant for that so that we |
| 71 | + can distinguish between situations where our inference falls short versus |
| 72 | + situations where we _can't_ infer because the user explicitly used `unknown`. |
| 73 | + They're semantically identical, so the difference is only for measuring the |
| 74 | + effectiveness of our inference. |
| 75 | +* Complex types such as `TypeData::Function` and `TypeData::Object` carry extra |
| 76 | + information, such as definitions of function parameters and object properties. |
| 77 | + Because function parameters and object properties themselves also have a type, |
| 78 | + we can recognise that `TypeData` is potentially a circular data structure. |
| 79 | +* Rather than allowing the data structure itself to become circular/recursive, |
| 80 | + we use `TypeReference` to refer to other types. And because we try to avoid |
| 81 | + duplicating types if we can, we have `TypeData::Reference` to indicate a type |
| 82 | + is nothing but a reference to another type. |
| 83 | + |
| 84 | +## Why Use Type References? |
| 85 | + |
| 86 | +Theoretically, we _could_ use `Arc` and let types reference each other directly. |
| 87 | +But remember that module graph mentioned above? If a type from module A were to |
| 88 | +reference a type from module B, and we'd store the type from module B behind an |
| 89 | +`Arc`, then what would happen if module B were replaced in our module graph? |
| 90 | + |
| 91 | +The result would be that the module graph would have an updated version of |
| 92 | +module B, but the types in module A would hang on to old versions of those |
| 93 | +types, because the `Arc` would keep those old versions alive. Of course we could |
| 94 | +try to mitigate that, but solutions tend to become either very complex or very |
| 95 | +slow, and possibly both. |
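
The stale-`Arc` problem described above can be shown in a few lines. This is only a sketch: a plain `HashMap` stands in for the module graph, and the file name, the string payloads, and the `stale_after_replace` function are hypothetical names chosen for illustration.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Returns (what module A still sees, what the graph now contains).
fn stale_after_replace() -> (String, String) {
    // Hypothetical stand-in for the module graph: file name -> type data.
    let mut graph: HashMap<&'static str, Arc<String>> = HashMap::new();
    graph.insert("b.ts", Arc::new("number".to_string()));

    // Module A holds on to module B's type through a cloned `Arc`.
    let held_by_a = Arc::clone(&graph["b.ts"]);

    // The user edits b.ts, so its entry in the graph is replaced.
    graph.insert("b.ts", Arc::new("string".to_string()));

    // The old allocation is kept alive by module A's `Arc`.
    (held_by_a.as_str().to_string(), graph["b.ts"].as_str().to_string())
}
```

Calling `stale_after_replace()` yields `("number", "string")`: the graph has the new entry, while the `Arc` held by module A still points at the old one.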
| 96 | + |
| 97 | +We wanted simplicity, so we opted to sidestep this problem using |
| 98 | +`TypeReference`s instead. |
| 99 | + |
| 100 | +But even though the constraints of our module graph were our primary reason for |
| 101 | +choosing to use type references, they have other advantages too: |
| 102 | + |
| 103 | +* By not putting the type data behind `Arc`s, we can store data for multiple |
| 104 | + types in a linear vector. This improves data locality, and with it, |
| 105 | + performance. |
| 106 | +* Storing type data in a vector also makes it more convenient to see which types |
| 107 | + have been registered, which in turn helps with debugging and test snapshots. |
| 108 | +* Not having to deal with recursive data structures made some of our algorithms |
| 109 | + easier to reason about as well. If we want to perform some action on every |
| 110 | + type, we just run it on the vector instead of traversing a graph |
| 111 | + while tracking which parts of the graph have already been visited. |
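
To make these advantages concrete, here is a minimal sketch of index-based type storage. The `TypeStore` type, its methods, and the reduced set of `TypeData` variants are assumptions made for this example; they are not Biome's actual definitions.

```rust
/// Hypothetical, simplified index into a type vector.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct TypeId(u32);

/// Hypothetical, heavily reduced version of a type enum.
#[derive(Debug, PartialEq)]
enum TypeData {
    Number,
    String,
    /// Parameter and return types are stored by index, not by pointer.
    Function { params: Vec<TypeId>, returns: TypeId },
    /// Nothing but a reference to another type.
    Reference(TypeId),
}

#[derive(Default)]
struct TypeStore {
    types: Vec<TypeData>,
}

impl TypeStore {
    /// Appends a type and returns its index.
    fn register(&mut self, data: TypeData) -> TypeId {
        let id = TypeId(self.types.len() as u32);
        self.types.push(data);
        id
    }

    fn get(&self, id: TypeId) -> &TypeData {
        &self.types[id.0 as usize]
    }

    /// Visiting every type is a plain loop; no visited-set is needed.
    fn count_functions(&self) -> usize {
        self.types
            .iter()
            .filter(|ty| matches!(ty, TypeData::Function { .. }))
            .count()
    }
}
```

Because every type lives in one flat vector, a pass such as `count_functions` touches each type exactly once, even if the types refer to each other.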
| 112 | + |
| 113 | +## Type Resolution Phases |
| 114 | + |
| 115 | +Type references come in multiple variants: |
| 116 | + |
| 117 | +```rs |
| 118 | +enum TypeReference { |
| 119 | + Qualifier(TypeReferenceQualifier), |
| 120 | + Resolved(ResolvedTypeId), |
| 121 | + Import(TypeImportQualifier), |
| 122 | + Unknown, |
| 123 | +} |
| 124 | +``` |
| 125 | + |
| 126 | +The reason for these variants is that _type resolution_, the process of |
| 127 | +resolving type references, works in multiple phases. |
| 128 | + |
| 129 | +Biome recognises three levels of type inference, and has different resolution |
| 130 | +phases to support those... |
| 131 | + |
| 132 | +### Local Inference |
| 133 | + |
| 134 | +_Local inference_ is when we look at an expression and derive a type definition. |
| 135 | +For example, consider this seemingly trivial example: |
| 136 | + |
| 137 | +```js |
| 138 | +a + b |
| 139 | +``` |
| 140 | + |
| 141 | +It looks like this should be easy, but because local inference doesn't have any |
| 142 | +context such as definitions from surrounding scopes, it will never be able to |
| 143 | +understand what `a` or `b` refers to. |
| 144 | + |
| 145 | +Therefore, local inference cannot resolve this to a _concrete_ type. But with |
| 146 | +the help of type references, we can rewrite the expression into something |
| 147 | +useful: |
| 148 | + |
| 149 | +```rs |
| 150 | +TypeData::TypeofExpression(TypeofExpression::Addition { |
| 151 | + left: TypeReference::from(TypeReferenceQualifier::from_name("a")), |
| 152 | + right: TypeReference::from(TypeReferenceQualifier::from_name("b")) |
| 153 | +}) |
| 154 | +``` |
| 155 | + |
| 156 | +Local inference doesn't do any type resolution yet; it only creates type |
| 157 | +references. So in most cases we won't know a concrete type yet, but it still |
| 158 | +provides a useful starting point for later inference. |
| 159 | + |
| 160 | +Local inference is implemented in [local_inference.rs](src/local_inference.rs). |
| 161 | + |
| 162 | +### Module-Level ("Thin") Inference |
| 163 | + |
| 164 | +_Module-level inference_, sometimes called "thin inference", allows us to put |
| 165 | +those types from the local inference phase into context. This is where we look |
| 166 | +at a module as a whole, take its import and export definitions, look at the |
| 167 | +scopes that are created, as well as the types derived using local inference, and |
| 168 | +apply another round of inference to it. |
| 169 | + |
| 170 | +Within the scope of a module, we do our first round of type resolution: We take |
| 171 | +all the references of the variant `TypeReference::Qualifier` (the only ones |
| 172 | +created thus far), and attempt to look them up in the relevant scopes. If a |
| 173 | +local scope declaration is found, we consider the type _resolved_ and convert |
| 174 | +the reference into a `TypeReference::Resolved` variant with an associated |
| 175 | +`ResolvedTypeId` structure, which looks like this: |
| 176 | + |
| 177 | +```rs |
| 178 | +struct ResolvedTypeId(ResolverId, TypeId); |
| 179 | +``` |
| 180 | + |
| 181 | +Both `ResolverId` and `TypeId` are a `u32` internally, so this is a really |
| 182 | +compact representation for referencing another type, not bigger than a regular |
| 183 | +64-bit pointer. The `TypeId` is a literal index into a vector where types are |
| 184 | +stored, while the `ResolverId` is a slightly more complex identifier that allows |
| 185 | +us to determine _which_ vector we need to look in, because every module will |
| 186 | +have its own vector (and there are a few more places to look besides). |
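
The following sketch illustrates the two-part lookup. It assumes, for simplicity, that a `ResolverId` is a plain index into a list of per-resolver vectors; as noted above, the real identifier is slightly more complex. The `lookup` and `resolved_type_id_size` helpers are hypothetical.

```rust
use std::mem::size_of;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ResolverId(u32);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct TypeId(u32);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ResolvedTypeId(ResolverId, TypeId);

/// Hypothetical lookup: the resolver ID picks the vector,
/// the type ID picks the entry within that vector.
/// Type data is represented by a name only, to keep the sketch short.
fn lookup(stores: &[Vec<&'static str>], id: ResolvedTypeId) -> Option<&'static str> {
    let ResolvedTypeId(ResolverId(resolver), TypeId(index)) = id;
    stores.get(resolver as usize)?.get(index as usize).copied()
}

/// Two `u32` fields add up to 8 bytes, the size of a 64-bit pointer.
fn resolved_type_id_size() -> usize {
    size_of::<ResolvedTypeId>()
}
```

With `stores = vec![vec!["number", "string"], vec!["Promise"]]`, looking up `ResolvedTypeId(ResolverId(1), TypeId(0))` returns `Some("Promise")`, and an out-of-range index returns `None`.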
| 187 | + |
| 188 | +Another possibility is that the qualifier references a binding from an |
| 189 | +_import statement_, such as `import { a } from "./a.ts"`. In this case, we |
| 190 | +cannot fully resolve the type yet, because thin inference cannot look beyond the |
| 191 | +boundaries of its own module. But we can mark this case as an explicit import |
| 192 | +reference. This is what the `TypeReference::Import` variant is for. |
| 193 | + |
| 194 | +And if the qualifier exists neither as a local declaration, nor as an imported |
| 195 | +binding, then we know it must come from the global scope, where we can find |
| 196 | +predefined bindings such as `Array` and `Promise`, or the `window` object. If a |
| 197 | +global reference is found, it also gets converted to a `TypeReference::Resolved` |
| 198 | +variant, where the `ResolverId` can be used to indicate this type can be looked |
| 199 | +up from a vector of predefined types. |
| 200 | + |
| 201 | +But ultimately, if not even a global declaration was found, then we're at a loss |
| 202 | +and fall back to `TypeReference::Unknown`. |
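
The lookup order described above (local declaration, then import, then global, then unknown) can be sketched as follows. This is an illustration under simplifying assumptions: qualifiers are plain strings, a single flat `ModuleScope` replaces the real scope tree, and `MODULE_RESOLVER` and `GLOBALS_RESOLVER` are hypothetical resolver IDs.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified version of the reference variants shown earlier.
#[derive(Debug, PartialEq)]
enum TypeReference {
    Qualifier(String),
    Resolved { resolver: u32, type_id: u32 },
    Import(String),
    Unknown,
}

// Hypothetical resolver IDs for this sketch.
const MODULE_RESOLVER: u32 = 0;
const GLOBALS_RESOLVER: u32 = 1;

/// Hypothetical flat view of what thin inference can see.
struct ModuleScope {
    locals: HashMap<String, u32>,
    imports: HashSet<String>,
    globals: HashMap<String, u32>,
}

impl ModuleScope {
    fn resolve(&self, reference: TypeReference) -> TypeReference {
        // Only qualifiers still need resolving at this stage.
        let name = match reference {
            TypeReference::Qualifier(name) => name,
            other => return other,
        };
        if let Some(&type_id) = self.locals.get(&name) {
            // Found a local declaration.
            TypeReference::Resolved { resolver: MODULE_RESOLVER, type_id }
        } else if self.imports.contains(&name) {
            // Imported binding: defer to full inference.
            TypeReference::Import(name)
        } else if let Some(&type_id) = self.globals.get(&name) {
            // Predefined global such as `Promise`.
            TypeReference::Resolved { resolver: GLOBALS_RESOLVER, type_id }
        } else {
            TypeReference::Unknown
        }
    }
}
```

Note that the import case deliberately stays unresolved: it records that the binding is imported without looking into the other module.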
| 203 | + |
| 204 | +Thin inference is implemented in |
| 205 | +[js_module_info/collector.rs](../biome_module_graph/src/js_module_info/collector.rs). |
| 206 | + |
| 207 | +### Full Inference |
| 208 | + |
| 209 | +_Full inference_ is where we can tie all the loose ends together. It's where we |
| 210 | +have the entire module graph at our disposal, so that whenever we run into an |
| 211 | +unresolved `TypeReference::Import` variant, we can resolve it on the spot, at |
| 212 | +which point it becomes a `TypeReference::Resolved` variant again. |
| 213 | + |
| 214 | +Today, results from our full inference cannot be cached for the same reason |
| 215 | +we've seen before: Such a cache would get stale the moment a module is replaced, |
| 216 | +and we don't want to have complex cache invalidation schemes. |
| 217 | + |
| 218 | +Full inference is implemented in |
| 219 | +[scoped_resolver.rs](../biome_module_graph/src/js_module_info/scoped_resolver.rs). |
| 220 | + |
| 221 | +## Type Resolvers |
| 222 | + |
| 223 | +The thing about having all these type references all over the place is that you |
| 224 | +need to perform explicit type resolution to follow these references. That's why |
| 225 | +we have _type resolvers_. There's a `TypeResolver` trait, defined in |
| 226 | +[resolver.rs](src/resolver.rs). As of today, we have 6 implementations of it: |
| 227 | + |
| 228 | +* **`HardcodedSymbolResolver`**. This one is purely for test purposes. |
| 229 | +* **`GlobalsResolver`**. This is the one that is responsible for resolving |
| 230 | + globals such as `Promise` and `Array`. The way we do this is still rather |
| 231 | + primitive with hardcoded, predefined symbols. At some point we probably should |
| 232 | + be able to use TypeScript's own global `.d.ts` files, such as |
| 233 | + [es2023.array.d.ts](https://github.com/microsoft/TypeScript/blob/main/src/lib/es2023.array.d.ts), |
| 234 | + directly. |
| 235 | +* **`JsModuleInfoCollector`**. This one is responsible for collecting |
| 236 | + information about a module, and for performing our module-level inference. |
| 237 | +* **`JsModuleInfo`**. Once the `JsModuleInfoCollector` has done its job, a |
| 238 | + `JsModuleInfo` instance is created, which is stored as an entry in our module |
| 239 | + graph. But this data structure also implements `TypeResolver` so that our full |
| 240 | + inference can access the module's types too. |
| 241 | +* **`ScopedResolver`**. This is the one that is responsible for our actual full |
| 242 | + inference. It's named as it is because it is the only resolver that can really |
| 243 | + resolve things in any arbitrary scope. Compare this to the |
| 244 | + `JsModuleInfoCollector` which only cares about the global scope of a module, |
| 245 | + because at least so far that's all we need to determine types of exports |
| 246 | + (we don't determine the return type of functions without annotations yet, and |
| 247 | + it's not yet decided when or if we'll do this). |
| 248 | +* **`ScopeRestrictedRegistrationResolver`** may sound impressive, but is merely a |
| 249 | + helper for `ScopedResolver` to conveniently set the correct scope ID on |
| 250 | + certain references, so that when the time comes for the `ScopedResolver` to |
| 251 | + resolve it, it will still know which scope should be used for resolving it. |
| 252 | + |
| 253 | +As mentioned before, types are stored in vectors. Those type vectors are |
| 254 | +stored inside the structures that implement `TypeResolver`, and with the |
| 255 | +exception of `ScopeRestrictedRegistrationResolver`, they all have their own |
| 256 | +internal storage for types. |
| 257 | + |
| 258 | +## Flattening |
| 259 | + |
| 260 | +Apart from type resolution, there's one last important piece to type |
| 261 | +inference: _type flattening_. |
| 262 | + |
| 263 | +Let's look at the `a + b` expression again. After local inference, it was |
| 264 | +interpreted as this: |
| 265 | + |
| 266 | +```rs |
| 267 | +TypeData::TypeofExpression(TypeofExpression::Addition { |
| 268 | + left: TypeReference::from(TypeReferenceQualifier::from_name("a")), |
| 269 | + right: TypeReference::from(TypeReferenceQualifier::from_name("b")) |
| 270 | +}) |
| 271 | +``` |
| 272 | + |
| 273 | +But at some point, presumably one of the resolvers is going to be able to |
| 274 | +resolve `a` and `b`, and the expression becomes something such as: |
| 275 | + |
| 276 | +```rs |
| 277 | +TypeData::TypeofExpression(TypeofExpression::Addition { |
| 278 | + left: TypeReference::from(ResolvedTypeId(/* resolver ID and type ID */)), |
| 279 | + right: TypeReference::from(ResolvedTypeId(/* resolver ID and type ID */)) |
| 280 | +}) |
| 281 | +``` |
| 282 | + |
| 283 | +At this point we know the actual types we are dealing with. If the types for |
| 284 | +both `left` and `right` resolve to `TypeData::Number`, the entire expression can |
| 285 | +be _flattened_ to `TypeData::Number`, because that's the result of adding two |
| 286 | +numbers. And in most other cases it will become `TypeData::String` instead. |
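
As a rough sketch of the idea, the function below flattens an addition once both operands are known. It is a simplification under stated assumptions: operands are plain indices into a single type vector rather than `TypeReference`s, only a handful of variants exist, and the `flatten` signature is hypothetical. Leaving the expression untouched when an operand is still unknown is a conservative choice made for this example.

```rust
/// Hypothetical, reduced type enum for this sketch.
#[derive(Clone, Debug, PartialEq)]
enum TypeData {
    Number,
    String,
    Unknown,
    /// The type of `left + right`; operands are indices into the type vector.
    Addition { left: usize, right: usize },
}

/// Flattens an addition if its operand types allow it.
fn flatten(types: &[TypeData], ty: &TypeData) -> TypeData {
    let (left, right) = match ty {
        TypeData::Addition { left, right } => (&types[*left], &types[*right]),
        // Anything that is not an addition is returned as-is.
        other => return other.clone(),
    };
    match (left, right) {
        // Adding two numbers yields a number.
        (TypeData::Number, TypeData::Number) => TypeData::Number,
        // If either side is a string, `+` concatenates.
        (TypeData::String, _) | (_, TypeData::String) => TypeData::String,
        // Otherwise leave the expression unflattened for now.
        _ => ty.clone(),
    }
}
```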
| 287 | + |
| 288 | +Flattening is implemented in [flattening.rs](src/flattening.rs). |
| 289 | + |
| 290 | +## `ResolvedTypeData` |
| 291 | + |
| 292 | +One more important data structure to be aware of is `ResolvedTypeData`. Whenever |
| 293 | +we request type data from a resolver, we don't receive a `&TypeData` reference, |
| 294 | +but `ResolvedTypeData`. |
| 295 | + |
| 296 | +The reason for this structure is that it tracks the `ResolverId` so we remember |
| 297 | +where this type data was found. This is important if you want to resolve |
| 298 | +`TypeReference`s that are part of the type data and you need to make subsequent |
| 299 | +calls to the resolver. |
| 300 | + |
| 301 | +`ResolvedTypeData` has an `as_raw_data()` method that returns the raw |
| 302 | +`&TypeData` reference. This is often used for matching against the variants of |
| 303 | +the `TypeData` enum. But keep in mind that any data that you retrieve this way |
| 304 | +cannot be used with a resolver unless you explicitly and manually apply the |
| 305 | +right `ResolverId` to it! Unfortunately we cannot enforce this through the type |
| 306 | +system, and **mistakes can lead to panics**. |