The Reader converts text into data. It's a key component in making Convex-based apps work effectively in multiple ways:
9
+
10
+
- **Source Code** like `(transfer #101 1000000)` is transformed into trees of code ready for execution on the CVM.
11
+
- **REST APIs** can use Convex data in text form with the MIME type `application/cvx`.
12
+
- **Arbitrary Data** can be specified in `.cvx` files like `[{:name "Bob" :age 42} {:name "Sarah" :age 37}]`.
13
+
14
+
In preparation for Protonet, we've been putting the final touches on the Reader. So what's new?
15
+
16
+
<!-- truncate -->
17
+
18
+
### Performance Upgrades
19
+
20
+
The Convex Reader is now about **10x faster than before**. It can now parse roughly 15 MB/s of CVX data files into lattice data structures per thread, up from about 1.5 MB/s before.
21
+
22
+
That's pretty fast: remember we are transforming text into full cryptographically verifiable lattice data structures here, not simply scanning a file to gather statistics. It's certainly comparable to high-performance JSON parsing libraries that produce full object graphs.
23
+
24
+
This means that you can confidently implement high performance APIs that take `application/cvx` data as input, such as in the Convex REST API Server over HTTPS.
25
+
26
+
### Tagged Values
27
+
28
+
The Reader now supports **tagged values**. Tagged values are used to specify special data types that the Reader otherwise wouldn't be able to produce directly. As a motivating example, consider the `Index` type that maps blob keys to values:
29
+
30
+
```clojure
;; You can construct an Index with the `index` function
(index 0x1234 :bob)
=> #Index {0x1234 :bob}

;; However, if you try to specify it as a literal, you just get a regular map:
{0x1234 :bob}
=> {0x1234 :bob}

;; These are not the same thing! An Index is a special type distinct from a map
;; NOTE: Different type => different lattice encoding => different hash => not equal!
(= {0x1234 :bob} (index 0x1234 :bob))
=> false

;; But now you can use a tagged value to create an Index directly :-)
#Index {0x1234 :bob}
=> #Index {0x1234 :bob}

;; This produces the exact Index value we expect
(= #Index {0x1234 :bob} (index 0x1234 :bob))
=> true
```
52
+
53
+
Tagged values were inspired by Clojure's Extensible Data Notation (EDN), which allows developers to support custom types in the Clojure Reader. We don't need anything quite as sophisticated on the CVM yet (since custom user-defined types probably won't be coming until Convex v2), but it's already a very handy tool for dealing with the specialised CVM types that do exist.
54
+
55
+
### Stricter parsing
56
+
57
+
We've tightened some of the parsing rules so that potentially ambiguous input won't be misread. For example the `/` symbol as used for path lookup is now stricter with respect to whitespace:
58
+
59
+
```clojure
;; This won't work: the spaces mean that `/` is seen as a separate Symbol
(#9 / resolve 'convex.core)

;; This is OK
(#9/resolve 'convex.core)
=> #8
```
67
+
68
+
We have literally thousands of unit tests checking all kinds of input combinations to the Reader, so the intensive testing combined with the stricter parsing rules should ensure predictable and consistent Reader behaviour for Protonet.
69
+
70
+
### Learn More
71
+
72
+
Full Reader specifications are outlined in [CAD032](/docs/cad/reader). For anyone wanting to work on the Reader or CVM data translation in general it's a great place to get started!
@@ -10,15 +10,21 @@ Convex implements a standard **Encoding** format that represents any valid Conve
10
10
11
11
The Encoding model breaks Values into a Merkle DAG of one or more **Cells** that are individually encoded. Cells are immutable, and may therefore be safely shared by different values, or used multiple times in the same DAG. This technique of "structural sharing" is extremely important for the performance and memory efficiency of Convex.
12
12
13
+
## Special Requirements
13
14
15
+
Convex and related lattice infrastructure place some very specific requirements on the encoding format, which necessitate the encoding scheme designed here:
16
+
17
+
- Every distinct value must have one and only one unique valid encoding, so that it can be hashed to a stable ID
18
+
- It must be possible to encode / decode `n` bytes of data in `O(n)` time (DoS resistance)
19
+
- There must be a fixed upper bound on the encoding size of any value (excluding referenced children) so that reading and writing can occur in fixed sized buffers
14
20
15
21
## Basic Rules
16
22
17
23
### Cells
18
24
19
25
The fundamental entities that are encoded are called Cells.
20
26
21
-
Cells may contain other cells by reference, and therefore a top-level cell can be regarded as a directed acyclic graph (DAG). Since cell encodings contain cryptographic hashes of the encodings of any branch referenced cells, this is furthermore a Merkle DAG.
27
+
Cells may contain other cells by reference, and therefore a top-level cell can be regarded as a directed acyclic graph (DAG). Since cell encodings contain cryptographic hashes of the encodings of any referenced cells, this is furthermore a Merkle DAG.
22
28
23
29
### Branches
24
30
@@ -32,7 +38,7 @@ Branches are an important optimisation, since they reduce the need to produce ma
32
38
33
39
The encoding MUST be a sequence of bytes.
34
40
35
-
Any given Cell MUST map to one and only one encoding.
41
+
Any given cell MUST map to one and only one encoding.
36
42
37
43
Any two distinct (non-identical) cells MUST map to different encodings.
38
44
@@ -153,13 +159,13 @@ The two Boolean Values `true` or `false` have the Encodings `0xb1` and `0xb0` re
153
159
154
160
Note: These Tags are chosen to aid human readability, such that the first hexadecimal digit `b` suggests "binary" or "boolean", and the second hexadecimal digit represents the bit value.
155
161
156
-
### `0x10` - `0x18` Integer ("SmallInt")
162
+
### `0x10` - `0x18` Integer (Long)
157
163
158
164
```Encoding
0x1n <n bytes of numeric data>
```
161
167
162
-
A small integer value is encoded by the Tag byte followed by `n` bytes representing the signed 2's complement numeric value of the Integer. The integer must be represented in the minimum possible number of bytes (can be 0 additional bytes for the specific value `0`).
168
+
A Long value is encoded by the Tag byte followed by `n` bytes representing the signed two's complement numeric value of the Integer. The Integer MUST be represented in the minimum possible number of bytes - excess leading bytes are an invalid encoding.
163
169
164
170
Note: The value zero is conveniently encoded in this scheme as the single byte `0x10`
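
The minimal-length rule above can be illustrated with a short Python sketch. This is not the Convex implementation: the function names `encode_long` and `decode_long` are hypothetical, and big-endian byte order is an assumption, since the text here does not state the byte order.

```python
def encode_long(value: int) -> bytes:
    """Encode a 64-bit signed integer as tag 0x1n followed by n minimal bytes."""
    if not -(1 << 63) <= value < (1 << 63):
        raise ValueError("value does not fit in a 64-bit Long")
    if value == 0:
        n = 0  # zero needs no numeric bytes, so the encoding is just 0x10
    else:
        # Bits needed excluding the sign bit; ~value handles negative numbers.
        magnitude_bits = (value if value >= 0 else ~value).bit_length()
        n = (magnitude_bits + 8) // 8  # add the sign bit, round up to whole bytes
    # ASSUMPTION: big-endian byte order (not stated in this section).
    return bytes([0x10 + n]) + value.to_bytes(n, "big", signed=True)


def decode_long(encoding: bytes) -> int:
    """Decode a Long, rejecting encodings with excess leading bytes."""
    if not encoding:
        raise ValueError("empty encoding")
    n = encoding[0] - 0x10
    if not 0 <= n <= 8 or len(encoding) != 1 + n:
        raise ValueError("bad tag or length")
    value = int.from_bytes(encoding[1:], "big", signed=True)
    # Canonical check: re-encoding must reproduce the exact input bytes.
    if encode_long(value) != encoding:
        raise ValueError("non-minimal encoding")
    return value
```

Re-encoding inside the decoder is a simple way to enforce the "one and only one encoding" property: an input such as `0x11 0x00` decodes to zero, but zero re-encodes as `0x10`, so the input is rejected.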
165
171
@@ -168,7 +174,7 @@ Note: This encoding is chosen in preference to a VLC encoding because:
168
174
- It is consistent with the natural encoding for two's complement integers on most systems
169
175
- The numerical part is consistent with the format for BigInts
170
176
171
-
### `0x19` Integer ("BigInt")
177
+
### `0x19` Integer (BigInt)
172
178
173
179
```
174
180
0x19 <VLC Count length of Integer = n> <n bytes of data>
@@ -188,21 +194,6 @@ With the exception of the Tag byte, The encoding of a BigInt is defined to be ex
188
194
189
195
A Double value is encoded as the Tag byte followed by 8 bytes standard representation of an IEEE 754 double-precision floating point value.
190
196
191
-
### `0x3c` - `0x3f` Character
192
-
193
-
```
194
-
Tag determines the length in bytes of the Unicode code point value
195
-
0x3c <1 Byte>
196
-
0x3d <2 Bytes>
197
-
0x3e <3 Bytes>
198
-
0x3f <4 Bytes> (reserved, not currently possible?)
199
-
```
200
-
201
-
A Character value is encoded by the Tag byte followed by 1-4 bytes representing the Unicode code point as an unsigned integer.
202
-
203
-
A Character encoding is invalid if:
204
-
- More bytes are used than necessary (i.e. a leading byte of zero)
205
-
- The code point is beyond the maximum allowable (0x10ffff)
206
197
207
198
### `0x20` Ref
208
199
@@ -282,52 +273,156 @@ Importantly, this design allows:
282
273
### 0x32 Symbol
283
274
284
275
```
285
-
0x32 <VLC Count = n> <n bytes UTF-8 String>
276
+
0x32 <Count Byte = n> <n bytes UTF-8 String>
286
277
```
287
278
288
-
A Symbol is encoded with the Tag byte, a VLC Count length`n`, and `n` bytes of UTF-8 encoded characters.
279
+
A Symbol is encoded with the Tag byte, an unsigned count byte `n`, and `n` bytes of UTF-8 encoded characters.
289
280
290
281
The Symbol MUST have a length of 1-128 UTF-8 bytes
291
282
292
283
### `0x33` Keyword
293
284
294
285
```
295
-
0x32 <VLC Count = n> <n bytes UTF-8 String>
286
+
0x33 <Count Byte = n> <n bytes UTF-8 String>
296
287
```
297
288
298
-
A Keyword is encoded with the Tag byte, a VLC Count length`n`, and `n` bytes of UTF-8 encoded characters.
289
+
A Keyword is encoded with the Tag byte, an unsigned count byte `n`, and `n` bytes of UTF-8 encoded characters.
299
290
300
291
The Keyword MUST have a length of 1-128 UTF-8 bytes
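
As a rough illustration of the Symbol and Keyword layout described above, the following Python sketch builds the tag byte, the count byte, and the UTF-8 payload. The helper name `encode_name` and the constants are hypothetical, and the Keyword tag `0x33` is taken from the section heading.

```python
SYMBOL_TAG = 0x32   # from the Symbol section heading
KEYWORD_TAG = 0x33  # from the Keyword section heading


def encode_name(tag: int, name: str) -> bytes:
    """Encode a Symbol or Keyword as: tag byte, count byte n, n UTF-8 bytes."""
    data = name.encode("utf-8")
    # The length limit applies to UTF-8 bytes, not to characters.
    if not 1 <= len(data) <= 128:
        raise ValueError("name must be 1-128 UTF-8 bytes")
    return bytes([tag, len(data)]) + data
```

Because the length is capped at 128, it always fits in the single unsigned count byte, which is what allows the simpler count byte to replace the VLC Count.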
301
292
293
+
### `0x3c` - `0x3f` Character
294
+
295
+
```
Tag determines the length in bytes of the Unicode code point value
0x3c <1 Byte>
0x3d <2 Bytes>
0x3e <3 Bytes>
0x3f <4 Bytes> (reserved, not currently possible?)
```
302
+
303
+
A Character value is encoded by the Tag byte followed by 1-4 bytes representing the Unicode code point as an unsigned integer.
304
+
305
+
A Character encoding is invalid if:
306
+
- More bytes are used than necessary (i.e. a leading byte of zero)
307
+
- The code point is beyond the maximum allowable (0x10ffff)
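
The Character rules can be sketched in Python as follows. The names `encode_char` and `decode_char` are hypothetical, and big-endian byte order for the code point is an assumption not stated in the text.

```python
MAX_CODE_POINT = 0x10FFFF


def encode_char(ch: str) -> bytes:
    """Encode one character as tag 0x3c..0x3f plus its minimal code point bytes."""
    cp = ord(ch)  # raises TypeError unless ch is exactly one character
    n = max(1, (cp.bit_length() + 7) // 8)  # code point 0 still uses one byte
    # ASSUMPTION: the code point is stored big-endian.
    return bytes([0x3C + n - 1]) + cp.to_bytes(n, "big")


def decode_char(encoding: bytes) -> str:
    """Decode a Character, rejecting the invalid encodings listed above."""
    if not encoding:
        raise ValueError("empty encoding")
    n = encoding[0] - 0x3C + 1
    if not 1 <= n <= 4 or len(encoding) != 1 + n:
        raise ValueError("bad tag or length")
    if n > 1 and encoding[1] == 0:
        raise ValueError("more bytes used than necessary")
    cp = int.from_bytes(encoding[1:], "big")
    if cp > MAX_CODE_POINT:
        raise ValueError("code point beyond 0x10ffff")
    return chr(cp)
```

Since `0x10ffff` fits in three bytes, the encoder never emits the `0x3f` tag, which matches the "reserved" note on the 4-byte form.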
A Leaf Count `n` is defined as 0, 16, or any other positive integer which is not an exact multiple of 16.
321
+
A leaf cell is a Vector with Count `n` being 0, 16, or any other positive integer which is not an exact multiple of 16.
322
+
323
+
A Vector is defined as "packed" if its count is a positive multiple of 16. A leaf vector which is packed must therefore have a count of exactly 16 - such vectors form the leaf nodes of a tree of non-leaf vectors.
315
324
316
-
A Vector is defined as "packed" if its Count is `16 ^ level`, where `level` is any positive integer. Intuitively, this represents a Vector which has the maximum number of elements before a new level in the tree must be added.
325
+
A Vector is defined as "fully packed" if its Count is `16 ^ level`, where `level` is any positive integer. Intuitively, this represents a Vector which has the maximum number of elements before a new level in the tree must be added.
317
326
318
327
All Vector encodings start with the tag byte and a VLC Count of elements in the Vector.
319
328
320
329
Subsequently:
321
-
- For Leaf Vectors, a Prefix Vector is encoded (which may be `nil`) that contains all elements up to the highest multiple of 16 less than the Count, followed by the Values
322
-
- For non-Leaf Vectors, Child Vectors are encoded where each child is the maximum size Packed Vector less than Count in length, except the last which is the Vector containing all remaining Values.
330
+
- For leaf cells, a packed prefix vector is encoded (which may be `nil`) that contains all elements up to the highest multiple of 16 less than the Count, followed by the Values
331
+
- For non-Leaf cells, Child Vectors are encoded where each child is the maximum size Packed Vector less than Count in length, except the last which is the Vector containing all remaining Values.
323
332
324
333
This Encoding has some elegant properties which make Convex Vectors particularly efficient in regular usage:
325
334
- Short Vectors (0-16 count) are always encoded in a single cell, which may require no further cell encodings in the common case that all elements are embedded.
326
335
- The last few elements of the Vector are usually in a Leaf Vector, which allows `O(1)` access and update to elements
327
-
- Append is always `O(1)` (since either it is a Leaf Vector, or the append creates a new Leaf Vector with the original Vector as its Prefix)
328
-
- For practical purposes, access and update is also `O(1)` (Note: technically `O(log n)` with a high branching factor, but upper bounds on vector size make this `O(1)` with a constant factor that accounts for the maximum possible depth)
336
+
- Append is `O(1)`, usually with a small constant (only extending the current leaf vector)
337
+
- Access and update are also `O(1)` (Note: could be considered `O(log n)` with a high branching factor, but upper bounds on vector size make this `O(1)` with a constant factor accounting for the maximum possible depth)
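
The counting rules for Vectors can be made concrete with a small Python sketch. All function names are hypothetical, and `child_counts` reflects one reading of the rule for non-leaf cells, namely that each child except possibly the last is the largest fully packed size (`16 ^ level`) strictly below the Count.

```python
def is_leaf_count(n: int) -> bool:
    """Leaf cell: count is 0, 16, or a positive non-multiple of 16."""
    return n == 0 or n == 16 or n % 16 != 0


def is_packed(n: int) -> bool:
    """Packed: count is a positive multiple of 16."""
    return n > 0 and n % 16 == 0


def is_fully_packed(n: int) -> bool:
    """Fully packed: count is 16 ^ level for some positive integer level."""
    if n < 16:
        return False
    while n % 16 == 0:
        n //= 16
    return n == 1


def leaf_prefix_length(n: int) -> int:
    """Length of a leaf cell's prefix: highest multiple of 16 below n (0 = nil)."""
    if not is_leaf_count(n):
        raise ValueError("not a leaf count")
    return 0 if n == 0 else ((n - 1) // 16) * 16


def child_counts(n: int) -> list[int]:
    """Counts of the children of a non-leaf cell (one reading of the rule)."""
    if is_leaf_count(n):
        raise ValueError("leaf cells have no child vectors")
    chunk = 16
    while chunk * 16 < n:  # largest 16 ^ level strictly below n
        chunk *= 16
    full, rest = divmod(n, chunk)
    return [chunk] * full + ([rest] if rest else [])
```

For example, a Vector of 35 elements is a leaf cell whose prefix holds the first 32 elements, while a Vector of 272 elements splits into children of 256 and 16 elements under this reading.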
338
+
339
+
### `0x81` List
340
+
341
+
A List is encoded exactly the same as a Vector, except:
342
+
- The tag byte is `0x81`
343
+
- The elements are logically considered to be in reversed order (i.e. the last element encoded is the first element of the list)
344
+
345
+
### `0x82` Map
346
+
347
+
```
348
+
If a leaf cell:
349
+
350
+
0x82 <VLC Count = n> <Key Ref | Value Ref> (repeated n times, in order of key hashes)
- 0x00 (if no entry present at this position in Index)
380
+
- 0x20 <Key Ref> <Value Ref> (if entry present)
381
+
382
+
<Depth> is an unsigned byte indicating the hex digit at which the entry / branch occurs
383
+
384
+
<Mask> is a 16 bit bitmap of which child Index nodes are present at the given depth (low bit = `0` ... high bit = `F`)
385
+
386
+
Special cases:
387
+
- If Count is 0 everything following the Count is omitted (the empty Index)
388
+
- If Count is 1 the first byte of <Entry> and everything following the Entry is omitted (single Key / Value pair)
389
+
```
390
+
391
+
An Index serves as a specialised map with BlobLike keys (Blobs, Strings, Addresses etc.). Logically, it is a mapping from byte arrays to values.
392
+
393
+
This encoding ensures that entries are encoded in lexicographic ordering. Unlike the hash based Maps, an Index is constrained to use only BlobLike keys, and cannot store two keys which have the same Blob representation (though the keys will retain their original type).
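
The `<Mask>` bitmap and the lexicographic ordering of keys can be illustrated with a short Python sketch. The helper names are hypothetical, and reading the high nibble of each byte as the first hex digit is an assumption based on the usual hex notation of blobs.

```python
def hex_digit(key: bytes, depth: int) -> int:
    """Return the hex digit (0-15) of key at the given depth."""
    byte = key[depth // 2]
    # ASSUMPTION: even depths take the high nibble, odd depths the low nibble.
    return byte >> 4 if depth % 2 == 0 else byte & 0x0F


def child_mask(keys: list[bytes], depth: int) -> int:
    """Build the 16-bit bitmap of digits present at depth (low bit = 0, high bit = F)."""
    mask = 0
    for key in keys:
        mask |= 1 << hex_digit(key, depth)
    return mask


# Keys 0x1234, 0x1f00 and 0xa000 use first digits 1, 1 and a.
keys = [bytes.fromhex("a000"), bytes.fromhex("1f00"), bytes.fromhex("1234")]
ordered = sorted(keys)  # Python compares bytes lexicographically
```

At depth 0 these three keys set bits 1 and 10 of the mask, and the two keys that share the first digit `1` diverge at depth 1 with digits `2` and `f`.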
394
+
395
+
### `0x88` Syntax
396
+
397
+
```
0x88 <Meta Ref> <Value Ref>

Where <Meta Ref> is either:
- 0x00 (nil) if there is no metadata (considered as empty map)
- A Ref to a non-empty map

The <Value Ref> can be any value.
```
406
+
407
+
Logically, a `Syntax` value is a wrapped value with a metadata map.
408
+
409
+
### `0x90` Signed
410
+
411
+
Represents a digitally signed data value.
412
+
413
+
```
0x90 <Public Key> <Signature> <Value Ref>

Where:
- Public Key is 32 bytes Ed25519 public key
- Signature is 64 bytes Ed25519 signature
```
420
+
421
+
The Signature is expected to be the Ed25519 signature of the Value Ref encoding. This means that the signed bytes will be either an embedded value (1-140 bytes), or an `0x20` Ref to a branch cell (33 bytes). This format is effective because it means the encoding of the Signed data value is sufficient to validate the Signature without any external references.
422
+
423
+
The signature may or may not be valid: an invalid signature is still a valid value from an encoding perspective.
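
The fixed layout of a Signed encoding makes it straightforward to split into its parts. The Python sketch below uses a hypothetical helper, `split_signed`, and deliberately stops short of verification: checking the signature would require an Ed25519 library, which is not shown here.

```python
SIGNED_TAG = 0x90
PUBLIC_KEY_LENGTH = 32  # Ed25519 public key
SIGNATURE_LENGTH = 64   # Ed25519 signature


def split_signed(encoding: bytes) -> tuple[bytes, bytes, bytes]:
    """Split a Signed encoding into (public key, signature, value ref bytes)."""
    header = 1 + PUBLIC_KEY_LENGTH + SIGNATURE_LENGTH
    if len(encoding) <= header or encoding[0] != SIGNED_TAG:
        raise ValueError("not a Signed encoding")
    public_key = encoding[1:1 + PUBLIC_KEY_LENGTH]
    signature = encoding[1 + PUBLIC_KEY_LENGTH:header]
    # The remaining bytes are the Value Ref encoding, i.e. the signed message.
    value_ref = encoding[header:]
    return public_key, signature, value_ref
```

The third element returned is exactly the message a verifier would pass to Ed25519 along with the public key and signature, which is why no external references are needed to validate the signature.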