Commit f3ae12f

authored
Merge pull request #48 from Convex-Dev/develop
More Doc updates
2 parents 155ddd4 + e279d62 commit f3ae12f

File tree

17 files changed

+707
-271
lines changed

blog/2024-09-30-cvm-reader/index.md

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
1+
---
2+
slug: tagged-values
3+
title: Reader upgrades
4+
authors: [mikera]
5+
tags: [convex, reader, lisp]
6+
---
7+
8+
The Reader converts text into data. It's a key component in making Convex-based apps work effectively in multiple ways:
9+
10+
- **Source Code** like `(transfer #101 1000000)` is transformed into trees of code ready for execution on the CVM.
11+
- **REST APIs** can use Convex data in text form with the MIME type `application/cvx`
12+
- **Arbitrary Data** can be specified in `.cvx` files like `[{:name "Bob" :age 42} {:name "Sarah" :age 37}]`
13+
14+
In preparation for Protonet, we've been putting the final touches on the Reader. So what's new?
15+
16+
<!-- truncate -->
17+
18+
### Performance Upgrades
19+
20+
The Convex Reader is now about **10x faster than before**. It can now parse roughly 15 MB/s of CVX data files into lattice data structures per thread, up from about 1.5 MB/s before.
21+
22+
That's pretty fast: remember we are transforming text into full cryptographically verifiable lattice data structures here, not simply scanning a file to gather statistics. It's certainly comparable to high-performance JSON parsing libraries that produce full object graphs.
23+
24+
This means that you can confidently implement high performance APIs that take `application/cvx` data as input, such as in the Convex REST API Server over HTTPS.
25+
26+
### Tagged Values
27+
28+
The Reader now supports **tagged values**. Tagged values are used to specify special data types that the Reader otherwise wouldn't be able to produce directly. As a motivating example, consider the `Index` type that maps blob keys to values:
29+
30+
```clojure
31+
;; You can construct an Index with the `index` function
32+
(index 0x1234 :bob)
33+
=> #Index {0x1234 :bob}
34+
35+
;; However, if you try to specify it as a literal, you just get a regular map:
36+
{0x1234 :bob}
37+
=> {0x1234 :bob}
38+
39+
;; These are not the same thing! An Index is a special type distinct from a map
40+
;; NOTE: Different type => different lattice encoding => different hash => not equal!
41+
(= {0x1234 :bob} (index 0x1234 :bob))
42+
=> false
43+
44+
;; But now you can use a tagged value to create an index directly :-)
45+
#Index {0x1234 :bob}
46+
=> #Index {0x1234 :bob}
47+
48+
;; This produces the exact Index value we expect
49+
(= #Index {0x1234 :bob} (index 0x1234 :bob))
50+
=> true
51+
```
52+
53+
Tagged values were inspired by Clojure's Extensible Data Notation (EDN), which allows developers to support custom types in the Clojure Reader. We don't need anything quite as sophisticated on the CVM yet (since custom user-defined types probably won't be coming until Convex v2), but it's already a very handy tool for dealing with the specialised CVM types that do exist.
54+
55+
### Stricter parsing
56+
57+
We've tightened some of the parsing rules so that potentially ambiguous input won't be misread. For example the `/` symbol as used for path lookup is now stricter with respect to whitespace:
58+
59+
```clojure
60+
;; this won't work: the spaces mean that `/` is seen as a separate Symbol
61+
(#9 / resolve 'convex.core)
62+
63+
;; This is OK
64+
(#9/resolve 'convex.core)
65+
=> #8
66+
```
67+
68+
We have literally thousands of unit tests checking all kinds of input combinations to the Reader, so the intensive testing combined with the stricter parsing rules should ensure predictable and consistent Reader behaviour for Protonet.
69+
70+
### Learn More
71+
72+
Full Reader specifications are outlined in [CAD032](/docs/cad/reader). For anyone wanting to work on the Reader or CVM data translation in general it's a great place to get started!

blog/tags.yml

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,3 +12,13 @@ community:
1212
label: Community
1313
permalink: /community
1414
description: Convex Developer Community
15+
16+
reader:
17+
label: Reader
18+
permalink: /reader
19+
description: Convex Reader
20+
21+
lisp:
22+
label: Lisp
23+
permalink: /lisp
24+
description: Convex Lisp

docs/cad/003_encoding/README.md

Lines changed: 128 additions & 33 deletions
Original file line numberDiff line numberDiff line change
@@ -10,15 +10,21 @@ Convex implements a standard **Encoding** format that represents any valid Conve
1010

1111
The Encoding model breaks Values into a Merkle DAG of one or more **Cells** that are individually encoded. Cells are immutable, and may therefore be safely shared by different values, or used multiple times in the same DAG. This technique of "structural sharing" is extremely important for the performance and memory efficiency of Convex.
1212

13+
## Special Requirements
1314

15+
Convex and related lattice infrastructure place some very specific requirements on the encoding format, which motivate the design of the encoding scheme described here:
16+
17+
- Every distinct value must have one and only one unique valid encoding, so that it can be hashed to a stable ID
18+
- It must be possible to encode / decode `n` bytes of data in `O(n)` time (DoS resistance)
19+
- There must be a fixed upper bound on the encoding size of any value (excluding referenced children) so that reading and writing can occur in fixed sized buffers
1420

1521
## Basic Rules
1622

1723
### Cells
1824

1925
The fundamental entities that are encoded are called Cells.
2026

21-
Cells may contain other cells by reference, and therefore a top-level cell can be regarded as a directed acyclic graph (DAG). Since cell encodings contain cryptographic hashes of the encodings of any branch referenced cells, this is furthermore a Merkle DAG.
27+
Cells may contain other cells by reference, and therefore a top-level cell can be regarded as a directed acyclic graph (DAG). Since cell encodings contain cryptographic hashes of the encodings of any referenced cells, this is furthermore a Merkle DAG.
2228

2329
### Branches
2430

@@ -32,7 +38,7 @@ Branches are an important optimisation, since they reduce the need to produce ma
3238

3339
The encoding MUST be a sequence of bytes.
3440

35-
Any given Cell MUST map to one and only one encoding.
41+
Any given cell MUST map to one and only one encoding.
3642

3743
Any two distinct (non-identical) cells MUST map to different encodings.
3844

@@ -153,13 +159,13 @@ The two Boolean Values `true` or `false` have the Encodings `0xb1` and `0xb0` re
153159

154160
Note: These Tags are chosen to aid human readability, such that the first hexadecimal digit `b` suggests "binary" or "boolean", and the second hexadecimal digit represents the bit value.
155161

156-
### `0x10` - `0x18` Integer ("SmallInt")
162+
### `0x10` - `0x18` Integer (Long)
157163

158164
```Encoding
159165
0x1n <n bytes of numeric data>
160166
```
161167

162-
A small integer value is encoded by the Tag byte followed by `n` bytes representing the signed 2's complement numeric value of the Integer. The integer must be represented in the minimum possible number of bytes (can be 0 additional bytes for the specific value `0`).
168+
A Long value is encoded by the Tag byte followed by `n` bytes representing the signed two's complement numeric value of the Integer. The Integer MUST be represented in the minimum possible number of bytes - excess leading bytes are an invalid encoding.
163169

164170
Note: The value zero is conveniently encoded in this scheme as the single byte `0x10`
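As a rough illustration of the minimum-length rule, the sketch below encodes and decodes a Long in Python. It is not the reference implementation: the function names are hypothetical, and big-endian byte order for the numeric part is an assumption that this section does not state explicitly.

```python
LONG_TAG_BASE = 0x10  # tags 0x10..0x18: the low nibble is the byte count n


def encode_long(value: int) -> bytes:
    """Encode a 64-bit signed integer as tag 0x1n plus n two's complement bytes."""
    if not -(1 << 63) <= value < (1 << 63):
        raise ValueError("value does not fit in a 64-bit Long")
    n = 0
    if value != 0:
        n = 1
        # Grow n until the signed n-byte range contains the value.
        while not -(1 << (8 * n - 1)) <= value < (1 << (8 * n - 1)):
            n += 1
    # Assumption: numeric bytes are big-endian.
    return bytes([LONG_TAG_BASE + n]) + value.to_bytes(n, "big", signed=True)


def decode_long(data: bytes) -> int:
    """Decode a Long, rejecting encodings with excess leading bytes."""
    n = data[0] - LONG_TAG_BASE
    if not 0 <= n <= 8 or len(data) != n + 1:
        raise ValueError("not a well-formed Long encoding")
    value = int.from_bytes(data[1:], "big", signed=True)
    # Canonical check: re-encoding must reproduce the input exactly.
    if encode_long(value) != data:
        raise ValueError("non-minimal Long encoding")
    return value
```

Under these assumptions `encode_long(0)` yields the single byte `0x10`, and a padded input such as `0x12 0x00 0x01` is rejected because the value `1` fits in one byte.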
165171

@@ -168,7 +174,7 @@ Note: This encoding is chosen in preference to a VLC encoding because:
168174
- It is consistent with the natural encoding for two's complement integers on most systems
169175
- The numerical part is consistent with the format for BigInts
170176

171-
### `0x19` Integer ("BigInt")
177+
### `0x19` Integer (BigInt)
172178

173179
```
174180
0x19 <VLC Count length of Integer = n> <n bytes of data>
@@ -188,21 +194,6 @@ With the exception of the Tag byte, The encoding of a BigInt is defined to be ex
188194

189195
A Double value is encoded as the Tag byte followed by 8 bytes standard representation of an IEEE 754 double-precision floating point value.
190196

191-
### `0x3c` - `0x3f` Character
192-
193-
```
194-
Tag determines the length in bytes of the Unicode code point value
195-
0x3c <1 Byte>
196-
0x3d <2 Bytes>
197-
0x3e <3 Bytes>
198-
0x3f <4 Bytes> (reserved, not currently possible?)
199-
```
200-
201-
A Character value is encoded by the Tag byte followed by 1-4 bytes representing the Unicode code point as an unsigned integer.
202-
203-
A Character encoding is invalid if:
204-
- More bytes are used than necessary (i.e. a leading byte of zero)
205-
- The code point is beyond the maximum allowable (0x10ffff)
206197

207198
### `0x20` Ref
208199

@@ -282,52 +273,156 @@ Importantly, this design allows:
282273
### `0x32` Symbol
283274

284275
```
285-
0x32 <VLC Count = n> <n bytes UTF-8 String>
276+
0x32 <Count Byte = n> <n bytes UTF-8 String>
286277
```
287278

288-
A Symbol is encoded with the Tag byte, a VLC Count length `n`, and `n` bytes of UTF-8 encoded characters.
279+
A Symbol is encoded with the Tag byte, an unsigned count byte `n`, and `n` bytes of UTF-8 encoded characters.
289280

290281
The Symbol MUST have a length of 1-128 UTF-8 bytes
291282

292283
### `0x33` Keyword
293284

294285
```
295-
0x32 <VLC Count = n> <n bytes UTF-8 String>
286+
0x33 <Count Byte = n> <n bytes UTF-8 String>
296287
```
297288

298-
A Keyword is encoded with the Tag byte, a VLC Count length `n`, and `n` bytes of UTF-8 encoded characters.
289+
A Keyword is encoded with the Tag byte, an unsigned count byte `n`, and `n` bytes of UTF-8 encoded characters.
299290

300291
The Keyword MUST have a length of 1-128 UTF-8 bytes
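A minimal sketch of the Symbol and Keyword layout follows, using the tags given in the section headings (`0x32` and `0x33`). The helper name `encode_name` is hypothetical, and it assumes the name is passed without any leading `:` or quote character.

```python
SYMBOL_TAG = 0x32
KEYWORD_TAG = 0x33


def encode_name(tag: int, name: str) -> bytes:
    """Encode a Symbol or Keyword as: tag, unsigned count byte, UTF-8 bytes."""
    data = name.encode("utf-8")
    # The length limit applies to UTF-8 bytes, not to characters.
    if not 1 <= len(data) <= 128:
        raise ValueError("name must be 1-128 UTF-8 bytes")
    return bytes([tag, len(data)]) + data
```

Because the maximum length is 128, the count always fits in a single byte, so no VLC decoding is needed for these two types.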
301292

293+
### `0x3c` - `0x3f` Character
294+
295+
```
296+
Tag determines the length in bytes of the Unicode code point value
297+
0x3c <1 Byte>
298+
0x3d <2 Bytes>
299+
0x3e <3 Bytes>
300+
0x3f <4 Bytes> (reserved, not currently possible?)
301+
```
302+
303+
A Character value is encoded by the Tag byte followed by 1-4 bytes representing the Unicode code point as an unsigned integer.
304+
305+
A Character encoding is invalid if:
306+
- More bytes are used than necessary (i.e. a leading byte of zero)
307+
- The code point is beyond the maximum allowable (0x10ffff)
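The Character rules above can be sketched as follows. This is an illustration only: `encode_char` is a hypothetical name, big-endian byte order is an assumption, and code point 0 is assumed to take a single `0x00` byte.

```python
CHAR_TAG_BASE = 0x3C  # 0x3c = 1 byte, 0x3d = 2 bytes, 0x3e = 3 bytes, 0x3f = 4 bytes
MAX_CODE_POINT = 0x10FFFF


def encode_char(code_point: int) -> bytes:
    """Encode a Unicode code point in the minimum number of bytes."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("code point out of range")
    # Minimum byte count, with at least one byte even for code point 0.
    n = max(1, (code_point.bit_length() + 7) // 8)
    # Assumption: the code point is stored big-endian.
    return bytes([CHAR_TAG_BASE + n - 1]) + code_point.to_bytes(n, "big")
```

Since the maximum code point `0x10ffff` fits in 3 bytes, this sketch never emits the `0x3f` tag, which is consistent with the 4-byte form being marked as reserved.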
308+
302309
### `0x80` Vector
303310

304311
```
305-
If a Leaf Count:
312+
If a leaf cell:
306313
307314
0x80 <VLC Count = n> <Prefix Vector> <Value>(repeated 0-16 times)
308315
309-
If a non-Leaf Count:
316+
If a non-leaf cell:
310317
311318
0x80 <VLC Count = n> <Child Vector>(repeated 2-16 times)
312319
```
313320

314-
A Leaf Count `n` is defined as 0, 16, or any other positive integer which is not an exact multiple of 16.
321+
A leaf cell is a Vector with Count `n` being 0, 16, or any other positive integer which is not an exact multiple of 16.
322+
323+
A Vector is defined as "packed" if its count is a positive multiple of 16. A leaf vector which is packed must therefore have a count of exactly 16 - such vectors form the leaf nodes of a tree of non-leaf vectors.
315324

316-
A Vector is defined as "packed" if its Count is `16 ^ level`, where `level` is any positive integer. Intuitively, this represents a Vector which has the maximum number of elements before a new level in the tree must be added.
325+
A Vector is defined as "fully packed" if its Count is `16 ^ level`, where `level` is any positive integer. Intuitively, this represents a Vector which has the maximum number of elements before a new level in the tree must be added.
317326

318327
All Vector encodings start with the tag byte and a VLC Count of elements in the Vector.
319328

320329
Subsequently:
321-
- For Leaf Vectors, a Prefix Vector is encoded (which may be `nil`) that contains all elements up to the highest multiple of 16 less than the Count, followed by the Values
322-
- For non-Leaf Vectors, Child Vectors are encoded where each child is the maximum size Packed Vector less than Count in length, except the last which is the Vector containing all remaining Values.
330+
- For leaf cells, a packed prefix vector is encoded (which may be `nil`) that contains all elements up to the highest multiple of 16 less than the Count, followed by the Values
331+
- For non-Leaf cells, Child Vectors are encoded where each child is the maximum size Packed Vector less than Count in length, except the last which is the Vector containing all remaining Values.
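The count-based definitions above (leaf, packed, fully packed, and the prefix length of a leaf cell) can be expressed as small predicates. The function names are hypothetical and the sketch only restates the rules in this section; it does not build actual Vector cells.

```python
def is_leaf(count: int) -> bool:
    """A leaf cell has count 0, 16, or any count that is not a multiple of 16."""
    return count == 0 or count == 16 or count % 16 != 0


def is_packed(count: int) -> bool:
    """Packed: count is a positive multiple of 16."""
    return count > 0 and count % 16 == 0


def is_fully_packed(count: int) -> bool:
    """Fully packed: count is 16 ** level for some positive integer level."""
    if count < 16:
        return False
    while count % 16 == 0:
        count //= 16
    return count == 1


def leaf_prefix_length(count: int) -> int:
    """Elements held by the prefix vector of a leaf cell:
    the highest multiple of 16 strictly less than count (0 means a nil prefix)."""
    if not is_leaf(count):
        raise ValueError("only leaf cells carry a prefix vector")
    return ((count - 1) // 16) * 16 if count > 0 else 0
```

For example, a Vector of 35 elements is a leaf cell whose prefix holds 32 elements, leaving 3 values in the leaf itself, while a Vector of 32 elements is a non-leaf cell.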
323332

324333
This Encoding has some elegant properties which make Convex Vectors particularly efficient in regular usage:
325334
- Short Vectors (0-16 count) are always encoded in a single cell, which may require no further cell encodings in the common case that all elements are embedded.
326335
- The last few elements of the Vector are usually in a Leaf Vector, which allows `O(1)` access and update to elements
327-
- Append is always `O(1)` (since either it is a Leaf Vector, or the append creates a new Leaf Vector with the original Vector as its Prefix)
328-
- For practical purposes, access and update is also `O(1)` (Note: technically `O(log n)` with a high branching factor, but upper bounds on vector size make this `O(1)` with a constant factor that accounts for the maximum possible depth)
336+
- Append is `O(1)`, usually with a small constant (only extending the current leaf vector)
337+
- Access and update are also `O(1)` (Note: could be considered `O(log n)` with a high branching factor, but upper bounds on vector size make this `O(1)` with a constant factor accounting for the maximum possible depth)
338+
339+
### `0x81` List
340+
341+
A List is encoded exactly the same as a Vector, except:
342+
- The tag byte is `0x81`
343+
- The elements are logically considered to be in reversed order (i.e. the last element encoded is the first element of the list)
344+
345+
### `0x82` Map
346+
347+
```
348+
If a leaf cell:
349+
350+
0x82 <VLC Count = n> <Key Ref | Value Ref> (repeated n times, in order of key hashes)
351+
352+
If a non-leaf cell:
353+
354+
0x82 <VLC Count = n> <Shift Byte> <Mask> <Child Refs> (repeated 2-16 times)
355+
356+
Where:
357+
- <Shift Byte> specifies the hex position where the map branches (0 = at the first hex digit, etc.)
358+
- <Mask> is a 16-bit bitmask indicating which key hash hex values are included (low bit = `0` ... high bit = `F`)
359+
- <Child Refs> are Refs to Map cells which can be Leaf or non-Leaf nodes
360+
```
361+
362+
This encoding guarantees that all entries are encoded in the order of key hashes.
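To make the Mask field of a non-leaf Map cell more concrete, the sketch below derives a 16-bit mask from a set of key hashes at a given shift position. The helper names are hypothetical, the hashes are taken as raw bytes (no particular hash function is assumed), and reading the high nibble of each byte as the earlier hex digit is an assumption.

```python
def hex_digit(key_hash: bytes, position: int) -> int:
    """Return the hex digit (0-15) of key_hash at the given hex position.
    Assumption: the high nibble of each byte comes first."""
    byte = key_hash[position // 2]
    return byte >> 4 if position % 2 == 0 else byte & 0x0F


def child_mask(key_hashes: list[bytes], shift: int) -> int:
    """Build the 16-bit mask: bit d is set if some key hash has digit d at `shift`."""
    mask = 0
    for key_hash in key_hashes:
        mask |= 1 << hex_digit(key_hash, shift)
    return mask
```

With two key hashes starting `0a...` and `f1...`, a shift of 0 sets bits 0 and 15 (mask `0x8001`), so the cell would carry two child Refs in that digit order.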
363+
364+
365+
### `0x83` Set
366+
367+
A Set is encoded exactly the same as a Map, except:
368+
- The tag byte is `0x83`
369+
- The Value Refs are omitted
370+
371+
### `0x84` Index
372+
373+
```
374+
0x84 <VLC Count = n> <Entry> <Depth> <Mask> <Child Refs> (repeated 1-16 times)
375+
376+
Where:
377+
378+
<Entry> is either:
379+
- 0x00 (if no entry present at this position in Index)
380+
- 0x20 <Key Ref> <Value Ref> (if entry present)
381+
382+
<Depth> is an unsigned byte indicating the hex digit at which the entry / branch occurs
383+
384+
<Mask> is a 16 bit bitmap of which child Index nodes are present at the given depth (low bit = `0` ... high bit = `F`)
385+
386+
Special cases:
387+
- If Count is 0 everything following the Count is omitted (the empty Index)
388+
- If Count is 1 the first byte of <Entry> and everything following the Entry is omitted (single Key / Value pair)
389+
```
390+
391+
An Index serves as a specialised map with BlobLike keys (Blobs, Strings, Addresses etc.). Logically, it is a mapping from byte arrays to values.
392+
393+
This encoding ensures that entries are encoded in lexicographic ordering. Unlike the hash based Maps, an Index is constrained to use only BlobLike keys, and cannot store two keys which have the same Blob representation (though the keys will retain their original type).
394+
395+
### `0x88` Syntax
396+
397+
```
398+
0x88 <Meta Ref> <Value Ref>
399+
400+
Where <Meta Ref> is either:
401+
- 0x00 (nil) if there is no metadata (considered as empty map)
402+
- A Ref to a non-empty map
403+
404+
The <Value Ref> can be any value.
405+
```
406+
407+
Logically, a `Syntax` value is a wrapped value with a metadata map.
408+
409+
### `0x90` Signed
410+
411+
Represents a digitally signed data value.
412+
413+
```
414+
0x90 <Public Key> <Signature> <Value Ref>
415+
416+
Where:
417+
- Public Key is 32 bytes Ed25519 public key
418+
- Signature is 64 bytes Ed25519 signature
419+
```
420+
421+
The Signature is expected to be the Ed25519 signature of the Value Ref encoding. This means that the signed bytes will be either an embedded value (1-140 bytes), or an `0x20` Ref to a branch cell (33 bytes). This format is effective because it means the encoding of the Signed data value is sufficient to validate the Signature without any external references.
422+
423+
The signature may or may not be valid: an invalid signature is still a valid value from an encoding perspective.
329424

330-
### TODO: More tags
425+
### TODO: A few remaining tags
331426

332427
## Implementation Notes
333428
