feat(spec): v3: standardized encoding for list-item objects

johannschopplich · johannschopplich · commit 0bd6c9638c8d · 2025-11-24T14:38:02.000+01:00
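
Reviewer sketch (not part of the commit): the v3.0 layout this change standardizes in §10 can be illustrated with a minimal encoder. The function names `encode_list_item` and `encode_items` are hypothetical, as is the fixed indent size of 2. The sketch assumes the comma delimiter, a non-empty tabular array as the first field, and string or integer values that need no quoting; it is not the reference implementation.

```python
INDENT = 2  # assumed indent size for this sketch


def pad(depth: int) -> str:
    """Return the leading spaces for a given depth."""
    return " " * (INDENT * depth)


def encode_list_item(obj: dict, depth: int) -> list[str]:
    """Encode a list-item object whose first field is a tabular array.

    Simplifications (assumptions of this sketch): comma delimiter only,
    no quoting, non-empty rows, remaining fields are strings or integers.
    """
    (first_key, rows), *rest = obj.items()
    fields = list(rows[0].keys())
    # Hyphen and tabular header share one line at the list-item depth.
    header = first_key + "[" + str(len(rows)) + "]{" + ",".join(fields) + "}:"
    lines = [pad(depth) + "- " + header]
    # Tabular rows sit at depth +2 relative to the hyphen line.
    for row in rows:
        lines.append(pad(depth + 2) + ",".join(str(row[f]) for f in fields))
    # All other fields of the same object sit at depth +1.
    for key, value in rest:
        lines.append(pad(depth + 1) + key + ": " + str(value))
    return lines


def encode_items(key: str, items: list[dict]) -> str:
    """Wrap list items in an expanded-list array header at depth 0."""
    lines = [key + "[" + str(len(items)) + "]:"]
    for item in items:
        lines.extend(encode_list_item(item, depth=1))
    return "\n".join(lines)


doc = {
    "users": [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}],
    "status": "active",
}
print(encode_items("items", [doc]))
```

With this input the output matches the `input` string of the updated decode fixture "parses list items whose first field is a tabular array": rows are indented six spaces and `status` four spaces under a hyphen line indented two spaces.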
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,27 @@ All notable changes to the TOON specification will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [3.0] - 2025-11-24
+
+### Breaking Changes
+
+- Standardized encoding for list-item objects whose first field is a tabular array (§10):
+  - Encoders MUST emit `- key[N]{fields}:` on the hyphen line.
+  - Tabular rows MUST appear at depth +2 relative to the hyphen line.
+  - All other fields of the same object MUST appear at depth +1.
+  - The v2.0 shallow form (rows and fields at the same depth) and the v2.1 bare-hyphen form are no longer normative and MUST NOT be emitted by conforming encoders.
+
+### Changed
+
+- Encoding/decoding rules (§10) simplified to describe only the YAML-style pattern; legacy layouts are treated as generic nesting and are not covered by conformance tests.
+- Nested tabular list-item example in Appendix A updated to the canonical v3.0 form.
+
+### Migration from v2.1
+
+- Update encoders to emit the YAML-style form for list-item objects whose first field is a tabular array.
+- If you rely on v2.0/v2.1 layouts, keep decoder compatibility in non-strict or implementation-defined modes; the spec no longer requires or tests these patterns.
+- Optionally regenerate existing `.toon` files for consistent v3 formatting.
+
 ## [2.1] - 2025-11-23
 
 ### Changed
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 # TOON Format Specification
 
-[![SPEC v2.1](https://img.shields.io/badge/spec-v2.1-lightgrey)](./SPEC.md)
-[![Tests](https://img.shields.io/badge/tests-344-green)](./tests/fixtures/)
+[![SPEC v3.0](https://img.shields.io/badge/spec-v3.0-lightgrey)](./SPEC.md)
+[![Tests](https://img.shields.io/badge/tests-345-green)](./tests/fixtures/)
 [![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](./LICENSE)
 
 This repository contains the official specification for **Token-Oriented Object Notation (TOON)**, a compact, human-readable encoding of the JSON data model for LLM prompts. It provides a lossless serialization of the same objects, arrays, and primitives as JSON, but in a syntax that minimizes tokens and makes structure easy for models to follow.
@@ -10,7 +10,7 @@ This repository contains the official specification for **Token-Oriented Object
 
 [→ Read the full specification (SPEC.md)](./SPEC.md)
 
-- **Version:** 2.1 (2025-11-23)
+- **Version:** 3.0 (2025-11-24)
 - **Status:** Working Draft
 - **License:** MIT
 
diff --git a/SPEC.md b/SPEC.md
@@ -2,9 +2,9 @@
 
 ## Token-Oriented Object Notation
 
-**Version:** 2.1
+**Version:** 3.0
 
-**Date:** 2025-11-23
+**Date:** 2025-11-24
 
 **Status:** Working Draft
 
@@ -20,7 +20,7 @@ Token-Oriented Object Notation (TOON) is a line-oriented, indentation-based text
 
 ## Status of This Document
 
-This document is a Working Draft v2.1 and may be updated, replaced, or obsoleted. Implementers should monitor the canonical repository at https://github.com/toon-format/spec for changes.
+This document is a Working Draft v3.0 and may be updated, replaced, or obsoleted. Implementers should monitor the canonical repository at https://github.com/toon-format/spec for changes.
 
 This specification is stable for implementation but not yet finalized. Breaking changes may occur in future major versions.
 
@@ -227,12 +227,11 @@ Implementations that fail to conform to any MUST or REQUIRED level requirement a
 
 ## 3. Encoding Normalization (Reference Encoder)
 
-Encoders MUST normalize non-JSON values to the JSON data model before encoding:
+Encoders MUST normalize non-JSON values to the JSON data model before encoding. The mapping from host-specific types to JSON model is implementation-defined and MUST be documented.
 
 - Number:
   - Finite → number (canonical decimal form per Section 2). -0 → 0.
   - NaN, +Infinity, -Infinity → null.
-- Non-JSON types MUST be normalized to the JSON data model (object, array, string, number, boolean, or null) before encoding. The mapping from host-specific types to JSON model is implementation-defined and MUST be documented.
 - Examples of host-type normalization (non-normative):
   - Date/time objects → ISO 8601 string representation.
   - Set-like collections → array.
@@ -384,9 +383,9 @@ A string value MUST be quoted if any of the following is true:
 - It contains a colon (:), double quote ("), or backslash (\).
 - It contains brackets or braces ([, ], {, }).
 - It contains control characters: newline, carriage return, or tab.
-- It contains the relevant delimiter:
-  - Inside array scope: the active delimiter (Section 1).
-  - Outside array scope: the document delimiter (Section 1).
+- It contains the relevant delimiter (see §11 for complete delimiter rules):
+  - For inline array values and tabular row cells: the active delimiter from the nearest array header.
+  - For object field values (key: value): the document delimiter, even when the object is within an array's scope.
 - It equals "-" or starts with "-" (any hyphen at position 0).
 
 Otherwise, the string MAY be emitted without quotes. Unicode, emoji, and strings with internal (non-leading/trailing) spaces are safe unquoted provided they do not violate the conditions.
@@ -403,12 +402,10 @@ Encoders MAY perform key folding when enabled (see §13.4 for complete folding r
 
 ### 7.4 Decoding Rules for Strings and Keys (Decoding)
 
-- Quoted strings and keys MUST be unescaped per Section 7.1; any other escape MUST error. Quoted primitives remain strings.
-- Unquoted values:
-  - true/false/null → boolean/null
-  - Numeric tokens → numbers (with the leading-zero rule in Section 4)
-  - Otherwise → strings
-- Keys (quoted or unquoted) MUST be followed by ":"; missing colon MUST error.
+Decoding of value tokens follows §4 (unquoted type inference, quoted strings, numeric rules). This section adds key-specific requirements:
+
+- Quoted keys MUST be unescaped per Section 7.1; any other escape MUST error.
+- Keys (quoted or unquoted) MUST be followed by ":"; missing colon MUST error (see also §14.2).
 
 ## 8. Objects
 
@@ -421,7 +418,6 @@ Encoders MAY perform key folding when enabled (see §13.4 for complete folding r
 - Decoding:
   - A line "key:" with nothing after the colon at depth d opens an object; subsequent lines at depth > d belong to that object until the depth decreases to ≤ d.
   - Lines "key: value" at the same depth are sibling fields.
-  - Missing colon after a key MUST error.
 
 ## 9. Arrays
 
@@ -474,6 +470,7 @@ Decoding:
     - Delimiter before colon → row.
     - Colon before delimiter → key-value line (end of rows).
   - If a line has an unquoted colon but no unquoted active delimiter → key-value line (end of rows).
+- When a tabular array appears as the first field of a list-item object, indentation is governed by Section 10.
 
 ### 9.4 Mixed / Non-Uniform Arrays — Expanded List
 
@@ -499,48 +496,44 @@ Decoding:
 For an object appearing as a list item:
 
 - Empty object list item: a single "-" at the list-item indentation level.
-- Encoding selection (normative):
-  - When an object has **exactly one field** and that field encodes to a tabular array, encoders SHOULD use the compact form with the tabular header on the hyphen line:
-    - Tabular array: - key[N<delim?>]{fields}:
-      - Followed by tabular rows at depth +1 (relative to the hyphen line).
-  - For all other cases (multiple fields, or single non-tabular field), encoders SHOULD emit a bare hyphen on its own line:
-    - Bare hyphen: -
-    - All fields appear at depth +1 under the hyphen line in encounter order, using normal object field rules (Section 8).
-    - When a field is a tabular array, its header appears at depth +1 and its rows at depth +2 (relative to the hyphen line).
-- First field on the hyphen line (legacy encoding, still valid for decoding):
-  - Primitive: - key: value
-  - Primitive array: - key[M<delim?>]: v1<delim>…
-  - Tabular array: - key[N<delim?>]{fields}:
-    - Followed by tabular rows at depth +1 (relative to the hyphen line).
-  - Non-uniform array: - key[N<delim?>]:
-    - Followed by list items at depth +1.
-  - Object: - key:
-    - Nested object fields appear at depth +2 (i.e., one deeper than subsequent sibling fields of the same list item).
-  - Remaining fields of the same object appear at depth +1 under the hyphen line in encounter order, using normal object field rules.
-
-Decoding:
-- The first field is parsed from the hyphen line. If it is a nested object (- key:), nested fields are at +2 relative to the hyphen line; subsequent fields of the same list item are at +1.
-- If the first field is a tabular header on the hyphen line, its rows are at +1; subsequent sibling fields continue at +1 after the rows.
+- Encoding (normative):
+  - When a list-item object has a tabular array (Section 9.3) as its first field in encounter order, encoders MUST emit the tabular header on the hyphen line:
+    - The hyphen and tabular header appear on the same line at the list-item depth: - key[N<delim?>]{fields}:
+    - Tabular rows MUST appear at depth +2 (relative to the hyphen line).
+    - All other fields of the same object MUST appear at depth +1 under the hyphen line, in encounter order, using normal object field rules (Section 8).
+    - Encoders MUST NOT emit tabular rows at depth +1 or sibling fields at the same depth as rows when the first field is a tabular array.
+  - For all other cases (first field is not a tabular array), encoders SHOULD place the first field on the hyphen line. A bare hyphen on its own line is used only for empty list-item objects.
+- Decoding (normative):
+  - When a decoder encounters a list-item line of the form - key[N<delim?>]{fields}: at depth d, it MUST treat this as the start of a tabular array field named key in the list-item object.
+  - Lines at depth d+2 that conform to tabular row syntax (Section 9.3) are rows of that tabular array.
+  - Lines at depth d+1 are additional fields of the same list-item object; the presence of a line at depth d+1 after rows terminates the rows.
+  - All other object-as-list-item patterns (bare hyphen, first field on hyphen line for non-tabular values) are decoded according to the general rules in Section 8 and Section 9.
 
 ## 11. Delimiters
 
 - Supported delimiters:
   - Comma (default): header omits the delimiter symbol.
   - Tab: header includes HTAB inside brackets and braces (e.g., [N<TAB>], {a<TAB>b}); rows/inline arrays use tabs.
   - Pipe: header includes "|" inside brackets and braces; rows/inline arrays use "|".
-- Document vs Active delimiter:
-  - Encoders select a document delimiter (option) that influences quoting for all object values (key: value) throughout the document.
-  - Inside an array header's scope, the active delimiter governs splitting and quoting only for inline arrays and tabular rows that the header introduces. Object values (key: value) follow document-delimiter quoting rules regardless of array scope.
-- Delimiter-aware quoting (encoding):
-  - Inline array values and tabular row cells: strings containing the active delimiter MUST be quoted to avoid splitting.
-  - Object values (key: value): encoders use the document delimiter to decide delimiter-aware quoting, regardless of whether the object appears within an array's scope.
-  - Strings containing non-active delimiters do not require quoting unless another quoting condition applies (Section 7.2).
-- Delimiter-aware parsing (decoding):
-  - Inline arrays and tabular rows MUST be split only on the active delimiter declared by the nearest array header.
+
+### 11.1 Encoding Rules (Normative for Encoders)
+
+- Document delimiter: Encoders select a document delimiter (option: comma, tab, pipe; default comma) that influences quoting for all object field values (key: value) throughout the document.
+- Active delimiter: Inside an array header's scope, the active delimiter governs quoting only for inline array values and tabular row cells.
+- Delimiter-aware quoting:
+  - Inline array values and tabular row cells: strings containing the active delimiter MUST be quoted.
+  - Object field values (key: value): encoders use the document delimiter to decide delimiter-aware quoting, regardless of whether the object appears within an array's scope.
+  - Strings containing non-active delimiters do not require quoting unless another condition applies (§7.2).
+
+### 11.2 Decoding Rules (Normative for Decoders)
+
+- Active delimiter: Decoders use only the active delimiter declared by the nearest array header to split inline arrays and tabular rows.
+- Delimiter-aware parsing:
+  - Inline arrays and tabular rows MUST be split only on the active delimiter.
   - Splitting MUST preserve empty tokens; surrounding spaces are trimmed, and empty tokens decode to the empty string.
-  - Strings containing the active delimiter MUST be quoted to avoid splitting; non-active delimiters MUST NOT cause splits.
   - Nested headers may change the active delimiter; decoding MUST use the delimiter declared by the nearest header.
-  - If the bracket declares tab or pipe, the same symbol MUST be used in the fields segment and for splitting all rows/values in that scope.
+  - If the bracket declares tab or pipe, the same symbol MUST be used in the fields segment and for splitting all rows/values in that scope (§6).
+- Object field values (key: value): Decoders parse the entire post-colon token as a single value; document delimiter is not a decoder concept.
 
 ## 12. Indentation and Whitespace
 
@@ -738,12 +731,14 @@ When strict mode is enabled (default), decoders MUST error on the following cond
 
 ### 14.3 Indentation Errors
 
+See §12 for indentation semantics. In strict mode, decoders MUST error on:
 - Leading spaces not a multiple of indentSize.
 - Any tab used in indentation (tabs allowed in quoted strings and as HTAB delimiter).
 
 ### 14.4 Structural Errors
 
-- Blank lines inside arrays/tabular rows.
+See §12 for blank line semantics. In strict mode, decoders MUST error on:
+- Blank lines inside arrays/tabular rows (between the first and last item/row).
 
 For root-form rules, including handling of empty documents, see §5.
 
@@ -1000,14 +995,13 @@ items[2]:
 Nested tabular inside a list item:
 ```
 items[1]:
-  -
-    users[2]{id,name}:
+  - users[2]{id,name}:
       1,Ada
       2,Bob
     status: active
 ```
 
-Note: Encoders use this format (bare hyphen with all fields indented) for objects with multiple fields. Older encodings may place the first field on the hyphen line; both are valid for decoders.
+Note: When a list-item object has a tabular array as its first field, encoders emit the tabular header on the hyphen line with rows at depth +2 and other fields at depth +1. This is the canonical encoding for list-item objects whose first field is a tabular array.
 
 Delimiter variations:
 ```
@@ -1235,6 +1229,10 @@ Note: Host-type normalization tests (e.g., BigInt, Date, Set, Map) are language-
 
 This appendix summarizes major changes between spec versions. For the complete changelog, see [`CHANGELOG.md`](./CHANGELOG.md) in the specification repository.
 
+### v3.0 (2025-11-24)
+
+- Standardized encoding for list-item objects whose first field is a tabular array (§10).
+
 ### v2.1 (2025-11-23)
 
 - Tightened canonical encoding for objects as list items (§10): bare `-` for multi-field objects, compact `- key[N]{fields}:` only for single-field tabular arrays, to improve visual consistency and LLM readability.
diff --git a/tests/fixtures/decode/arrays-nested.json b/tests/fixtures/decode/arrays-nested.json
@@ -1,5 +1,5 @@
 {
-  "version": "2.1",
+  "version": "3.0",
   "category": "decode",
   "description": "Nested and mixed array decoding - list format, arrays of arrays, root arrays, mixed types",
   "tests": [
@@ -52,8 +52,8 @@
       "specSection": "9.4"
     },
     {
-      "name": "parses nested tabular arrays as first field on hyphen line (legacy)",
-      "input": "items[1]:\n  - users[2]{id,name}:\n    1,Ada\n    2,Bob\n    status: active",
+      "name": "parses list items whose first field is a tabular array",
+      "input": "items[1]:\n  - users[2]{id,name}:\n      1,Ada\n      2,Bob\n    status: active",
       "expected": {
         "items": [
           {
@@ -66,25 +66,23 @@
         ]
       },
       "specSection": "10",
-      "note": "Still valid for backward compatibility"
+      "note": "Canonical encoding: tabular header on hyphen line, rows at depth +2, sibling fields at depth +1"
     },
     {
-      "name": "parses nested tabular arrays in list items with bare hyphen",
-      "input": "items[1]:\n  -\n    users[2]{id,name}:\n      1,Ada\n      2,Bob\n    status: active",
+      "name": "parses single-field list-item object with tabular array",
+      "input": "items[1]:\n  - users[2]{id,name}:\n      1,Ada\n      2,Bob",
       "expected": {
         "items": [
           {
             "users": [
               { "id": 1, "name": "Ada" },
               { "id": 2, "name": "Bob" }
-            ],
-            "status": "active"
+            ]
           }
         ]
       },
       "specSection": "10",
-      "minSpecVersion": "2.1",
-      "note": "Canonical v2.1+ encoding (bare hyphen with all fields indented)"
+      "note": "Single-field list-item object: only the tabular array, no sibling fields"
     },
     {
       "name": "parses objects containing arrays (including empty arrays) in list format",
@@ -98,7 +96,7 @@
     },
     {
       "name": "parses arrays of arrays within objects",
-      "input": "items[1]:\n  - matrix[2]:\n    - [2]: 1,2\n    - [2]: 3,4\n    name: grid",
+      "input": "items[1]:\n  - matrix[2]:\n      - [2]: 1,2\n      - [2]: 3,4\n    name: grid",
       "expected": {
         "items": [
           { "matrix": [[1, 2], [3, 4]], "name": "grid" }
diff --git a/tests/fixtures/encode/arrays-nested.json b/tests/fixtures/encode/arrays-nested.json
@@ -1,5 +1,5 @@
 {
-  "version": "2.1",
+  "version": "3.0",
   "category": "encode",
   "description": "Nested and mixed array encoding - arrays of arrays, mixed type arrays, root arrays",
   "tests": [
@@ -50,16 +50,14 @@
     {
       "name": "encodes root-level array of non-uniform objects in list format",
       "input": [{ "id": 1 }, { "id": 2, "name": "Ada" }],
-      "expected": "[2]:\n  -\n    id: 1\n  -\n    id: 2\n    name: Ada",
-      "specSection": "9.4",
-      "minSpecVersion": "2.1"
+      "expected": "[2]:\n  - id: 1\n  - id: 2\n    name: Ada",
+      "specSection": "9.4"
     },
     {
       "name": "encodes root-level array mixing primitive, object, and array of objects in list format",
       "input": ["summary", { "id": 1, "name": "Ada" }, [{ "id": 2 }, { "status": "draft" }]],
-      "expected": "[3]:\n  - summary\n  -\n    id: 1\n    name: Ada\n  - [2]:\n    -\n      id: 2\n    -\n      status: draft",
-      "specSection": "9.4",
-      "minSpecVersion": "2.1"
+      "expected": "[3]:\n  - summary\n  - id: 1\n    name: Ada\n  - [2]:\n    - id: 2\n    - status: draft",
+      "specSection": "9.4"
     },
     {
       "name": "encodes root-level arrays of arrays",
@@ -92,18 +90,16 @@
       "input": {
         "items": [1, { "a": 1 }, "text"]
       },
-      "expected": "items[3]:\n  - 1\n  -\n    a: 1\n  - text",
-      "specSection": "9.4",
-      "minSpecVersion": "2.1"
+      "expected": "items[3]:\n  - 1\n  - a: 1\n  - text",
+      "specSection": "9.4"
     },
     {
       "name": "uses list format for arrays mixing objects and arrays",
       "input": {
         "items": [{ "a": 1 }, [1, 2]]
       },
-      "expected": "items[2]:\n  -\n    a: 1\n  - [2]: 1,2",
-      "specSection": "9.4",
-      "minSpecVersion": "2.1"
+      "expected": "items[2]:\n  - a: 1\n  - [2]: 1,2",
+      "specSection": "9.4"
     }
   ]
 }
diff --git a/tests/fixtures/encode/arrays-objects.json b/tests/fixtures/encode/arrays-objects.json