feat(spec): parse nested tabular arrays in list items with bare hyphen

johannschopplich · johannschopplich · commit a5c25a1b9e15 · 2025-11-24T08:28:14.000+01:00
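
For reviewers, the encoder-side selection rule in §10 can be sketched roughly as below. This is a minimal illustration, not the reference implementation. The helper names (`fmt`, `is_tabular`, `encode_field`, `encode_list_item`, `encode_list`) are hypothetical, and the sketch assumes 2-space indentation, no string quoting, no number canonicalization, and list format for the outer array. It only handles primitive and tabular fields.

```python
from typing import Any

INDENT = "  "  # assumption: indent size of 2 spaces, as used in the fixtures


def fmt(value: Any) -> str:
    """Render a primitive; quoting and number canonicalization are omitted."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def is_primitive(value: Any) -> bool:
    return value is None or isinstance(value, (bool, int, float, str))


def is_tabular(value: Any) -> bool:
    """Non-empty list of objects sharing one key set, with primitive values only."""
    if not isinstance(value, list) or not value:
        return False
    if not all(isinstance(row, dict) and row for row in value):
        return False
    keys = set(value[0])
    return all(
        set(row) == keys and all(is_primitive(v) for v in row.values())
        for row in value
    )


def encode_field(key: str, value: Any, depth: int) -> list[str]:
    """Encode one object field at the given depth (primitive or tabular only)."""
    pad = INDENT * depth
    if is_tabular(value):
        fields = list(value[0])
        cols = ",".join(fields)
        header = f"{pad}{key}[{len(value)}]{{{cols}}}:"
        row_pad = INDENT * (depth + 1)
        rows = [row_pad + ",".join(fmt(row[f]) for f in fields) for row in value]
        return [header, *rows]
    if is_primitive(value):
        return [f"{pad}{key}: {fmt(value)}"]
    raise NotImplementedError("sketch covers primitive and tabular fields only")


def encode_list_item(obj: dict, depth: int) -> list[str]:
    """Apply the v2.1 selection rule for an object appearing as a list item."""
    pad = INDENT * depth
    if len(obj) == 1:
        ((key, value),) = obj.items()
        if is_tabular(value):
            # Compact form: tabular header on the hyphen line, rows at depth +1.
            lines = encode_field(key, value, depth)
            lines[0] = f"{pad}- {lines[0].lstrip()}"
            return lines
    # All other cases: bare hyphen, every field at depth +1
    # (tabular headers at +1, their rows at +2).
    lines = [f"{pad}-"]
    for key, value in obj.items():
        lines.extend(encode_field(key, value, depth + 1))
    return lines


def encode_list(key: str, items: list, depth: int = 0) -> str:
    """Encode an array in list format; pass key="" for a root-level array."""
    lines = [f"{INDENT * depth}{key}[{len(items)}]:"]
    for item in items:
        if isinstance(item, dict):
            lines.extend(encode_list_item(item, depth + 1))
        elif is_primitive(item):
            lines.append(f"{INDENT * (depth + 1)}- {fmt(item)}")
        else:
            raise NotImplementedError("nested arrays are outside this sketch")
    return "\n".join(lines)
```

Under these assumptions, `encode_list("", [{"id": 1}, {"id": 2, "name": "Ada"}])` reproduces the updated expectation in `tests/fixtures/encode/arrays-nested.json`, and a list item whose only field is a tabular array keeps the compact `- key[N]{fields}:` form.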
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,87 +5,83 @@ All notable changes to the TOON specification will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [2.1] - 2025-11-23
+
+### Changed
+
+- Canonical encoding for objects as list items (§10):
+  - Encoders SHOULD emit `- key[N]{fields}:` only when the list-item object has exactly one field and that field is a tabular array.
+  - In all other cases, encoders SHOULD emit a bare `-` line and place all fields at depth +1; tabular array headers then appear at depth +1 and their rows at depth +2.
+
 ## [2.0] - 2025-11-10
 
 ### Breaking Changes
 
-- **Removed:** Length marker (`#`) prefix in array headers has been completely removed from the specification
-- The `[#N]` format is no longer valid syntax. All array headers MUST use `[N]` format only
-- Encoders MUST NOT emit `[#N]` format
-- Decoders MUST NOT accept `[#N]` format (breaking change from v1.5)
+- Removed `[#N]` length-marker syntax in array headers; `[N]` is now the only valid format.
+- Encoders MUST NOT emit `[#N]`; decoders MUST reject it.
 
 ### Removed
 
-- All references to length marker from terminology (§1.4), header syntax (§6), ABNF grammar, conformance requirements (§13.2), and parsing helpers (Appendix B)
-- `lengthMarker` encoder option removed from all implementations
-- Length marker test fixtures removed
+- The `lengthMarker` encoder option and any CLI flags exposing it.
 
 ### Migration from v1.5
 
-- Update decoder implementations to reject `[#N]` syntax
-- Convert any existing `.toon` files using `[#N]` format to `[N]` format
-- Remove `lengthMarker` option from encoder configurations
-- Remove `--length-marker` CLI flags if present
+- Update decoders to reject `[#N]` syntax.
+- Convert existing `.toon` files using `[#N]` to `[N]`.
+- Remove `lengthMarker` configuration and CLI options.
 
 ## [1.5] - 2025-11-08
 
 ### Added
 
-- Optional key folding for encoders: `keyFolding="safe"` mode with `flattenDepth` control to collapse single-key object chains into dotted-path notation (§13.4)
-- Optional path expansion for decoders: `expandPaths="safe"` mode to split dotted keys into nested objects, with conflict resolution tied to `strict` option (§13.4, §14.5)
-- IdentifierSegment terminology and path separator definition (fixed to `"."` in v1.5) (§1.9)
-- Deep-merge semantics for path expansion: recursive merge for objects, error on conflict when `strict=true`, last-write-wins (LWW) when `strict=false` (§13.4)
+- Optional key folding for encoders: `keyFolding="safe"` with `flattenDepth` to collapse single-key object chains into dotted paths (§13.4).
+- Optional path expansion for decoders: `expandPaths="safe"` to split dotted keys into nested objects with deep-merge semantics and conflict handling tied to `strict` (§13.4, §14.5).
+- IdentifierSegment terminology and fixed `"."` path separator for safe folding/expansion (§1.9).
 
 ### Changed
 
-- Both new features default to OFF and are fully backward-compatible
-- Safe-mode folding requires IdentifierSegment validation, collision avoidance, and no quoting
+- Safe-mode folding requires IdentifierSegment-only segments, no path separator in segments, no quoting, and collision avoidance.
+- Both features default to `off` and are backward-compatible.
 
 ## [1.4] - 2025-11-05
 
 ### Changed
 
-- Removed JavaScript-specific normalization details from specification; replaced with language-agnostic requirements (Section 3)
-- Defined canonical number format for encoders: no exponent notation, no trailing zeros, no leading zeros except "0" (Section 2)
-- Clarified decoder handling of exponent notation and out-of-range numbers (Section 2)
-- Expanded `\w` regex notation to explicit character class `[A-Za-z0-9_]` for cross-language clarity (Section 7.3)
-- Clarified non-strict mode tab handling as implementation-defined (Section 12)
+- Generalized normalization rules and defined canonical number format for encoders (no exponent notation, no trailing zeros, no leading zeros except `"0"`), plus decoder handling of exponent forms and out-of-range numbers (§2-§3).
+- Replaced `\w` with explicit `[A-Za-z0-9_]` in key regexes for cross-language clarity (§7.3).
+- Clarified non-strict mode tab handling as implementation-defined (§12).
 
 ### Added
 
-- Appendix G: Host Type Normalization Examples with guidance for Go, JavaScript, Python, and Rust implementations
+- Appendix G with host-type normalization examples for Go, JavaScript, Python, and Rust.
 
 ## [1.3] - 2025-10-31
 
 ### Added
 
-- Numeric precision requirements: JavaScript implementations SHOULD use `Number.toString()` precision (15-17 digits), all implementations MUST preserve round-trip fidelity (Section 2)
-- RFC 5234 core rules (ALPHA, DIGIT, DQUOTE, HTAB, LF, SP) to ABNF grammar definitions (Section 6)
+- Numeric precision requirements: JavaScript implementations SHOULD use `Number.toString()` precision (15–17 digits); all implementations MUST preserve round-trip fidelity (§2).
+- RFC 5234 core rules (ALPHA, DIGIT, DQUOTE, HTAB, LF, SP) to ABNF grammar definitions (§6).
 
 ## [1.2] - 2025-10-29
 
 ### Changed
 
-- Clarified delimiter scoping behavior between array headers
-- Tightened strict-mode indentation requirements: leading spaces MUST be exact multiples of indentSize; tabs in indentation MUST error
-- Defined blank-line and trailing-newline decoding behavior with explicit skipping rules outside arrays
-- Clarified hyphen-based quoting: "-" or any string starting with "-" MUST be quoted
-- Clarified BigInt normalization: values outside safe integer range are converted to quoted decimal strings
-- Clarified row/key disambiguation: uses first unquoted delimiter vs colon position
+- Tightened delimiter scoping, indentation, blank-line handling, and hyphen-based quoting rules (§11-§12).
+- Clarified BigInt normalization (out-of-range values → quoted decimal strings) and row/key disambiguation (first unquoted delimiter vs colon) (§2, §9.3).
 
 ## [1.1] - 2025-10-29
 
 ### Added
 
-- Strict-mode rules
-- Delimiter-aware parsing
-- Decoder options (indent, strict)
+- Strict-mode rules.
+- Delimiter-aware parsing.
+- Decoder options (`indent`, `strict`).
 
 ## [1.0] - 2025-10-28
 
 ### Added
 
-- Initial specification release
-- Encoding normalization rules
-- Decoding interpretation guidelines
-- Conformance requirements
+- Initial specification release.
+- Encoding normalization rules.
+- Decoding interpretation guidelines.
+- Conformance requirements.
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 # TOON Format Specification
 
-[![SPEC v2.0](https://img.shields.io/badge/spec-v2.0-lightgrey)](./SPEC.md)
-[![Tests](https://img.shields.io/badge/tests-342-green)](./tests/fixtures/)
+[![SPEC v2.1](https://img.shields.io/badge/spec-v2.1-lightgrey)](./SPEC.md)
+[![Tests](https://img.shields.io/badge/tests-344-green)](./tests/fixtures/)
 [![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](./LICENSE)
 
 This repository contains the official specification for **Token-Oriented Object Notation (TOON)**, a compact, human-readable encoding of the JSON data model for LLM prompts. It provides a lossless serialization of the same objects, arrays, and primitives as JSON, but in a syntax that minimizes tokens and makes structure easy for models to follow.
@@ -10,7 +10,7 @@ This repository contains the official specification for **Token-Oriented Object
 
 [→ Read the full specification (SPEC.md)](./SPEC.md)
 
-- **Version:** 2.0 (2025-11-10)
+- **Version:** 2.1 (2025-11-23)
 - **Status:** Working Draft
 - **License:** MIT
 
diff --git a/SPEC.md b/SPEC.md
@@ -2,9 +2,9 @@
 
 ## Token-Oriented Object Notation
 
-**Version:** 2.0
+**Version:** 2.1
 
-**Date:** 2025-11-10
+**Date:** 2025-11-23
 
 **Status:** Working Draft
 
@@ -20,7 +20,7 @@ Token-Oriented Object Notation (TOON) is a line-oriented, indentation-based text
 
 ## Status of This Document
 
-This document is a Working Draft v2.0 and may be updated, replaced, or obsoleted. Implementers should monitor the canonical repository at https://github.com/toon-format/spec for changes.
+This document is a Working Draft v2.1 and may be updated, replaced, or obsoleted. Implementers should monitor the canonical repository at https://github.com/toon-format/spec for changes.
 
 This specification is stable for implementation but not yet finalized. Breaking changes may occur in future major versions.
 
@@ -499,7 +499,15 @@ Decoding:
 For an object appearing as a list item:
 
 - Empty object list item: a single "-" at the list-item indentation level.
-- First field on the hyphen line:
+- Encoding selection (normative):
+  - When an object has **exactly one field** and that field encodes to a tabular array, encoders SHOULD use the compact form with the tabular header on the hyphen line:
+    - Tabular array: - key[N<delim?>]{fields}:
+      - Followed by tabular rows at depth +1 (relative to the hyphen line).
+  - For all other cases (multiple fields, or single non-tabular field), encoders SHOULD emit a bare hyphen on its own line:
+    - Bare hyphen: -
+    - All fields appear at depth +1 under the hyphen line in encounter order, using normal object field rules (Section 8).
+    - When a field is a tabular array, its header appears at depth +1 and its rows at depth +2 (relative to the hyphen line).
+- First field on the hyphen line (legacy encoding, still valid for decoding):
   - Primitive: - key: value
   - Primitive array: - key[M<delim?>]: v1<delim>…
   - Tabular array: - key[N<delim?>]{fields}:
@@ -508,7 +516,7 @@ For an object appearing as a list item:
     - Followed by list items at depth +1.
   - Object: - key:
     - Nested object fields appear at depth +2 (i.e., one deeper than subsequent sibling fields of the same list item).
-- Remaining fields of the same object appear at depth +1 under the hyphen line in encounter order, using normal object field rules.
+  - Remaining fields of the same object appear at depth +1 under the hyphen line in encounter order, using normal object field rules.
 
 Decoding:
 - The first field is parsed from the hyphen line. If it is a nested object (- key:), nested fields are at +2 relative to the hyphen line; subsequent fields of the same list item are at +1.
@@ -992,12 +1000,15 @@ items[2]:
 Nested tabular inside a list item:
 ```
 items[1]:
-  - users[2]{id,name}:
-    1,Ada
-    2,Bob
+  -
+    users[2]{id,name}:
+      1,Ada
+      2,Bob
     status: active
 ```
 
+Note: Encoders use this format (bare hyphen with all fields indented) for every list-item object except one whose only field is a tabular array. Older encodings may place the first field on the hyphen line; decoders accept both forms.
+
 Delimiter variations:
 ```
 items[2	]{sku	name	qty	price}:
@@ -1222,52 +1233,39 @@ Note: Host-type normalization tests (e.g., BigInt, Date, Set, Map) are language-
 
 ## Appendix D: Document Changelog (Informative)
 
+This appendix summarizes major changes between spec versions. For the complete changelog, see [`CHANGELOG.md`](./CHANGELOG.md) in the specification repository.
+
+### v2.1 (2025-11-23)
+
+- Tightened canonical encoding for objects as list items (§10): compact `- key[N]{fields}:` only when the object's single field is a tabular array, bare `-` with all fields at depth +1 in every other case, to improve visual consistency and LLM readability.
+
 ### v2.0 (2025-11-10)
 
-- Breaking change: Length marker (`#`) prefix in array headers has been completely removed from the specification.
-- The `[#N]` format is no longer valid syntax. All array headers MUST use `[N]` format only.
-- Encoders MUST NOT emit `[#N]` format.
-- Decoders MUST NOT accept `[#N]` format (breaking change from v1.5).
-- Removed all references to length marker from terminology, grammar, conformance requirements, and parsing helpers.
+- Removed `[#N]` length-marker syntax from array headers; `[N]` is now the only valid form.
 
 ### v1.5 (2025-11-08)
 
-- Added optional key folding for encoders: `keyFolding='safe'` mode with `flattenDepth` control (§13.4).
-- Added optional path expansion for decoders: `expandPaths='safe'` mode with conflict resolution tied to existing `strict` option (§13.4).
-- Defined safe-mode requirements for folding: IdentifierSegment validation, no path separator in segments, collision avoidance, no quoting required (§7.3, §13.4).
-- Specified deep-merge semantics for expansion: recursive merge for objects; conflict policy (error in strict mode, LWW when strict=false) for non-objects (§13.4).
-- Added strict-mode error category for path expansion conflicts (§14.5).
-- Both features default to OFF; fully backward-compatible.
+- Added optional key folding (`keyFolding="safe"`) and path expansion (`expandPaths="safe"`) with deep-merge semantics and strict-mode conflict handling (§13.4, §14.5).
 
 ### v1.4 (2025-11-05)
 
-- Removed JavaScript-specific normalization details; replaced with language-agnostic requirements (Section 3).
-- Defined canonical number format for encoders and decoder acceptance rules (Section 2).
-- Added Appendix G with host-type normalization examples for Go, JavaScript, Python, and Rust.
-- Clarified non-strict mode tab handling as implementation-defined (Section 12).
-- Expanded regex notation for cross-language clarity (Section 7.3).
+- Generalized normalization and numeric canonicalization rules, and added host-type normalization guidance (Appendix G).
 
 ### v1.3 (2025-10-31)
 
-- Added numeric precision requirements: JavaScript implementations SHOULD use Number.toString() precision (15-17 digits), all implementations MUST preserve round-trip fidelity (Section 2).
-- Added RFC 5234 core rules (ALPHA, DIGIT, DQUOTE, HTAB, LF, SP) to ABNF grammar definitions (Section 6).
+- Added numeric precision guidance and ABNF core rules for headers and keys (§2, §6).
 
 ### v1.2 (2025-10-29)
 
-- Clarified delimiter scoping behavior between array headers.
-- Tightened strict-mode indentation requirements: leading spaces MUST be exact multiples of indentSize; tabs in indentation MUST error.
-- Defined blank-line and trailing-newline decoding behavior with explicit skipping rules outside arrays.
-- Clarified hyphen-based quoting: "-" or any string starting with "-" MUST be quoted.
-- Clarified BigInt normalization: values outside safe integer range are converted to quoted decimal strings.
-- Clarified row/key disambiguation: uses first unquoted delimiter vs colon position.
+- Tightened delimiter scoping, indentation, blank-line handling, hyphen-based quoting, BigInt normalization, and row/key disambiguation rules (§2, §9, §11-§12).
 
 ### v1.1 (2025-10-29)
 
-Added strict-mode rules, delimiter-aware parsing, and decoder options (indent, strict).
+- Introduced strict-mode validation, delimiter-aware parsing, and decoder options (`indent`, `strict`).
 
 ### v1.0 (2025-10-28)
 
-Initial encoding, normalization, and conformance rules.
+- Initial specification: encoding normalization, decoding interpretation, and conformance requirements.
 
 ## Appendix E: Acknowledgments and License
 
diff --git a/tests/fixtures/decode/arrays-nested.json b/tests/fixtures/decode/arrays-nested.json
@@ -1,5 +1,5 @@
 {
-  "version": "1.4",
+  "version": "2.1",
   "category": "decode",
   "description": "Nested and mixed array decoding - list format, arrays of arrays, root arrays, mixed types",
   "tests": [
@@ -52,7 +52,7 @@
       "specSection": "9.4"
     },
     {
-      "name": "parses nested tabular arrays as first field on hyphen line",
+      "name": "parses nested tabular arrays as first field on hyphen line (legacy)",
       "input": "items[1]:\n  - users[2]{id,name}:\n    1,Ada\n    2,Bob\n    status: active",
       "expected": {
         "items": [
@@ -65,7 +65,26 @@
           }
         ]
       },
-      "specSection": "10"
+      "specSection": "10",
+      "note": "Still valid for backward compatibility"
+    },
+    {
+      "name": "parses nested tabular arrays in list items with bare hyphen",
+      "input": "items[1]:\n  -\n    users[2]{id,name}:\n      1,Ada\n      2,Bob\n    status: active",
+      "expected": {
+        "items": [
+          {
+            "users": [
+              { "id": 1, "name": "Ada" },
+              { "id": 2, "name": "Bob" }
+            ],
+            "status": "active"
+          }
+        ]
+      },
+      "specSection": "10",
+      "minSpecVersion": "2.1",
+      "note": "Canonical v2.1+ encoding (bare hyphen with all fields indented)"
     },
     {
       "name": "parses objects containing arrays (including empty arrays) in list format",
diff --git a/tests/fixtures/encode/arrays-nested.json b/tests/fixtures/encode/arrays-nested.json
@@ -1,5 +1,5 @@
 {
-  "version": "1.4",
+  "version": "2.1",
   "category": "encode",
   "description": "Nested and mixed array encoding - arrays of arrays, mixed type arrays, root arrays",
   "tests": [
@@ -50,14 +50,16 @@
     {
       "name": "encodes root-level array of non-uniform objects in list format",
       "input": [{ "id": 1 }, { "id": 2, "name": "Ada" }],
-      "expected": "[2]:\n  - id: 1\n  - id: 2\n    name: Ada",
-      "specSection": "9.4"
+      "expected": "[2]:\n  -\n    id: 1\n  -\n    id: 2\n    name: Ada",
+      "specSection": "9.4",
+      "minSpecVersion": "2.1"
     },
     {
       "name": "encodes root-level array mixing primitive, object, and array of objects in list format",
       "input": ["summary", { "id": 1, "name": "Ada" }, [{ "id": 2 }, { "status": "draft" }]],
-      "expected": "[3]:\n  - summary\n  - id: 1\n    name: Ada\n  - [2]:\n    - id: 2\n    - status: draft",
-      "specSection": "9.4"
+      "expected": "[3]:\n  - summary\n  -\n    id: 1\n    name: Ada\n  - [2]:\n    -\n      id: 2\n    -\n      status: draft",
+      "specSection": "9.4",
+      "minSpecVersion": "2.1"
     },
     {
       "name": "encodes root-level arrays of arrays",
@@ -90,16 +92,18 @@
       "input": {
         "items": [1, { "a": 1 }, "text"]
       },
-      "expected": "items[3]:\n  - 1\n  - a: 1\n  - text",
-      "specSection": "9.4"
+      "expected": "items[3]:\n  - 1\n  -\n    a: 1\n  - text",
+      "specSection": "9.4",
+      "minSpecVersion": "2.1"
     },
     {
       "name": "uses list format for arrays mixing objects and arrays",
       "input": {
         "items": [{ "a": 1 }, [1, 2]]
       },
-      "expected": "items[2]:\n  - a: 1\n  - [2]: 1,2",
-      "specSection": "9.4"
+      "expected": "items[2]:\n  -\n    a: 1\n  - [2]: 1,2",
+      "specSection": "9.4",
+      "minSpecVersion": "2.1"
     }
   ]
 }
diff --git a/tests/fixtures/encode/arrays-objects.json b/tests/fixtures/encode/arrays-objects.json