senax-encoder Binary Format Specification

Version: 1.1
Date: 2025
Status: Draft

Table of Contents

  1. Overview
  2. Format Basics
  3. Tag System
  4. Data Type Specifications
  5. Struct and Enum Encoding
  6. Schema Evolution
  7. Implementation Notes

1. Overview

The senax-encoder binary format is designed for efficient, compact serialization with a focus on forward and backward compatibility. Each value is tagged with a type identifier, enabling schema evolution and version compatibility.

Key Design Principles

  • Compact Representation: Variable-length encoding for common values
  • Self-describing: Each value includes type information
  • Version Resilience: Unknown fields/types can be safely skipped
  • Little Endian: Consistent byte order across platforms

2. Format Basics

2.1 Byte Order

All multi-byte integers are encoded in little-endian format.

2.2 Basic Structure

All encoded values follow this pattern:

[TAG:u8] [DATA:variable]

Where:

  • TAG is a single byte identifying the type and encoding method
  • DATA is the encoded value; its format depends on the tag

2.3 Variable-Length Integer Encoding

For optimal space efficiency, integers use variable-length encoding:

  • Values 0-127: Encoded directly in the tag byte
  • Larger values: Use dedicated tag + payload encoding
  • Signed integers: Negative values use bit-inverted encoding (not ZigZag)

2.4 Optimized Field ID Encoding

Field IDs and variant IDs use an optimized encoding scheme for space efficiency:

Encoding Rules:

  • Field IDs 1-250: Encoded as single u8 byte
  • Field IDs 251+: Encoded as 0xFF marker byte followed by u64 little-endian
  • Terminator: Encoded as 0x00 byte to mark end of fields

Format:

// Small field ID (1-250)
[field_id:u8] [field_value]

// Large field ID (251+)  
[0xFF] [field_id:u64_le] [field_value]

// Terminator
[0x00]

Size Benefits:

  • Most field IDs (1-250) use only 1 byte instead of 8 bytes
  • Terminator uses 1 byte instead of 8 bytes
  • Large field IDs (rare) use 9 bytes (1 marker + 8 data)

Examples:

field_id=1   -> [0x01]              // Direct u8 encoding
field_id=250 -> [0xFA]              // Direct u8 encoding (250 = 0xFA)
field_id=251 -> [0xFF, 0xFB, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00]  // Marker + u64_le
terminator   -> [0x00]              // End of fields

This optimization significantly reduces binary size for typical structs and enums while maintaining full u64 field ID range support.
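
The ID rules above can be sketched as a pair of helpers. This is a minimal illustration rather than the crate's implementation: the names `write_id`, `write_terminator`, and `read_id` are hypothetical, and rejecting lead bytes 0xFB-0xFE on read is an assumption, since the rules above do not assign them.

```rust
/// Marker byte that introduces a full u64 ID.
const ID_MARKER: u8 = 0xFF;
/// Byte that ends a field list.
const TERMINATOR: u8 = 0x00;

/// Appends a field or variant ID (hypothetical helper name).
/// ID 0 is refused because the byte 0x00 is the terminator.
fn write_id(id: u64, out: &mut Vec<u8>) {
    assert!(id != 0, "0x00 is reserved for the terminator");
    if id <= 250 {
        out.push(id as u8); // small ID: one byte
    } else {
        out.push(ID_MARKER); // large ID: marker + u64 little-endian
        out.extend_from_slice(&id.to_le_bytes());
    }
}

/// Appends the end-of-fields terminator.
fn write_terminator(out: &mut Vec<u8>) {
    out.push(TERMINATOR);
}

/// Reads an ID and returns `(id, bytes_consumed)`; an ID of 0 is the terminator.
/// Lead bytes 0xFB-0xFE are rejected (an assumption, see the text above).
fn read_id(buf: &[u8]) -> Option<(u64, usize)> {
    match *buf.first()? {
        ID_MARKER => {
            let mut le = [0u8; 8];
            le.copy_from_slice(buf.get(1..9)?);
            Some((u64::from_le_bytes(le), 9))
        }
        0xFB..=0xFE => None,
        small => Some((small as u64, 1)),
    }
}
```

A reader loops on `read_id` until it returns an ID of 0, which keeps the common case at one byte per field.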

3. Tag System

3.1 Tag Assignment

Tags are assigned in ranges for semantic grouping:

pub const TAG_ZERO: u8 = 0;
pub const TAG_ONE: u8 = 1;
// 2-127: Direct encoding for values 2-127
pub const TAG_U8_127: u8 = 127;      // Value 127
// Option markers
pub const TAG_NONE: u8 = 128;
pub const TAG_SOME: u8 = 129;
// Extended integer types
pub const TAG_U8: u8 = 131;
pub const TAG_U16: u8 = 132;
pub const TAG_U32: u8 = 133;
pub const TAG_U64: u8 = 134;
pub const TAG_U128: u8 = 135;
pub const TAG_NEGATIVE: u8 = 136;
// Floating point
pub const TAG_F32: u8 = 137;
pub const TAG_F64: u8 = 138;
// Strings
pub const TAG_STRING_BASE: u8 = 139;  // 139-179: Short strings (0-40 bytes)
pub const TAG_STRING_LONG: u8 = 180;
// Collections and containers
pub const TAG_BINARY: u8 = 181;
pub const TAG_STRUCT_UNIT: u8 = 182;
pub const TAG_STRUCT_NAMED: u8 = 183;
pub const TAG_STRUCT_UNNAMED: u8 = 184;
pub const TAG_ENUM: u8 = 185;
pub const TAG_ENUM_NAMED: u8 = 186;
pub const TAG_ENUM_UNNAMED: u8 = 187;
pub const TAG_ARRAY_VEC_SET_BASE: u8 = 188;  // 188-193: Short arrays (0-5 elements)
pub const TAG_ARRAY_VEC_SET_LONG: u8 = 194;
pub const TAG_TUPLE: u8 = 195;
pub const TAG_MAP: u8 = 196;
// Extended types (optional features)
pub const TAG_CHRONO_DATETIME: u8 = 197;
pub const TAG_CHRONO_NAIVE_DATE: u8 = 198;
pub const TAG_CHRONO_NAIVE_TIME: u8 = 199;
pub const TAG_DECIMAL: u8 = 200;
pub const TAG_UUID: u8 = 201;  // Shared by UUID and ULID
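
As a reading aid, the ranges above can be folded into a single lookup. The sketch below is illustrative only: `TagKind` and `classify` are hypothetical names, the grouping is a simplification of the constant list, and tags 202-207 are taken from the serde_json section of this document. Tags that this specification does not list (130 and 208 and above) are reported as `NotListed`.

```rust
/// Coarse category of a tag byte (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum TagKind {
    DirectInt(u8),      // 0-127: the tag byte is the value itself
    OptionMarker,       // TAG_NONE, TAG_SOME
    Integer,            // TAG_U8 .. TAG_U128, TAG_NEGATIVE
    Float,              // TAG_F32, TAG_F64
    ShortString(usize), // byte length carried in the tag
    LongString,
    Binary,
    Struct,
    Enum,
    ShortSeq(usize),    // element count carried in the tag
    LongSeq,
    Tuple,
    Map,
    Extended,           // chrono, decimal, UUID/ULID
    Json,               // 202-207, see the serde_json section
    NotListed,          // not assigned anywhere in this document
}

fn classify(tag: u8) -> TagKind {
    match tag {
        0..=127 => TagKind::DirectInt(tag),
        128 | 129 => TagKind::OptionMarker,
        131..=136 => TagKind::Integer,
        137 | 138 => TagKind::Float,
        139..=179 => TagKind::ShortString((tag - 139) as usize),
        180 => TagKind::LongString,
        181 => TagKind::Binary,
        182..=184 => TagKind::Struct,
        185..=187 => TagKind::Enum,
        188..=193 => TagKind::ShortSeq((tag - 188) as usize),
        194 => TagKind::LongSeq,
        195 => TagKind::Tuple,
        196 => TagKind::Map,
        197..=201 => TagKind::Extended,
        202..=207 => TagKind::Json,
        _ => TagKind::NotListed,
    }
}
```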

4. Data Type Specifications

4.1 Boolean

Encoding:

  • false: TAG_ZERO (0x00)
  • true: TAG_ONE (0x01)

Example:

true  -> [0x01]
false -> [0x00]

4.2 Unsigned Integers

Compact Encoding (0-127):

value -> [TAG_ZERO + value]

Extended Encoding:

u8    -> [TAG_U8] [value-128:u8]        (range: 128-383)
u16   -> [TAG_U16] [value:u16_le]       (range: 384-65535)
u32   -> [TAG_U32] [value:u32_le]       (range: 65536-4294967295)
u64   -> [TAG_U64] [value:u64_le]       (range: 4294967296-18446744073709551615)
u128  -> [TAG_U128] [value:u128_le]     (range: 18446744073709551616+)

Size Selection:

  • 0-127: Direct encoding (1 byte total)
  • 128-383: u8 encoding (2 bytes total) - stores value-128
  • 384-65535: u16 encoding (3 bytes total)
  • etc.

Examples:

42     -> [0x2A]           // TAG_ZERO + 42 = 0 + 42 = 42 = 0x2A
128    -> [0x83, 0x00]     // TAG_U8, 128-128=0
255    -> [0x83, 0x7F]     // TAG_U8, 255-128=127
383    -> [0x83, 0xFF]     // TAG_U8, 383-128=255
384    -> [0x84, 0x80, 0x01]  // TAG_U16, 384 in LE
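
The size selection can be sketched as one function over `u128`. This is a minimal illustration, assuming the encoder always picks the smallest form listed above; `encode_uint` is a hypothetical name, not the crate's API.

```rust
const TAG_U8: u8 = 131;
const TAG_U16: u8 = 132;
const TAG_U32: u8 = 133;
const TAG_U64: u8 = 134;
const TAG_U128: u8 = 135;

/// Appends `value` using the smallest encoding from the table above.
fn encode_uint(value: u128, out: &mut Vec<u8>) {
    match value {
        // 0-127: the value is the tag byte itself.
        0..=127 => out.push(value as u8),
        // 128-383: one payload byte holding value - 128.
        128..=383 => {
            out.push(TAG_U8);
            out.push((value - 128) as u8);
        }
        384..=0xFFFF => {
            out.push(TAG_U16);
            out.extend_from_slice(&(value as u16).to_le_bytes());
        }
        0x1_0000..=0xFFFF_FFFF => {
            out.push(TAG_U32);
            out.extend_from_slice(&(value as u32).to_le_bytes());
        }
        0x1_0000_0000..=0xFFFF_FFFF_FFFF_FFFF => {
            out.push(TAG_U64);
            out.extend_from_slice(&(value as u64).to_le_bytes());
        }
        _ => {
            out.push(TAG_U128);
            out.extend_from_slice(&value.to_le_bytes());
        }
    }
}
```

The offset of 128 in the `TAG_U8` form is what lets a single payload byte reach 383 instead of stopping at 255.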

4.3 Signed Integers

Special Cases:

  • 0: TAG_ZERO (0x00)
  • 1: TAG_ONE (0x01)

Encoding Rule:

  • 0 and positive values: Encoded as unsigned integers
  • Negative values: TAG_NEGATIVE (0x88) + bit-inverted encoding

Format:

// 0, positive values
[value:variable_uint]
// Negative values
[TAG_NEGATIVE] [(!n):variable_uint]

Examples:

0      -> [0x00]              // TAG_ZERO
1      -> [0x01]              // TAG_ONE
2      -> [0x02]              // TAG_ZERO+2
-1     -> [0x88, 0x00]        // TAG_NEGATIVE, !(-1)=0 -> TAG_ZERO
-2     -> [0x88, 0x01]        // TAG_NEGATIVE, !(-2)=1 -> TAG_ONE
-128   -> [0x88, 0x7F]        // TAG_NEGATIVE, !(-128)=127 -> TAG_ZERO+127
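
The bit inversion itself is a one-line transform. The sketch below shows only the mapping between a negative `i64` and the unsigned payload that follows TAG_NEGATIVE; the payload is then written with the variable-length unsigned encoding from section 4.2. The names `invert` and `restore` are hypothetical, and rejecting payloads above `i64::MAX` is an assumption about how a decoder for `i64` would treat out-of-range data.

```rust
/// Maps a negative value to the unsigned payload written after TAG_NEGATIVE.
/// -1 becomes 0, -2 becomes 1, and so on (bitwise NOT, not ZigZag).
fn invert(n: i64) -> u64 {
    debug_assert!(n < 0, "only negative values take the TAG_NEGATIVE path");
    !n as u64
}

/// Recovers the negative value from a decoded payload.
/// Payloads that do not fit in an i64 are rejected (assumption).
fn restore(payload: u64) -> Option<i64> {
    if payload > i64::MAX as u64 {
        None
    } else {
        Some(!(payload as i64))
    }
}
```

Because small magnitudes map to small payloads, values from -1 down to -128 fit in two bytes in total.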

4.4 Floating Point

Format:

f32 -> [TAG_F32] [value:f32_le]
f64 -> [TAG_F64] [value:f64_le]

Cross-Type Decoding:

  • f64 can be decoded as f32 (with potential precision loss)
  • f32 to f64 cross-decoding is not supported due to precision ambiguity
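
A minimal sketch of the float layout and of the cross-type rule stated above. The function names are hypothetical, and the decoders accept only the two float tags; whether the real decoder accepts other tags for a float target is not covered by this section.

```rust
const TAG_F32: u8 = 137;
const TAG_F64: u8 = 138;

fn encode_f32(value: f32, out: &mut Vec<u8>) {
    out.push(TAG_F32);
    out.extend_from_slice(&value.to_le_bytes());
}

fn encode_f64(value: f64, out: &mut Vec<u8>) {
    out.push(TAG_F64);
    out.extend_from_slice(&value.to_le_bytes());
}

/// Target type f32: accepts TAG_F32, or TAG_F64 narrowed with `as`
/// (this is where precision can be lost).
fn decode_as_f32(buf: &[u8]) -> Option<f32> {
    match *buf.first()? {
        TAG_F32 => {
            let mut le = [0u8; 4];
            le.copy_from_slice(buf.get(1..5)?);
            Some(f32::from_le_bytes(le))
        }
        TAG_F64 => {
            let mut le = [0u8; 8];
            le.copy_from_slice(buf.get(1..9)?);
            Some(f64::from_le_bytes(le) as f32)
        }
        _ => None,
    }
}

/// Target type f64: accepts TAG_F64 only, following the rule above.
fn decode_as_f64(buf: &[u8]) -> Option<f64> {
    if *buf.first()? != TAG_F64 {
        return None;
    }
    let mut le = [0u8; 8];
    le.copy_from_slice(buf.get(1..9)?);
    Some(f64::from_le_bytes(le))
}
```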

4.5 Strings

Short Strings (0-40 bytes):

[TAG_STRING_BASE + length] [utf8_bytes]

Long Strings:

[TAG_STRING_LONG] [length:variable_uint] [utf8_bytes]

Examples:

""      -> [0x8B]                    // TAG_STRING_BASE + 0
"hi"    -> [0x8D, 0x68, 0x69]       // TAG_STRING_BASE + 2, "hi"
"aaa...a" (41 x "a") -> [0xB4, 0x29, 0x61, ...]  // TAG_STRING_LONG, length=41 (0x29), then 41 bytes
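
The short-string form can be sketched as follows. This illustration covers only tags 139-179; the long form additionally needs the variable-length integer from section 4.2 for its length. `encode_short_str` and `decode_short_str` are hypothetical names. Note that the length counts UTF-8 bytes, not characters.

```rust
const TAG_STRING_BASE: u8 = 139;
const SHORT_STRING_MAX: u8 = 40;

/// Encodes a string of at most 40 UTF-8 bytes; returns None for longer input,
/// which would need the TAG_STRING_LONG form instead.
fn encode_short_str(s: &str) -> Option<Vec<u8>> {
    if s.len() > SHORT_STRING_MAX as usize {
        return None;
    }
    let mut out = Vec::with_capacity(1 + s.len());
    out.push(TAG_STRING_BASE + s.len() as u8); // length lives in the tag
    out.extend_from_slice(s.as_bytes());
    Some(out)
}

/// Decodes a short string and returns `(text, bytes_consumed)`.
/// Fails on other tags, truncated input, or invalid UTF-8.
fn decode_short_str(buf: &[u8]) -> Option<(&str, usize)> {
    let tag = *buf.first()?;
    if !(TAG_STRING_BASE..=TAG_STRING_BASE + SHORT_STRING_MAX).contains(&tag) {
        return None;
    }
    let len = (tag - TAG_STRING_BASE) as usize;
    let bytes = buf.get(1..1 + len)?;
    let text = std::str::from_utf8(bytes).ok()?;
    Some((text, 1 + len))
}
```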

4.6 Option Types

Format:

None    -> [TAG_NONE]     // 0x80 (128)
Some(v) -> [TAG_SOME] [encoded_value]  // 0x81 (129) + value

4.7 Collections

Arrays, Vectors, Sets

Short Collections (0-5 elements):

[TAG_ARRAY_VEC_SET_BASE + count] [element1] [element2] ...

Long Collections:

[TAG_ARRAY_VEC_SET_LONG] [count:variable_uint] [element1] [element2] ...
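
A minimal sketch of the two collection headers, using a slice of `u64` as the element type. The helper names are hypothetical, and `encode_uint` here is a u64-only restatement of the rules in section 4.2 so that the example is self-contained.

```rust
const TAG_ARRAY_VEC_SET_BASE: u8 = 188;
const TAG_ARRAY_VEC_SET_LONG: u8 = 194;

/// Variable-length unsigned integer (section 4.2), limited to u64.
fn encode_uint(value: u64, out: &mut Vec<u8>) {
    match value {
        0..=127 => out.push(value as u8),
        128..=383 => out.extend_from_slice(&[131, (value - 128) as u8]), // TAG_U8
        384..=0xFFFF => {
            out.push(132); // TAG_U16
            out.extend_from_slice(&(value as u16).to_le_bytes());
        }
        0x1_0000..=0xFFFF_FFFF => {
            out.push(133); // TAG_U32
            out.extend_from_slice(&(value as u32).to_le_bytes());
        }
        _ => {
            out.push(134); // TAG_U64
            out.extend_from_slice(&value.to_le_bytes());
        }
    }
}

/// Encodes a sequence: counts 0-5 go into the tag, larger counts follow
/// TAG_ARRAY_VEC_SET_LONG as a variable-length integer.
fn encode_u64_seq(items: &[u64], out: &mut Vec<u8>) {
    if items.len() <= 5 {
        out.push(TAG_ARRAY_VEC_SET_BASE + items.len() as u8);
    } else {
        out.push(TAG_ARRAY_VEC_SET_LONG);
        encode_uint(items.len() as u64, out);
    }
    for &item in items {
        encode_uint(item, out);
    }
}
```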

Maps

Format:

[TAG_MAP] [count:variable_uint] [key1] [value1] [key2] [value2] ...

Tuples

Format:

[TAG_TUPLE] [element_count:variable_uint] [element1] [element2] ...

4.8 Binary Data

Vec<u8> and Bytes:

[TAG_BINARY] [length:variable_uint] [raw_bytes]

4.9 Extended Types (Feature-Dependent)

DateTime (chrono feature)

Format:

[TAG_CHRONO_DATETIME] [seconds:i64] [nanos:u32]

All DateTime types (UTC, Local) are normalized to UTC for storage.

NaiveDate (chrono feature)

Format:

[TAG_CHRONO_NAIVE_DATE] [days_from_epoch:i64]

Epoch: 1970-01-01

NaiveTime (chrono feature)

Format:

[TAG_CHRONO_NAIVE_TIME] [seconds_from_midnight:u32] [nanoseconds:u32]

NaiveDateTime (chrono feature)

Format:

[TAG_CHRONO_NAIVE_DATETIME] [seconds:i64] [nanos:u32]

Stored as seconds and nanoseconds since the Unix epoch (1970-01-01 00:00:00 UTC).

Decimal (rust_decimal feature)

Format:

[TAG_DECIMAL] [mantissa:i128] [scale:u32]

UUID/ULID (uuid/ulid features)

Format:

[TAG_UUID] [value:u128_le]

Note: UUID and ULID share the same tag and are binary compatible at the encoding level.

4.10 serde_json::Value (Feature: serde_json)

Dynamic JSON values are supported when the serde_json feature is enabled. Each JSON value variant has its own tag:

  • TAG_JSON_NULL (202): JSON null value
  • TAG_JSON_BOOL (203): JSON boolean (uses existing boolean encoding)
  • TAG_JSON_NUMBER (204): JSON number with type preservation
  • TAG_JSON_STRING (205): JSON string (uses existing string encoding)
  • TAG_JSON_ARRAY (206): JSON array
  • TAG_JSON_OBJECT (207): JSON object

JSON Number Encoding

JSON numbers are encoded with type preservation to maintain integer/float distinction:

Format: TAG_JSON_NUMBER + type_marker + value

  • type_marker = 0: Unsigned integer, followed by u64 encoding
  • type_marker = 1: Signed integer, followed by i64 encoding
  • type_marker = 2: Float, followed by f64 encoding

Examples:

  • 42 (integer) → [204, 0, ...] (TAG_JSON_NUMBER, unsigned integer marker, u64 encoding)
  • 3.14159 (float) → [204, 2, ...] (TAG_JSON_NUMBER, float marker, f64 encoding)

JSON Array Encoding

Format: TAG_JSON_ARRAY + length + elements...

JSON Object Encoding

Format: TAG_JSON_OBJECT + length + (key, value)...

Keys are encoded as strings, values are recursively encoded as JSON values.

Examples:

  • null → [202]
  • true → [203, 1] (TAG_JSON_BOOL, TAG_ONE)
  • "hello" → [205, 144, ...] (TAG_JSON_STRING, string encoding: TAG_STRING_BASE + 5, then the UTF-8 bytes)
  • [] → [206, 0] (TAG_JSON_ARRAY, length 0)
  • {} → [207, 0] (TAG_JSON_OBJECT, length 0)

5. Struct and Enum Encoding

5.1 Unit Structs

Format:

[TAG_STRUCT_UNIT]

5.2 Named Field Structs

Format:

[TAG_STRUCT_NAMED] [field_id_optimized] [field_value] ... [0x00]

Field Encoding Rules:

  • Each field is encoded as [field_id_optimized] [field_value]
  • Field IDs are derived from field names (CRC64(ECMA-182) hash) or custom #[senax(id=n)] attributes
  • Field IDs 1-250 are encoded as single u8 bytes
  • Field IDs 251+ are encoded as 0xFF marker + u64 little-endian
  • Optional fields with None values are omitted entirely
  • Terminator: single zero byte (0x00) marks end of fields
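
The framing rules above can be sketched with a helper that takes `(field_id, already_encoded_value)` pairs. This is an illustration under stated assumptions: `write_id` and `encode_named_struct` are hypothetical names, field values are passed in pre-encoded, and a `None` entry stands for an absent optional field, which is simply left out.

```rust
const TAG_STRUCT_NAMED: u8 = 183;

/// Writes a field ID: 1-250 as one byte, larger IDs as 0xFF + u64 little-endian.
/// ID 0 is refused because 0x00 is the terminator.
fn write_id(id: u64, out: &mut Vec<u8>) {
    assert!(id != 0, "0x00 is reserved for the terminator");
    if id <= 250 {
        out.push(id as u8);
    } else {
        out.push(0xFF);
        out.extend_from_slice(&id.to_le_bytes());
    }
}

/// Frames a named-field struct: tag, then ID/value pairs, then the terminator.
fn encode_named_struct(fields: &[(u64, Option<&[u8]>)]) -> Vec<u8> {
    let mut out = vec![TAG_STRUCT_NAMED];
    for &(id, value) in fields {
        // Absent optional fields produce no bytes at all.
        if let Some(bytes) = value {
            write_id(id, &mut out);
            out.extend_from_slice(bytes);
        }
    }
    out.push(0x00); // terminator
    out
}
```

For example, a struct with field 1 set to 42, field 2 absent, and field 3 set to "hi" would come out as `[0xB7, 0x01, 0x2A, 0x03, 0x8D, 0x68, 0x69, 0x00]` under these assumptions.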

5.3 Unnamed Field Structs (Tuples)

Format:

[TAG_STRUCT_UNNAMED] [field_count:variable_uint] [field1] [field2] ...

5.4 Enums

Unit Variants

Format:

[TAG_ENUM] [variant_id_optimized]

Named Field Variants

Format:

[TAG_ENUM_NAMED] [variant_id_optimized] [field_id_optimized] [field_value] ... [0x00]

Unnamed Field Variants

Format:

[TAG_ENUM_UNNAMED] [variant_id_optimized] [field_count:variable_uint] [field1] [field2] ...

Variant ID Assignment:

  • Derived from variant name (CRC64 hash) or custom #[senax(id=n)] attributes
  • Variant IDs 1-250 are encoded as single u8 bytes
  • Variant IDs 251+ are encoded as 0xFF marker + u64 little-endian
  • Must be stable across versions for compatibility

6. Schema Evolution

6.1 Forward Compatibility

Adding Fields:

  • New optional fields: Automatically handled (default to None)
  • New required fields: Must have defaults or be made optional
    • In addition to having a Rust default value, you must explicitly annotate the field with #[senax(default)] to ensure forward/backward compatibility.
  • Fields with #[senax(skip_default)]: Encoded only when the value differs from the default; when the field is missing during decode, the default value is used automatically

Adding Enum Variants:

  • Use custom #[senax(id=n)] for stable IDs
  • Unknown variants cause decode errors

6.2 Backward Compatibility

Removing Fields:

  • Unknown field IDs are automatically skipped during decoding
  • No decoder changes required

Removing Enum Variants:

  • May cause decode errors if old data contains removed variants
  • Consider deprecation strategy

6.3 Field Reordering

Field order changes are automatically handled due to ID-based encoding.

6.4 Type Changes

Compatible Changes:

  • u32 → i64 (if values fit)
  • f32 → f64
  • u32 → Option<u32>

Incompatible Changes:

  • String → u32
  • Vec<T> → HashMap<K,V>
  • None → Required

7. Implementation Notes

7.1 Skip Function

Decoders must implement a skip_value() function that can skip unknown tagged values without parsing them. This enables forward compatibility.
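
A skip function for a subset of the tags might look like the sketch below. It is not the crate's implementation: it returns the encoded size of one value, handles only integers, floats, options, strings, binary data, collections, tuples, maps, and unit, named, and unnamed structs, and reports every other tag as `None`. The helper names are hypothetical, and rejecting field-ID lead bytes 0xFB-0xFE is an assumption.

```rust
/// Reads a variable-length unsigned integer (section 4.2), up to u64.
/// Returns `(value, bytes_consumed)`.
fn read_uint(buf: &[u8]) -> Option<(u64, usize)> {
    let tag = *buf.first()?;
    let width = match tag {
        0..=127 => return Some((tag as u64, 1)), // value stored in the tag
        131 => 1,                                // TAG_U8 (payload is value - 128)
        132 => 2,                                // TAG_U16
        133 => 4,                                // TAG_U32
        134 => 8,                                // TAG_U64
        _ => return None,
    };
    let mut le = [0u8; 8];
    le[..width].copy_from_slice(buf.get(1..1 + width)?);
    let raw = u64::from_le_bytes(le);
    Some((if tag == 131 { raw + 128 } else { raw }, 1 + width))
}

/// Reads a field ID (section 2.4); an ID of 0 is the terminator.
fn read_id(buf: &[u8]) -> Option<(u64, usize)> {
    match *buf.first()? {
        0xFF => {
            let mut le = [0u8; 8];
            le.copy_from_slice(buf.get(1..9)?);
            Some((u64::from_le_bytes(le), 9))
        }
        0xFB..=0xFE => None, // not assigned by section 2.4 (assumption: reject)
        small => Some((small as u64, 1)),
    }
}

/// Skips `count` consecutive values and returns their total size.
fn skip_n(buf: &[u8], count: u64) -> Option<usize> {
    let mut pos = 0;
    for _ in 0..count {
        pos += skip_value(buf.get(pos..)?)?;
    }
    Some(pos)
}

/// Returns the encoded size of the value at the start of `buf`, or `None`
/// if the data is truncated or the tag is outside this sketch's subset.
fn skip_value(buf: &[u8]) -> Option<usize> {
    let tag = *buf.first()?;
    let rest = &buf[1..];
    let payload = match tag {
        0..=127 | 128 | 182 => 0,          // direct value, TAG_NONE, TAG_STRUCT_UNIT
        129 | 136 => skip_value(rest)?,    // TAG_SOME, TAG_NEGATIVE: one nested value
        131 => 1,                          // TAG_U8
        132 => 2,                          // TAG_U16
        133 | 137 => 4,                    // TAG_U32, TAG_F32
        134 | 138 => 8,                    // TAG_U64, TAG_F64
        135 => 16,                         // TAG_U128
        139..=179 => (tag - 139) as usize, // short string: length in the tag
        180 | 181 => {                     // TAG_STRING_LONG, TAG_BINARY
            let (len, used) = read_uint(rest)?;
            used.checked_add(len as usize)?
        }
        188..=193 => skip_n(rest, (tag - 188) as u64)?, // short array/vec/set
        184 | 194 | 195 => {               // unnamed struct, long array, tuple
            let (count, used) = read_uint(rest)?;
            used + skip_n(&rest[used..], count)?
        }
        196 => {                           // TAG_MAP: `count` key/value pairs
            let (count, used) = read_uint(rest)?;
            used + skip_n(&rest[used..], count.checked_mul(2)?)?
        }
        183 => {                           // TAG_STRUCT_NAMED: pairs until 0x00
            let mut pos = 0;
            loop {
                let (id, used) = read_id(rest.get(pos..)?)?;
                pos += used;
                if id == 0 {
                    break pos;
                }
                pos += skip_value(rest.get(pos..)?)?;
            }
        }
        _ => return None,
    };
    if payload > rest.len() {
        return None; // declared length runs past the end of the input
    }
    Some(1 + payload)
}
```

A struct decoder that meets an unknown field ID can advance its cursor by the returned size and continue with the next field. The recursion depth here follows the nesting of the input, so a production decoder would likely want a depth limit for untrusted data.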

7.2 Error Handling

Decode Errors:

  • Invalid UTF-8 in strings
  • Unknown enum variants
  • Malformed data (unexpected EOF, invalid tags)
  • Type conversion failures

7.3 Endianness

All multi-byte values use little-endian encoding for consistency across platforms.


This specification defines the complete binary format for senax-encoder. Implementations should follow these rules exactly to ensure cross-version and cross-platform compatibility.