Sabot Architecture

Deep dive into Sabot's architecture, design decisions, and implementation details.

Overview

Sabot is built on three key principles:

  1. Flink-inspired semantics - Event-time, watermarks, exactly-once
  2. Python-native implementation - Clean API, easy to use
  3. Cython acceleration - Performance-critical paths in compiled code
┌─────────────────────────────────────────────────────────┐
│                  User Application                       │
│   Python code using @app.agent() and clean API         │
└─────────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────────┐
│              Sabot Python Layer                         │
│   import sabot as sb                                    │
│   - App (app.py)                                        │
│   - Agent decorators                                    │
│   - Clean API wrappers                                  │
└─────────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────────┐
│         Cython-Accelerated Core Modules                 │
│   sabot/_cython/                                        │
│   - checkpoint/ (distributed snapshots)                 │
│   - state/ (state backends)                             │
│   - time/ (watermarks, timers)                          │
│   Performance: 10-100x faster than pure Python          │
└─────────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────────┐
│                  Infrastructure                         │
│   Kafka | RocksDB | PostgreSQL | Redis                 │
└─────────────────────────────────────────────────────────┘

Core Modules

1. Checkpoint Module (sabot/_cython/checkpoint/)

Implements the Chandy-Lamport distributed snapshot algorithm for exactly-once processing semantics.

Files

File                 Description                          Performance
barrier.pyx          Checkpoint barrier markers           <1μs creation
barrier_tracker.pyx  Track barriers across channels       <10μs registration
coordinator.pyx      Orchestrate distributed checkpoints  <50μs coordination
storage.pyx          Persist checkpoint metadata          <1ms I/O
recovery.pyx         Recover from checkpoints             <10s for 10GB

How It Works

Chandy-Lamport Algorithm:

  1. Trigger: Coordinator triggers checkpoint N
  2. Broadcast: Send barrier markers to all channels
  3. Snapshot: Each operator snapshots its state when barrier arrives
  4. Align: Wait for barriers from all input channels
  5. Complete: All operators aligned → checkpoint complete

Code Example:

import sabot as sb

# Create tracker for 3 channels (Kafka partitions)
tracker = sb.BarrierTracker(num_channels=3)

# Each channel registers the barrier as it arrives
aligned = tracker.register_barrier(
    channel=0,
    checkpoint_id=1,
    total_inputs=3
)

# aligned is True once all channels have received the barrier
if aligned:
    # Inside an async operator task: take a state snapshot, then
    # persist it durably (take_snapshot/persist are placeholders)
    state_snapshot = await take_snapshot()
    await persist(checkpoint_id, state_snapshot)

Performance Characteristics:

  • Barrier creation: <1μs (Cython struct)
  • Registration: <10μs (hash table lookup)
  • Memory: O(num_checkpoints × num_channels)
  • Throughput: 1M+ barriers/second
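
To sanity-check the throughput figure, here is a minimal micro-benchmark sketch using the BarrierTracker API shown above (illustrative only, not a bundled benchmark; numbers vary by machine):

import time
import sabot as sb

tracker = sb.BarrierTracker(num_channels=3)

N = 100_000
start = time.perf_counter()
for ckpt in range(N):
    # Register the same barrier on all three channels; the final
    # registration for each checkpoint reports alignment
    for ch in range(3):
        tracker.register_barrier(channel=ch, checkpoint_id=ckpt, total_inputs=3)
elapsed = time.perf_counter() - start

print(f"{3 * N / elapsed:,.0f} barrier registrations/sec")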

Why Cython?

The pure Python implementation had ~100μs of overhead per barrier registration. The Cython version brings this under 10μs through:

  • Direct memory access (no Python object overhead)
  • C-level hash tables
  • No GIL for pure C operations

2. State Module (sabot/_cython/state/)

High-performance state management with multiple backends.

Files

File                   Description                 Backend
state_backend.pyx      Abstract backend interface  -
value_state.pyx        Single-value state          All
map_state.pyx          Key-value map state         All
list_state.pyx         Ordered list state          All
reducing_state.pyx     Reduce aggregations         All
aggregating_state.pyx  Complex aggregations        All
rocksdb_state.pyx      RocksDB backend             Persistent

State Backends

Memory Backend (stores_memory.pyx):

cdef class OptimizedMemoryBackend(StoreBackend):
    cdef:
        unordered_map[string, PyObject*] _data  # C++ hash map
        pthread_mutex_t _lock                   # Thread-safe

    async def get(self, key):
        # Direct C++ map access (no Python dict overhead)
        cdef string c_key = key.encode('utf-8')
        return <object>self._data[c_key]  # cast back to a Python object, <1μs

Performance:

  • Get/Set: 1M+ ops/second
  • Memory overhead: <10% vs pure Python dict
  • Thread-safe: Uses C-level mutexes

RocksDB Backend:

  • Persistent key-value store
  • Optimized for large state (>1GB)
  • Cython bindings to RocksDB C++ API

Comparison:

Operation   Memory (Cython)  Memory (Python)  RocksDB
Get         1μs              50μs             100μs
Set         2μs              100μs            500μs
Batch (1K)  2ms              100ms            50ms
Memory      10MB             100MB            1MB

State Types Implementation

ValueState - Single value:

counter = sb.ValueState(backend, "count")
await counter.update(42)
value = await counter.value()  # O(1)

MapState - Key-value mapping:

profiles = sb.MapState(backend, "users")
await profiles.put("alice", {"age": 30})
user = await profiles.get("alice")  # O(1) hash lookup

ListState - Ordered list:

events = sb.ListState(backend, "events")
await events.add({"type": "login"})
all_events = await events.get()  # O(n)
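
The files table above also lists reducing and aggregating state, which have no example here. A hedged sketch of ReducingState usage, assuming a constructor parallel to the state types above (the reduce_fn parameter name is an assumption, not a confirmed signature):

# Hypothetical signature, modeled on ValueState/MapState above
total = sb.ReducingState(backend, "total", reduce_fn=lambda a, b: a + b)
await total.add(10)        # stored value becomes 10
await total.add(32)        # reduce_fn folds it in: 42
value = await total.get()  # only the reduced value is kept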

3. Time Module (sabot/_cython/time/)

Event-time processing with watermarks and timers.

Files

File                   Description           Purpose
watermark_tracker.pyx  Track watermarks      Out-of-order handling
timers.pyx             Timer service         Delayed processing
event_time.pyx         Event-time utilities  Time extraction
time_service.pyx       Unified time service  All time operations

Watermark Tracking

Problem: Events arrive out-of-order in distributed systems.

Solution: A watermark signals that "all events before time T have arrived."

tracker = sb.WatermarkTracker(num_partitions=3)

# Partition 0: events at time 100, 105, 110
tracker.update_watermark(0, 110)

# Partition 1: events at time 95, 120
tracker.update_watermark(1, 120)

# Partition 2: events at time 90, 100
tracker.update_watermark(2, 100)

# Global watermark = min(110, 120, 100) = 100
# Safe to process all events before time 100
global_wm = tracker.get_global_watermark()  # 100

Implementation:

cdef class WatermarkTracker:
    cdef:
        vector[int64_t] _partition_watermarks  # C++ vector
        int _num_partitions

    cdef int64_t get_global_watermark(self):
        cdef int64_t min_wm = LLONG_MAX
        for i in range(self._num_partitions):
            if self._partition_watermarks[i] < min_wm:
                min_wm = self._partition_watermarks[i]
        return min_wm  # <1μs for 1000 partitions

Performance:

  • Update: <5μs per partition
  • Global watermark: <1μs (single scan)
  • Memory: O(num_partitions)

4. Agent Module (sabot/_cython/agents.pyx)

Actor-based stream processors (planned for Cython optimization).

Currently implemented in pure Python (agent_manager.py); it will be moved to Cython in v0.2.0.

Target performance:

  • Message processing: <10μs per message
  • State access: <1μs with Cython state backends
  • Throughput: 100K+ messages/second per agent

App Lifecycle

1. Application Creation

import sabot as sb

app = sb.App('fraud-detection', broker='kafka://localhost:19092')

What happens:

  1. Create App instance (app.py)
  2. Initialize agent manager (agent_manager.py)
  3. Setup state backends (if enabled)
  4. Connect to Redis (if distributed state enabled)
  5. Connect to PostgreSQL (if durable execution enabled)

2. Agent Registration

@app.agent('transactions')
async def detect_fraud(stream):
    async for transaction in stream:
        yield alert

What happens:

  1. Decorator calls app.agent()
  2. Agent registered with DurableAgentManager
  3. Kafka consumer group created (if broker present)
  4. State allocated for agent

3. CLI Start

sabot -A fraud_app:app worker

What happens:

  1. CLI imports module (fraud_app)
  2. Gets app variable
  3. Calls app.run()
  4. app.run() calls app.start()
  5. Start stream engine
  6. Start all registered agents
  7. Begin consuming from Kafka
  8. Enter event loop

4. Message Processing

Kafka Message
     ↓
Consumer
     ↓
Agent Stream
     ↓
User Function (async for)
     ↓
Process + Update State
     ↓
Yield Result
     ↓
Output Topic

Performance Path:

  1. Kafka Consumer: ~100μs (confluent-kafka C library)
  2. Deserialization: ~50μs (JSON) or ~10μs (Arrow)
  3. Agent Processing: User code (varies)
  4. State Update: ~2μs (Cython memory backend)
  5. Yield: ~50μs (async generator)
  6. Total: ~200μs + user code
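
Putting the stages together, a sketch of an agent that touches each step of the path (field names are illustrative, and backend refers to a configured state backend as in the state examples above):

import sabot as sb

app = sb.App('metrics', broker='kafka://localhost:19092')

@app.agent('transactions')                   # steps 1-2: consume + deserialize
async def running_total(stream):
    totals = sb.MapState(backend, "totals")  # Cython-backed state
    async for txn in stream:                 # step 3: user code
        key = txn['account_id']
        current = (await totals.get(key)) or 0.0
        new_total = current + txn['amount']
        await totals.put(key, new_total)     # step 4: ~2μs state update
        yield {'account': key, 'total': new_total}  # step 5: emit downstream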

5. Checkpointing

Every 5 seconds (configurable):

  1. Trigger: Coordinator triggers checkpoint N
  2. Barrier injection: Insert barrier into all Kafka partitions
  3. Barrier propagation: Barriers flow through processing graph
  4. State snapshot: Each agent snapshots state when barrier arrives
  5. Alignment: Wait for all agents to snapshot
  6. Persistence: Write snapshots to RocksDB/PostgreSQL
  7. Commit: Commit Kafka offsets
  8. Complete: Checkpoint N is durable

Recovery:

  1. Detect failure (agent crash, network partition)
  2. Load latest checkpoint N
  3. Restore state from checkpoint
  4. Seek Kafka to checkpoint offsets
  5. Resume processing
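
A hedged sketch of that sequence; load_latest_checkpoint, restore_state, and seek_to_offsets are hypothetical helpers standing in for Sabot's internal recovery path:

async def recover(agent):
    # Steps 1-2: failure already detected; load the latest durable checkpoint
    ckpt = await load_latest_checkpoint(agent.id)        # hypothetical helper
    # Step 3: restore operator state from the snapshot
    await restore_state(agent, ckpt.state_snapshot)      # hypothetical helper
    # Step 4: rewind the Kafka consumer to the checkpointed offsets
    await seek_to_offsets(agent.consumer, ckpt.offsets)  # hypothetical helper
    # Step 5: resume processing from a consistent point
    await agent.start()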

Performance Optimizations

1. Zero-Copy with Arrow

# Instead of per-element Python objects (slow)
for i in range(len(batch)):
    process(batch[i])  # Creates a Python object per element

# Use Arrow zero-copy (fast)
import numpy as np
import pyarrow as pa

for batch in arrow_stream:
    # batch is an Arrow RecordBatch (columnar, zero-copy)
    amounts = batch.column('amount').to_numpy()  # Zero-copy view
    total = np.sum(amounts)  # SIMD-accelerated

Speedup: 10-100x for analytical operations

2. Cython State Backends

# Pure Python (slow)
state = {}
state['key'] = value  # 100μs

# Cython backend (fast)
await backend.set('key', value)  # 2μs

Speedup: 50x for state operations

3. Batching

# Process one-by-one (slow)
for msg in stream:
    process(msg)  # 1K msgs/sec

# Batch processing (fast)
async for batch in stream.batch(size=100):
    process_batch(batch)  # 10K msgs/sec

Speedup: 10x with batching

4. Async I/O

# Synchronous (slow)
result = external_api.call()  # Blocks thread

# Async (fast)
result = await external_api.call_async()  # Non-blocking

Concurrency: 1K+ concurrent operations
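
The concurrency figure comes from multiplexing many awaits on a single event loop; a minimal sketch, assuming call_async is any awaitable client method:

import asyncio

async def fan_out(external_api, requests):
    # Launch all calls concurrently; the event loop interleaves them
    # while each one waits on I/O (call_async is a stand-in)
    return await asyncio.gather(
        *(external_api.call_async(r) for r in requests)
    )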

Comparison to Flink

Similarities

Feature           Flink              Sabot
Event Time        ✅                 ✅
Watermarks        ✅                 ✅
Exactly-Once      ✅                 ✅
State Management  ✅                 ✅
Checkpointing     ✅ Async barriers  ✅ Chandy-Lamport
Windowing         ✅                 (planned)
CEP               ✅                 🚧 (planned)

Differences

Aspect        Flink          Sabot
Language      Java/Scala     Python
Performance   100K+ txn/s    5-10K txn/s
Startup Time  ~30s (JVM)     ~2s
Memory        2-4GB min      100MB min
Deployment    JAR + cluster  pip install + CLI
Development   Java IDE       Python REPL

When to Use Each

Use Flink when:

  • Need 100K+ messages/second
  • Large JVM ecosystem integration
  • Enterprise deployment with YARN/K8s

Use Sabot when:

  • Python-native development
  • Rapid prototyping
  • 5-10K messages/second sufficient
  • Want simple pip install + CLI

Architecture Decisions

Why Cython?

Alternatives considered:

  1. Pure Python: Too slow (100x slower)
  2. Rust/C++ with bindings: Complex, hard to debug
  3. PyPy: Incompatible with C extensions (NumPy, Arrow)
  4. Numba: Limited to numerical code

Cython wins:

  • ✅ 10-100x speedup
  • ✅ Seamless Python integration
  • ✅ Easy to debug (generates C code)
  • ✅ Compatible with all Python libraries

Why Arrow?

Alternatives:

  1. Python objects: Slow, memory inefficient
  2. Pandas: Too heavy, not streaming-friendly
  3. Protocol Buffers: Row-oriented, no SIMD

Arrow wins:

  • ✅ Columnar format (SIMD-friendly)
  • ✅ Zero-copy operations
  • ✅ Language interop (C++, Java, Rust)
  • ✅ Streaming-friendly (RecordBatch)

Why Kafka?

Alternatives:

  1. RabbitMQ: Not designed for streaming
  2. Pulsar: Less mature ecosystem
  3. Kinesis: AWS-only
  4. Redpanda: Compatible with Kafka API ✅

Kafka/Redpanda wins:

  • ✅ Industry standard
  • ✅ Proven at scale
  • ✅ Rich ecosystem
  • ✅ Redpanda: No ZooKeeper, easier ops

Why PostgreSQL for Durable Execution?

Inspired by DBOS:

  • ✅ ACID transactions
  • ✅ Mature, well-understood
  • ✅ Good performance
  • ✅ Rich querying (vs key-value stores)

Future Architecture

v0.2.0 (Coming Soon)

  • 🚧 Arrow batch processing (Cython)
  • 🚧 Redis Cython extension
  • 🚧 SQL/Table API
  • 🚧 Web UI

v0.3.0+ (Future)

  • 📋 GPU acceleration (RAFT integration)
  • 📋 Advanced CEP
  • 📋 Query optimizer
  • 📋 Native S3/HDFS sources

Performance Targets

Current (v0.1.0)

  • Throughput: 5-10K txn/s
  • Latency p99: <1ms
  • Memory: <500MB
  • Startup: <2s

Target (v0.3.0)

  • Throughput: 50-100K txn/s (10x improvement)
  • Latency p99: <500μs (2x improvement)
  • Memory: <500MB (same)
  • Startup: <1s (2x improvement)

How?

  • Arrow batch processing (10x)
  • More Cython modules (2x)
  • Zero-copy everywhere (2x)
  • Better async I/O (2x)

Next: See CLI Guide for deployment.