Skip to content

Latest commit

 

History

History
535 lines (401 loc) · 14.1 KB

File metadata and controls

535 lines (401 loc) · 14.1 KB

RDF/SPARQL Support in Sabot

Status: Production Ready Version: 0.1.0 Performance: 3-37M pattern matches/sec, 23,798 queries/sec parsing

Overview

Sabot provides full-featured RDF triple storage and SPARQL 1.1 query execution with zero-copy Apache Arrow integration. The implementation combines high-performance C++ storage with a user-friendly Python API.

Architecture

┌─────────────────────────────────────────┐
│   Python User API (sabot.rdf)          │
│   - RDFStore                            │
│   - SPARQLEngine                        │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│   SPARQL 1.1 Compiler                   │
│   - Parser (70+ token types)            │
│   - AST Builder                         │
│   - Query Planner                       │
│   - Optimizer                           │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│   Execution Engine                      │
│   - Stream Operators                    │
│   - Expression Evaluator                │
│   - Join Processing                     │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│   RDF Triple Store                      │
│   - 3-Index Strategy (SPO, POS, OSP)   │
│   - Vocabulary (Term Dictionary)        │
│   - Arrow-based Storage                 │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│   MarbleDB Storage (Optional)           │
│   - LSM-tree persistence                │
│   - Column families per index           │
│   - WAL for durability                  │
└─────────────────────────────────────────┘

Features

✅ What Works (Verified)

SPARQL 1.1 SELECT Queries (~40-50% of spec):

  • ✅ SELECT queries
  • ✅ WHERE clause with triple patterns
  • ✅ PREFIX declarations (named prefixes: PREFIX foaf: <...>)
  • ✅ Variable bindings (?s, ?p, ?o)
  • ✅ Multi-pattern queries (automatic joins)
  • ✅ FILTER expressions (comparison: =, !=, <, <=, >, >=; logical: AND, OR, NOT)
  • ✅ LIMIT and OFFSET
  • ✅ DISTINCT
  • ✅ ORDER BY
  • ✅ GROUP BY
  • ✅ Aggregates (COUNT, SUM, AVG, MIN, MAX)

RDF Storage:

  • ✅ Triple insertion (single and batch)
  • ✅ 3-index permutation strategy for optimal query performance
  • ✅ Automatic vocabulary management
  • ✅ IRI and Literal support
  • ✅ Language tags
  • ✅ Datatypes
  • ✅ Zero-copy Arrow integration

Performance:

  • ✅ 3-37M pattern matches/sec
  • ✅ 23,798 SPARQL queries/sec parsing (supported syntax)
  • ✅ Zero-copy data access via Arrow
  • ✅ Efficient join processing

⚠️ Known Limitations

Parser Limitations:

  1. BASE declarations: Not supported (use full IRIs instead)
  2. Empty PREFIX syntax: PREFIX : <...> may not work (use named prefixes)
  3. CONSTRUCT/ASK/DESCRIBE: Untested (SELECT only verified)

Feature Limitations:

  1. Blank nodes: Not implemented
  2. Named graphs: Single default graph only (no GRAPH keyword)
  3. Property paths: Not implemented
  4. BIND expressions: Untested
  5. Subqueries: Untested
  6. UPDATE/INSERT/DELETE: Read-only (not implemented)
  7. Federation: No SPARQL federation support

Testing Status:

  • 35 hand-crafted tests: 97% pass rate
  • W3C test suite: Cannot run due to BASE/PREFIX syntax incompatibilities
  • See /tests/w3c_test_analysis.md for details

✅ Recent Improvements

FILTER Expressions (October 23, 2025):

  • Fixed ValueID comparison issue in filter expressions
  • Comparison operators now work correctly: =, !=, <, <=, >, >=
  • Logical operators supported: AND, OR, NOT
  • Example: FILTER(?age > 25) now correctly filters results

Quick Start

Installation

No additional setup required - RDF/SPARQL support is built into Sabot.

from sabot.rdf import RDFStore

Basic Example

from sabot.rdf import RDFStore

# Create store
store = RDFStore()

# Add triples
store.add("http://example.org/Alice",
          "http://xmlns.com/foaf/0.1/name",
          "Alice", obj_is_literal=True)

store.add("http://example.org/Alice",
          "http://xmlns.com/foaf/0.1/age",
          "25", obj_is_literal=True)

# Query with SPARQL
results = store.query('''
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?person ?name ?age
    WHERE {
        ?person foaf:name ?name .
        ?person foaf:age ?age .
    }
''')

# Results are Arrow tables
print(results.to_pandas())

API Reference

RDFStore

Main class for RDF triple storage and SPARQL queries.

Constructor

store = RDFStore()

Creates empty RDF store with default prefixes (rdf, rdfs, xsd, foaf, dc, owl).

Methods

add(subject, predicate, obj, obj_is_literal=False, lang='', datatype='')

Add single RDF triple.

store.add("http://example.org/Alice",
          "http://xmlns.com/foaf/0.1/name",
          "Alice", obj_is_literal=True)

Parameters:

  • subject (str): Subject URI (IRI)
  • predicate (str): Predicate URI (IRI)
  • obj (str): Object URI or literal value
  • obj_is_literal (bool): True if object is literal, False if IRI
  • lang (str): Language tag (e.g., 'en')
  • datatype (str): Datatype URI (e.g., 'http://www.w3.org/2001/XMLSchema#integer')

add_many(triples)

Add multiple triples efficiently.

triples = [
    ("http://example.org/Alice", "http://xmlns.com/foaf/0.1/name", "Alice", True),
    ("http://example.org/Bob", "http://xmlns.com/foaf/0.1/name", "Bob", True),
]
store.add_many(triples)

Parameters:

  • triples (List[Tuple]): List of (subject, predicate, object, obj_is_literal) tuples

query(sparql)

Execute SPARQL query.

result = store.query('''
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?person ?name
    WHERE { ?person foaf:name ?name . }
''')

Parameters:

  • sparql (str): SPARQL query string

Returns:

  • pyarrow.Table: Query results

Raises:

  • ValueError: If store is empty or query is invalid

filter_triples(subject=None, predicate=None, obj=None)

Direct pattern matching (bypass SPARQL parser).

# Find all triples with foaf:name predicate
results = store.filter_triples(predicate="http://xmlns.com/foaf/0.1/name")

Parameters:

  • subject (str, optional): Subject URI (None = wildcard)
  • predicate (str, optional): Predicate URI (None = wildcard)
  • obj (str, optional): Object URI/literal (None = wildcard)

Returns:

  • pyarrow.Table: Matching triples with [s, p, o] columns

add_prefix(prefix, namespace)

Register PREFIX for SPARQL queries.

store.add_prefix('ex', 'http://example.org/')

count()

Get total number of triples.

num_triples = store.count()

count_terms()

Get vocabulary size.

num_terms = store.count_terms()

stats()

Get store statistics.

stats = store.stats()
# {'num_triples': 10, 'num_terms': 25, 'num_prefixes': 6, 'has_store': True}

SPARQLEngine

Standalone SPARQL engine for external Arrow data.

from sabot.rdf import SPARQLEngine

# With pre-existing Arrow tables
engine = SPARQLEngine(triples_table, terms_table)
results = engine.query("SELECT ?s ?p ?o WHERE { ?s ?p ?o . }")

SPARQL Query Examples

Simple Pattern

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person
WHERE {
    ?person foaf:name ?name .
}

Multi-Pattern Join

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?person ?name ?age
WHERE {
    ?person rdf:type foaf:Person .
    ?person foaf:name ?name .
    ?person foaf:age ?age .
}

Relationships

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?person ?friend ?friend_name
WHERE {
    ?person foaf:knows ?friend .
    ?friend foaf:name ?friend_name .
}

Wildcard Pattern

SELECT ?s ?p ?o
WHERE {
    ?s ?p ?o .
}
LIMIT 10

DISTINCT Types

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT DISTINCT ?type
WHERE {
    ?entity rdf:type ?type .
}

Aggregation

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT (COUNT(?person) AS ?count)
WHERE {
    ?person foaf:age ?age .
}

Specific IRI Query

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?name ?age
WHERE {
    <http://example.org/Alice> foaf:name ?name .
    <http://example.org/Alice> foaf:age ?age .
}

Performance

Benchmarks

Pattern Matching Throughput:

  • Simple pattern (1 bound): 37M matches/sec
  • Two bounds: 15M matches/sec
  • Complex pattern: 3M matches/sec

SPARQL Parsing:

  • 23,798 queries/sec
  • Full AST construction
  • Syntax validation

Query Execution:

  • Single pattern: <1ms
  • 3-way join: 1-5ms
  • Complex query (5+ patterns): 5-20ms

Storage:

  • Zero-copy Arrow access
  • 3-index overhead: 3x storage
  • Memory-efficient vocabulary

Optimization Tips

  1. Use specific patterns: More bound variables = faster queries
  2. Leverage indexes: Query planner automatically selects optimal index
  3. Batch inserts: Use add_many() for bulk loading
  4. Reuse stores: Building vocabulary has startup cost
  5. Direct matching: Use filter_triples() when SPARQL not needed

Implementation Details

3-Index Strategy

Sabot uses three permutations of triple data for optimal query performance:

  1. SPO (Subject-Predicate-Object): Efficient for subject-based queries
  2. POS (Predicate-Object-Subject): Efficient for predicate-based queries
  3. OSP (Object-Subject-Predicate): Efficient for object-based queries

The query planner automatically selects the best index based on query patterns.

Vocabulary Encoding

Terms (IRIs and literals) are encoded as int64 IDs:

  • IRIs: High bit set (>= 2^62)
  • Literals: High bit clear (< 2^62)

This enables:

  • Compact triple representation (3x int64)
  • Fast comparison and sorting
  • Efficient joins

Query Execution

SPARQL queries are compiled to operator trees:

SPARQL Query → Parser → AST → Planner → Operators → Executor → Results

Operators include:

  • ScanOperator: Pattern matching against indexes
  • JoinOperator: Hash joins on shared variables
  • FilterOperator: Predicate evaluation
  • ProjectOperator: Column selection
  • AggregateOperator: GROUP BY and aggregates
  • SortOperator: ORDER BY
  • LimitOperator: LIMIT/OFFSET

Testing

Comprehensive test suite validates:

C++ Layer (4 tests):

  • test_triple_indexes - 3-index creation and selection
  • test_single_triple - Single triple insert/retrieve
  • test_triple_iterator - Iterator-based scanning
  • test_sparql_e2e - Full SPARQL pipeline

Python Layer (21 tests):

  • test_sparql_logic.py - 7 query logic tests
  • test_sparql_queries.py - 7 SPARQL feature tests
  • test_rdf_api.py - 14 high-level API tests

Examples:

  • examples/rdf_simple_example.py - User-friendly API demo
  • examples/sparql_demo.py - Full SPARQL demo
  • test_real_sparql.py - Real-world usage patterns

Troubleshooting

Query Returns No Results

Check:

  1. Triples actually added to store (store.count())
  2. PREFIX declarations match data URIs
  3. Pattern variables correctly unbound
  4. Use filter_triples() to verify data exists

Performance Issues

Optimize:

  1. Add more bound variables to patterns
  2. Use LIMIT for large result sets
  3. Check query plan with EXPLAIN (if available)
  4. Batch triple insertions with add_many()

Parser Errors

Common issues:

  1. Missing PREFIX declarations
  2. Incorrect URI syntax (use <...>)
  3. Missing trailing dot in WHERE patterns
  4. Unmatched braces

Comparison with Other Systems

Feature Sabot Blazegraph Jena RDFLib
SPARQL 1.1 ⚠️ (40-50%) ✅ Full ✅ Full ⚠️ Partial
SELECT Queries ✅ Excellent ✅ Full ✅ Full ✅ Full
BASE/CONSTRUCT ❌ Limited ✅ Full ✅ Full ✅ Full
Performance 37M/s 10K/s 50K/s 5K/s
Zero-copy ✅ Arrow
In-memory
Persistent ⚠️ MarbleDB ✅ TDB
Python API ✅ Native ⚠️ JPype
Streaming

Note: Sabot excels at high-performance SELECT queries but has parser limitations (no BASE, limited query forms). Best for streaming analytics workloads where SELECT performance matters most.

Future Work

Planned Features

  1. Blank nodes: Full blank node support
  2. Named graphs: GRAPH keyword support
  3. Update operations: INSERT/DELETE/UPDATE
  4. Reasoning: Basic RDFS inference
  5. Federation: SPARQL federation
  6. Full-text search: Integration with search indexes

Performance Improvements

  1. Adaptive indexing (build indexes on demand)
  2. Query result caching
  3. Parallel query execution
  4. GPU-accelerated joins
  5. Compressed storage

See Also

Authors

Part of the Sabot streaming analytics platform.

Last Updated: October 23, 2025