Exploring Serialization via Protobuf and Others #150

@prestonvanloon

Description

This issue exists to track progress on exploring other serialization strategies for sharding and Ethereum. We'll likely want to move this into a new repository once work has started.

Motivation

Maintaining RLP and other Ethereum-specific serialization mechanisms feels like reinventing the wheel when a better-supported open source library may already exist.

The main motivation for RLP:

The alternative to RLP would have been using an existing algorithm such as protobuf or BSON; however, we prefer RLP because of (1) simplicity of implementation, and (2) guaranteed absolute byte-perfect consistency.

The question to answer is whether protocol buffers or another existing mechanism already solves the problems RLP was designed for.

Challenges with Hashing in Different languages

Key/value maps in many languages have no defined iteration order, and floating point formats have many special cases, so the same logical data can produce different encodings and therefore different hashes.

See RLP design rationale for more context.
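As a concrete illustration of the map-ordering problem (a sketch, not part of the proposal; JSON and SHA-256 stand in for whatever encoding and hash function are actually used): naive serialization in Python follows dict insertion order, so two equal maps can hash differently unless a canonical ordering is enforced.

```python
import hashlib
import json

a = {"nonce": 1, "price": 2}
b = {"price": 2, "nonce": 1}  # same logical data, different insertion order

def naive_hash(obj):
    # json.dumps emits keys in dict insertion order by default
    return hashlib.sha256(json.dumps(obj).encode()).hexdigest()

def canonical_hash(obj):
    # sort_keys forces a single byte-level encoding per logical value
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

print(naive_hash(a) == naive_hash(b))          # False: encodings differ
print(canonical_hash(a) == canonical_hash(b))  # True: canonical form agrees
```

This is exactly the consistency guarantee RLP provides by construction, and what any replacement would have to provide as well.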

Google Protobuf

Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.

How to test consistency across all languages?

One option is to write a gRPC service definition and implement the test in each popular language. The test would be easy to extend to another language, provided that it implements the service.

gRPC server for each language

Example service definition:

service SerializerTest {
  rpc TestHash(Block) returns (Hash) {}
}

message Block {
  Header header = 1;
  repeated Header uncles = 2;
  repeated Transaction transactions = 3;

  message Header {
    bytes parent_hash = 1;
    bytes uncles_hash = 2;
    ...
  }
}

message Transaction {
  uint64 nonce = 1;
  uint64 price = 2;
  ...
}

// Hash result 
message Hash {
  bytes hash = 1;
  Block block = 2;
}

The request proto carries an object resembling a block; the service responds with the resulting hash. The test then compares this against the actual hash.

The test can and should be populated with real Ethereum blocks that have been mined and their associated hash. This provides solid evidence that these test cases are valid.

Why set up this infrastructure of gRPC services?

The main idea is that we can run these tests against each language with an agnostic client, in isolation.

Why gRPC?

gRPC offers low boilerplate, code generation, and structured payloads.

List of official supported languages

  • C++
  • Java
  • Python
  • Go
  • Ruby
  • C#
  • Node.js
  • Android Java
  • Objective-C
  • PHP
  • Dart

List of 3rd party supported languages

There are probably many more languages...

How does the test client work?

The test client will act as a command line tool and most likely read from a series of config files.

We can imagine at least one config listing the services to hit and another holding the test cases.

The client will send the test proto to each of the services listed, in parallel. At the end of test execution, the client will print and/or write a report of pass/fail for test cases.
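A minimal sketch of that client loop, with plain callables standing in for the generated gRPC stubs (in the real client these would be `protoc`-generated `SerializerTest` stubs called over a channel per configured address; the service names and hash values here are placeholders for the demo):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for per-language gRPC stubs: each maps a block to the hash
# that service computed for it.
services = {
    "Java": lambda block: b"\xaa",
    "Go": lambda block: b"\xaa",
    "JavaScript": lambda block: b"\xbb",  # deliberately wrong for the demo
}

def run_case(name, block, expected):
    """Send one test block to every service in parallel, report PASS/FAIL."""
    def check(item):
        lang, stub = item
        got = stub(block)
        if got == expected:
            return f"{lang} - PASS"
        return f"{lang} - FAIL - Wanted hash {expected!r} got {got!r}"

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(check, services.items()))
    return [name] + results

for line in run_case("Test 1", block={"header": {}}, expected=b"\xaa"):
    print(line)
```

The per-case results keep service order, so the report reads the same way on every run regardless of which service answered first.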

Example output of the client:

./run_tests

Running 5 test cases

Test 1
Java - PASS
Go - PASS
JavaScript - FAIL - Wanted hash ... got ...
Python - PASS

Test 2
...

Example services config:

services = [
   ["java", "127.0.0.1:5001"],
   ["go", "127.0.0.1:5002"],
   # ...
]

Example test protos:

TODO: Real blocks with hashes in a proto-supported format.

What about service orchestration?

Maybe using docker compose?

Starting many gRPC services locally without a single command would be annoying.
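A hedged sketch of what that could look like with Docker Compose (the service names, build directories, and ports are assumptions for illustration, not existing artifacts):

```yaml
# docker-compose.yml sketch: one container per language implementation,
# each exposing its SerializerTest gRPC server on a distinct host port.
version: "3"
services:
  serializer-java:
    build: ./java      # hypothetical per-language directory
    ports:
      - "5001:5000"
  serializer-go:
    build: ./go
    ports:
      - "5002:5000"
  # ... one entry per additional language
```

`docker-compose up` would then start every service with a single command, matching the addresses in the services config.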

What about benchmarks?

Benchmarks are important, but we already know RLP's serialization performance compares unfavorably.

We can add language specific benchmarks after we answer the question: will this work at all?
