There are two serialization methods designed for Ethereum – Recursive Length Prefix (RLP) and the newer Simple Serialize (SSZ). This post covers the problem they are trying to solve, a short overview of each format, and some other thoughts.
The problem: data needs to be encoded and decoded for transmission over the wire, but also for hash verification (a transaction is signed by signing the hash of its RLP-encoded data, and blocks are identified by the hash of their RLP-encoded header). Additionally, in some cases, there should be support for efficient encoding of the Merkle tree data structure.
The properties needed by the serialization format are:
- Deterministic – The encoding must be unambiguous, as it is used to identify and verify data.
- Efficient – The discovery protocol works over UDP, so the message format has to be tiny.
A brief introduction to Recursive-length Prefix (RLP):
There are two encodable structures: a string (i.e., a byte array) and a list of items, where each item is itself a string or a list.
The rules:
- If it is a byte array ("string") consisting of a single byte in the range 0x00-0x7f, that byte is its own encoding.
- If it is a byte array containing fewer than 56 bytes, first write the prefix 0x80 + len(string) and then the byte array.
- If the byte array is 56 bytes or longer, write the prefix 0xb7 plus the number of bytes needed to represent the length, followed by the length of the byte array as a big-endian integer (in the minimal number of bytes), and then the byte array.
- For lists – if the concatenated serialization of the elements (the payload) is less than 56 bytes, the output is the payload with a prefix of 0xc0 plus the length of the payload in bytes. If it is longer, prefix with 0xf7 plus the number of bytes needed to represent the payload length, then the payload length as a big-endian integer, then the payload.
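The rules above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the names `rlp_encode` and `encode_length` are made up for this example, and it only accepts `bytes` and (possibly nested) `list` values.

```python
from typing import Union

# An RLP item is a byte string or a (possibly nested) list of items.
RLPItem = Union[bytes, list]


def encode_length(length: int, offset: int) -> bytes:
    """Build the prefix for a payload of `length` bytes.

    `offset` is 0x80 for strings and 0xc0 for lists.
    """
    if length < 56:
        # Short form: the length fits into the prefix byte itself.
        return bytes([offset + length])
    # Long form: the prefix byte records how many bytes the length occupies,
    # followed by the length as a minimal big-endian integer.
    length_bytes = length.to_bytes((length.bit_length() + 7) // 8, "big")
    return bytes([offset + 55 + len(length_bytes)]) + length_bytes


def rlp_encode(item: RLPItem) -> bytes:
    if isinstance(item, bytes):
        if len(item) == 1 and item[0] < 0x80:
            return item  # a single byte in 0x00-0x7f is its own encoding
        return encode_length(len(item), 0x80) + item
    if isinstance(item, list):
        # The payload is the concatenation of the encoded elements.
        payload = b"".join(rlp_encode(child) for child in item)
        return encode_length(len(payload), 0xC0) + payload
    raise TypeError(f"cannot RLP-encode {type(item).__name__}")
```

For example, `rlp_encode(b"dog")` gives `b"\x83dog"`, and `rlp_encode([b"cat", b"dog"])` gives `b"\xc8\x83cat\x83dog"`: the list prefix 0xc8 is 0xc0 plus the 8-byte payload.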
A brief introduction to Simple Serialize (SSZ):
SSZ is new in Ethereum 2.0 and replaces RLP as the encoding for the new consensus layer. It requires a schema to be known by both parties ahead of time. Integers (unsigned only) and booleans are converted to little-endian bytes. Fixed-size "composite" types are encoded as the concatenation of their fields' byte strings. Containers with variable-size fields store a 4-byte offset in place of each such field, and the actual data is stored on a heap after the fixed-length part.
You can read the full spec here.
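To make the offset-and-heap layout concrete, here is a rough Python sketch of encoding one container. The helpers `uint64`, `boolean`, and `ssz_encode_container` are hypothetical names for this example, and the container is an invented one; each field is passed in already serialized, tagged with whether the schema marks it as variable-size.

```python
OFFSET_SIZE = 4  # offsets are 4-byte little-endian unsigned integers


def uint64(value: int) -> bytes:
    """Fixed-size field: 8 bytes, little-endian."""
    return value.to_bytes(8, "little")


def boolean(value: bool) -> bytes:
    """Fixed-size field: a single byte, 0x01 or 0x00."""
    return b"\x01" if value else b"\x00"


def ssz_encode_container(fields: list) -> bytes:
    """Encode a container from (is_variable, serialized_bytes) pairs in schema order."""
    # The fixed part holds every fixed-size field plus one offset per variable field.
    fixed_len = sum(OFFSET_SIZE if is_var else len(data) for is_var, data in fields)
    fixed_part = b""
    heap = b""
    for is_var, data in fields:
        if is_var:
            # The offset points at where this field's data starts, counted
            # from the beginning of the container's serialization.
            fixed_part += (fixed_len + len(heap)).to_bytes(OFFSET_SIZE, "little")
            heap += data
        else:
            fixed_part += data
    return fixed_part + heap


# A made-up container: {a: uint64, b: byte list, c: boolean, d: byte list}
encoded = ssz_encode_container([
    (False, uint64(5)),
    (True, b"\xaa\xbb"),
    (False, boolean(True)),
    (True, b"\xcc"),
])
```

Here the fixed part is 17 bytes (8 + 4 + 1 + 4), so the two offsets are 17 and 19, and the three bytes of variable data follow at the end. A decoder needs the same schema to know which positions hold offsets.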
Some thoughts:
- There's a cost to a non-standard encoding. There are no standard libraries that include implementations. However, today, RLP has implementations in 15 different languages.
- Upgradability – what happens when fields are added, removed, or modified? Protobuf handles this the best (in my opinion) – new fields are ignored by old clients, and old fields (should) never be deleted.
- RLP (as implemented in the main Go client) uses reflection to encode data. There's a significant performance hit to using reflection, especially for something as common as encoding. Protobuf gets around this by (1) requiring a schema and (2) using code generation for clients.
- SSZ can't be streamed easily.
- RLP is tied to the underlying data types – it cannot encode signed integers (no negative values) and only supports integers up to 2^64.
- RLP/SSZ can be ambiguous (at least in Go) when dealing with zero values.
- The obvious answer (to me) is protobuf. While it has poor support for some of the datatypes commonly used by Ethereum, it is ubiquitous and can be made to support the properties that Ethereum requires (namely, deterministic messages).
- Most of these implementations will probably end up making data available in other encodings, e.g., gRPC or HTTP APIs.
- There might be a cost to supporting two separate serialization formats in the execution client (RLP) and consensus client (SSZ). To add to the confusion, SSZ encodes integers as little-endian and RLP as big-endian.
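Two of the points above, zero values and byte order, are easy to see in a small Python sketch. It assumes the usual Ethereum convention that RLP integers are written as big-endian byte strings with no leading zeros; the function names are made up for this example, and the RLP helper only handles strings shorter than 56 bytes. The zero-value case shown is one plausible source of the ambiguity, not necessarily the only one.

```python
def int_to_rlp_payload(value: int) -> bytes:
    """Big-endian, minimal-length bytes; zero becomes the empty byte string."""
    return value.to_bytes((value.bit_length() + 7) // 8, "big")


def rlp_encode_short_string(payload: bytes) -> bytes:
    """RLP for byte strings under 56 bytes (enough for this illustration)."""
    if len(payload) >= 56:
        raise ValueError("this sketch only handles short strings")
    if len(payload) == 1 and payload[0] < 0x80:
        return payload
    return bytes([0x80 + len(payload)]) + payload


def ssz_uint64(value: int) -> bytes:
    """SSZ uint64: always 8 bytes, little-endian."""
    return value.to_bytes(8, "little")


# Zero values: the integer 0 and the empty byte string share one RLP encoding,
# so a decoder cannot tell them apart without knowing the expected type.
rlp_zero = rlp_encode_short_string(int_to_rlp_payload(0))
rlp_empty = rlp_encode_short_string(b"")

# Byte order: the same integer, 1024, in each format.
rlp_1024 = rlp_encode_short_string(int_to_rlp_payload(1024))
ssz_1024 = ssz_uint64(1024)
```

Both `rlp_zero` and `rlp_empty` come out as the single byte 0x80. For 1024, RLP produces `82 04 00` (most significant byte first, no padding), while SSZ produces `00 04 00 00 00 00 00 00` (least significant byte first, fixed width).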