Introducing Nachricht – a Self-describing Data Interchange Format

von Dr. Liv Fischer | 19. März 2021 | Allgemein, English, Software Engineering

Dr. Liv Fischer

Lead Developer und Expertin für Rust

Over the last months, I developed and implemented a new binary encoding scheme called “nachricht” to use as a drop-in replacement for JSON. You can find its specification and implementation online. If you are interested in my thought process and reasoning, read on!

Motivation

One of the things that distinguish me as a developer can be summarized as: I don’t like bloat. Why is this a distinguishing feature you might ask? After all, no one likes bloat. There are, however, different tiers of bloat-tolerance. Many colleagues accept a certain amount of redundancy if it allows them to be done faster. This is, of course, a trade-off between “developer productivity” and efficient implementations. We as an industry accept the overhead of virtual machines because it frees us of the burden of using developer time on memory management. Which is seemingly the economical decision in most of the cases: CPU cycles are far cheaper than programmer-brain-cycles these days! I still beg to differ: this thinking only creates economical solutions if you do short-term planning. Every line of code that needn’t be there but still is needs to be maintained for the whole lifecycle of your product. If it contains a bug, that bug needs to be fixed, if it is part of a feature that needs changing, a future programmer will have to read and understand it and so on. You save money now by writing more code than you necessarily had to but you pay it back with dividends over the next decade.
I believe this to be true to such an extent that I apply my philosophy of no bloat whatsoever to every aspect of my job, including the more lowlevel infrastructure stuff (where it might even be the most important: the hotter your loop the more optimizations pay off). One of these corners is the domain of data serialization. Here I especially loathe bloat “on the wire”, i.e. bytes getting needlessly transmitted over a network because an inefficient format was chosen.

What is serialization and when do you need it

This section gives a very brief introduction to serialization in itself. If you are already familiar with the concept, you can safely skip it.

In computing, data structures are very rarely laid out in one contiguous block of memory. Think for instance of a harmless string. This consists in most safe languages of a heap allocation which holds the actual data and an additional struct holding a pointer to the allocation, its size and the length of the string. Even if data and metadata are consecutive in memory in the beginning (they won’t be in most cases), as soon as you mutate the string by adding more characters to it than its allocation can hold, your runtime will silently reallocate the data region with increased capacity and copy the string over.
Now metadata and actual data are separated for sure. Usually a reference to the “string” resolves to the metadata struct. This makes string handling a breeze in any high level language so it is in general a good thing. But as soon as we want to send a message somewhere, we have a problem: Sending only the metadata struct to another process would not work since it contains a pointer into our address space. Dereferencing that pointer in the address space of our receiving process would result in a segmentation violation in the best case and in corrupt data in the worst. Thus if we need to communicate with anyone outside of our own address space we need to write all information into one coherent string of bytes which contains no absolute pointers. Relative pointers are okay if they only reference other areas within the same message.
This process is called serialization and since this problem is so common, there are a lot of solutions already in place, each one picking a different point in the parameter space of trade-offs between serialization efficiency, deserialization efficiency and message size. From old veterans like ASN.1 to new challengers like YAML, from binary formats like Protocol Buffers to ones optimized for human readability like TOML, every possible taste is accounted for. But of course, the dominating format in this day and age is JSON.

Why JSON is not good enough

JSON rose to popularity not on technical merit but pragmatism: it is natively understood by JavaScript and therefore if you want to send a message to a browser and do so in JSON, you don’t have to write any additional glue code to make it work. Apart from this admittedly strong argument, JSON is just “good enough” for most people. It allows you to serialize bools, numbers, strings, arrays and maps and with enough yoga you can press pretty much every data type and structure you can think off onto this base. Here is an example that describes the basic properties of a cat:

{
    "name": "Sphinx",
    "species": "FelisCatus"
}

JSON is often said to be human-readable, i.e. you, the reader, can understand that the above message describes a cat named “Sphinx” which is a common house cat. This is only partly true though: usually, JSON gets generated without any insignificant whitespace so the above message would actually read {"name":"Sphinx","species":"FelisCatus"} and while that is still somewhat readable, it gets out of hand rather quickly as message sizes increase. Thus people usually use formatting tools in their workflows with JSON to help themselves. Keep this in mind for later in the discussion of binary formats.

As for the level of bloat in a typical JSON message, this first one doesn’t look too bad but it already uses too many bytes: everything is enclosed in double quotes even if it wouldn’t have to and the pair of enclosing braces doesn’t signify much either. This worsens fast with increasing message size. Here is one that describes a bunch of cats at once:

{
  "version": 1,
  "cats": [
    {
      "name": "Jessica",
      "species": "PrionailurusViverrinus"
    },
    {
      "name": "Wantan",
      "species": "LynxLynx"
    },
    {
      "name": "Sphinx",
      "species": "FelisCatus"
    },
    {
      "name": "Chandra",
      "species": "PrionailurusViverrinus"
    }
  ]
}

As you can see, the keys “name” and “species” are repeated four times while the (probably an enum constant) species name “PrionailurusViverrinus” is present twice as well. In listings with hundreds or even thousands of cats, this stacks up fast – and is the reason why CSV is still not dead. CSV on the other hand is a poor choice when any kind of nesting of containers is necessary. You might think that this doesn’t matter much and often it doesn’t: reading and writing JSON is incredibly fast, again thanks to pragmatism and not merit: enough people are relying on JSON that encoders and decoders have been optimized a lot to squeeze even the tiniest bit of additional performance out of it. And if message size is a problem, you can always gzip it, right? Well, not always. First, there is human readability. The above message looks like this after gzipping (printed with hexdump -C for your convenience):

00000000  1f 8b 08 08 cc 06 eb 5f  00 03 63 61 74 73 2e 6a  |......._..cats.j|
00000010  73 6f 6e 00 ab e6 52 50  50 2a 4b 2d 2a ce cc cf  |son...RPP*K-*...|
00000020  53 b2 52 30 d4 01 f1 93  13 4b 8a 81 9c 68 20 5b  |S.R0.....K...h [|
00000030  41 a1 1a 4c 02 45 f3 12  73 53 81 a2 4a 5e a9 c5  |A..L.E..sS..J^..|
00000040  c5 99 c9 89 4a 3a 30 89  e2 82 d4 e4 cc 54 90 0e  |....J:0......T..|
00000050  a5 80 22 a0 39 89 99 39  a5 45 a5 c5 61 99 40 73  |..".9..9.E..a.@s|
00000060  8b 32 f3 4a 8b 95 c0 2a  6b 75 b0 9b 17 9e 98 57  |.2.J...*ku.....W|
00000070  92 98 87 d5 38 9f ca bc  0a 10 c6 6f 40 70 41 46  |....8......o@pAF|
00000080  26 50 0d 36 03 dc 52 73  32 8b 9d 13 4b 08 b9 c1  |&P.6..Rs2...K...|
00000090  39 23 31 2f a5 88 0c 3f  01 c9 58 ae 5a 2e 00 59  |9#1/...?..X.Z..Y|
000000a0  d7 8c b2 47 01 00 00                              |...G...|
000000a7

If you can make sense of that, I congratulate you. I for sure cannot. The second more severe reason why compression is not always the answer is that it takes up valuable CPU and memory. Now if you’re a big fat web application running in the cloud and sending messages to mobile phones, this is a worthwhile trade-off. After all, you have almost limitless CPU to spare anyway. However, if you are a tiny IoT device with RAM measured in kilobytes and CPU frequencies measured in KHz, trying to send updates to the aforementioned fat cloud server, encoding efficiency is much more important than decoding efficiency. If additionally your network link is bad (not unusual for IoT devices with meager GPRS connectivity) you still want to reduce message size as much as possible without resorting to expensive operations like compression.

Binary formats

Enter binary formats. As soon as we drop the requirement of human readability, we can make use of eight bits per byte as opposed to ASCII or UTF-8. The biggest downside of this is that a plain message cannot be understood by a human anymore. But since we are so used to formatters in our workflows anyway, this isn’t a big problem: now our formatter just has to translate the binary representation into something more human-friendly as well.

There are, of course, a whole lot of formats
to choose from. So why does the world need yet another one? It probably doesn’t but the exercise was fun and I truly believe nachricht’s symbol tables offer something which hasn’t been utilized a lot in the past. Before developing nachricht I spent a lot of time deep-diving into flatbuffers, capnproto, msgpack, CBOR, and RION and trying to (badly) implement them, which was a great learning experience. But be warned: data encoding is a rabbit hole that will make your family forget you exist once you have fallen into it!

Schema or no schema

A schema is an additional piece of information negotiated beforehand between sender and receiver. It basically contains information of the form “the first field of the message is a string describing the name of the cat, while the second field is an enum constant defining its species”.
Having a schema in place has the upside of not having to convey this information in the message itself, basically reducing {"name":"Sphinx","species":"FelisCatus"} to ["Sphinx",0] (if 0 is the agreed upon variant selector for “FelisCatus”). The obvious downside is that the receiver needs to have access to the schema beforehand in order to be able to understand the message. Now this can be a good thing or a bad thing depending on how you look at it. Forcing an exchange of a machine-readable schema a priori between the two sides frees you of having field lists sent per Email as Word documents containing screenshots of Excel tables between product owners and other real-life atrocities. This is why I was advocating for formats that require a schema for a long time.
However, there are two problems with this approach: first, debugging, developing and exploring foreign APIs gets a lot harder. You no longer can just make a request to the other team’s service, look at the answer and figure out what’s wrong with your code: if you’ve been given the wrong schema you’re simply out of luck. And if you don’t have a schema at all you cannot do anything until the other team’s product owner returns from their vacation. Which brings us to the second problem of schema evolution: in our agility-centered industry, requirements change on a daily basis and the software has to follow suit. Thus schemas will evolve over time. If for example you deprecate a field, there still needs to be at least some remnant of that in every future message because an old, unupdated reader might not know about the deprecation. There are, of course, mitigations to this in modern formats: flatbuffers uses vtables so only the vtable slot needs to be allocated in the message, the field itself can be omitted. Capnproto on the other hand xors all fields with their default, effectively creating only zeroes for values that equal their default and then uses an optional run length packing of zero bytes to reduce their size as well. However, in a self-describing format like JSON you just omit
the deprecated value entirely – assuming all receivers are prepared to deal with absent fields, which they should anyway whereever reasonably possible. These points – schema evolution and developer experience – have nudged me to prefer self-describing formats nowadays, at least for agile projects. Schemas can still be useful though, for contract negotiation and to avoid the dreaded field-list-email, which leads to my new recommendation of using a self-describing format and a schema to go along with it.

Zero-copy formats

Traditional data interchange formats usually require some memory allocations when decoding a message. Casting a JSON array into a Java ArrayList requires allocating said list on the heap. This uses up precious memory and even more precious time, which prompted the development of so-called zero-copy formats. These formats encode pointers into the message which can be followed to find the actual data in the correct layout, almost as if the message was still a part of standard process memory. The difference being in this case that these pointers are relative, either to themselves or the head of the message. This allows the receiver to store the payload anywhere in memory and still just to follow the pointers to get access to the data. After the receiver is done with the message, they can just drop the corresponding buffer without having to copy anything in the meantime, very much like arena allocation (which this basically is).
Capnproto even went as far as to advertise an “infinitely faster” decoding step, since there is no decoding step at all!
While this sounds great on paper, there are a couple of consequences that should not be overlooked. First, if the receiver needs to hold on to parts of the message after reading it, a copy still needs to be made – hopefully directly from the buffer to the destination, saving the intermediate instance. And then there are the pointers themselves: they need to be computed by the sending side and they take up valuable space within the message itself. In conclusion, zero-copy formats are specially suited for cases where messages get written once and then read often. Bonus points if the decoding side is only interested in a select few fields within very big messages as the pointers allow you to jump directly to the data of interest without having to skim through all the fields lying before it. You also do not want your network connection to the be the bottleneck, as the message will be a bit bigger than if there had been no pointers. Lastly, your whole system should not be bottlenecked by CPU on the sending side, which could be the case if an IoT sensor encodes measurements for a cloud- or home-automation-server. Nowadays developers have incredible amounts of computing power at their fingertips so it is easy to forget that there are still devices in existence with constrained resources. Zero-copy might be the right thing for you or not, depending on whether the senders or receivers possess more computing power.

The design of nachricht

So after having learned so much about data encoding, what design decisions did I make and might nachricht be the right thing for you? The main idea is to create a form of binary JSON, i.e. have a couple of basic types like string, integer, boolean, etc. and encode them efficiently using one header byte. To the best of my knowledge, msgpack pioneered this approach while CBOR tidied it up a bit. First, we split the header byte into a three-bit code and a five-bit payload part. Having assigned three bits for a code allows for eight codes which are used to encode nachricht’s basic types as following:

Code Meaning Encoded types
0 Binary Null, Bool, float32, float64, byte arrays
1 Positive number integer
2 Negative number integer
3 Container arrays, objects
4 string UTF-8 strings
5 symbol UTF-8 strings that get inserted into the symbol table
6 key field names
7 reference keys and symbols

The five bit payload then either contains a value or describes how many of the following bytes contain the value. Integers with an absolute value of 23 or less are directly encoded within the header byte. Strings of length 23 or less directly follow their header byte and so on. Since a five bit payload allows for a maximum of 32 different values, we are left with eight additional values which determine the number of bytes used to encode the value, should it exceed 23.
For instance, strings of lengths between 24 and 255 encode in their header byte that one following byte contains the actual length of the string which is then, again, followed by the actual value bytes. For strings of lengths between 256 and 65335 there will be two bytes encoding the length and so on. This is all pretty much industry standard but the killer feature is the possibility to reuse values using an internal symbol table. Consider the above example again, this time in nachricht:

00000000  62 c7 76 65 72 73 69 6f  6e 21 c4 63 61 74 73 64  |b.version!.catsd|
00000010  62 c4 6e 61 6d 65 87 4a  65 73 73 69 63 61 c7 73  |b.name.Jessica.s|
00000020  70 65 63 69 65 73 b6 50  72 69 6f 6e 61 69 6c 75  |pecies.Prionailu|
00000030  72 75 73 56 69 76 65 72  72 69 6e 75 73 62 e2 86  |rusViverrinusb..|
00000040  57 61 6e 74 61 6e e3 a8  4c 79 6e 78 4c 79 6e 78  |Wantan..LynxLynx|
00000050  62 e2 86 53 70 68 69 6e  78 e3 aa 46 65 6c 69 73  |b..Sphinx..Felis|
00000060  43 61 74 75 73 62 e2 87  43 68 61 6e 64 72 61 e3  |Catusb..Chandra.|
00000070  e4                                                |.|

Already at the first glance this is significantly shorter. Also, no string is repeated thanks to the magic of the symbol table. nachricht also has a human- friendly textual representation which isn’t transmitted over the wire but is useful
for developers to make sense of intercepted data for debugging purposes; here it is:

(
  version = 1,
  cats = (
   (
     name = "Jessica",
     species = #PrionailurusViverrinus,
   ),
   (
     name = "Wantan",
     species = #LynxLynx,
   ),
   (
     name = "Sphinx",
     species = #FelisCatus,
   ),
   (
     name = "Chandra",
     species = #PrionailurusViverrinus,
   ),
 ),
)

The complete symbol table for this nachricht looks like this:

Index Type Value
0 key version
1 key cats
2 key name
3 key species
4 symbol PrionailurusViverrinus
5 symbol LynxLynx
6 symbol FelisCatus

As you can see, every key and symbol gets inserted at the next index. Also, en- and decoders need to track the actual type (key or symbol) of insertions as there is only one code for references which cannot distinguish between the two on their own. Every time an encoder encounters a key or symbol which is already present in the symbol table, it only encodes a reference to it, which in many cases (for indices less than 24) only takes up one single byte.

We only use one type of parenthesis (), because nachricht only knows one container type: arrays contain plain values whereas maps are containers with named values. In nachricht, every value can have a name, even the toplevel one: key = "value" is a legit nachricht message, whereas "key":"value" is invalid JSON (you need to enclose it into a map with {} first). Thus in the above example, cats behaves like an array in JSON as it only contains unnamed values, while the elements of that array resemble objects as they contain only named values. Also, there is this curious # before our species names. This is because they are considered symbols, not strings. They behave exactly the same semantically
but are encoded as a different type. In fact, a client application doesn’t even need to know whether a value came as a string or a symbol, although it might want to anyway. Symbols are meant to encode enum constants or atoms and because
their presence in the symbol table makes repetition cheap. For instance in our example, only the first
#PrionailurusViverrinus is encoded in its stringy form. The second is just a single header byte saying “this is a reference to the fifth entry (at index four) in the symbol table” and nothing else, saving significant space.

Conclusion and outlook

nachricht is a space-efficient schemaless or self-describing format best suited for cases where encoders do not have significant CPU advantages over decoders, developing in an agile way and where decoders are usually interested in the whole content of a message, not just parts of it. In rare cases where a rather big chunk of data is relevant only to a portion of decoders – think for instance of a header and payload part where routing services only need access to the header and business services to the whole message – the payload part can be encoded as a separate nachricht and then written into the envelope as a single byte string. If this sounds like you, maybe give nachricht a spin! At the moment, only a Rust implementation exists (and makes great use of the power of serde, allowing the zero-copy decoding of text strings into &str and byte strings into &[u8] slices) but Java support is on my 2021 to-do list.