The TOML ecosystem and its issues

Daniil Baturin, Tomsk, Russian Federation

LVEE 2021 mini

TOML (Tom's Obvious, Minimal Language) was created with the goal of giving people a configuration file format that is fully specified, easy to implement, and user-friendly. In this talk, we'll examine whether it actually succeeded at that.

TOML is certainly better suited for configuration files than many alternatives: it’s easier to read than JSON, easier to write than XML, and less complicated and problematic than YAML. Thanks to an ecosystem of libraries for multiple languages, it’s also easy to start using, so for most projects, it’s a good configuration file format.

On the surface, it’s a format with a detailed specification and a thriving ecosystem. In reality, the situation is more complicated.

When I set out to write a TOML 1.0.0-compliant library for OCaml (named OTOML), I had to study the specification in depth and compare multiple libraries to see how they behave in practice. That process brought many unpleasant surprises.

Specification issues

First, the specification itself is full of surprises. It is neither minimal nor really obvious. For example, newlines and trailing commas are allowed in arrays, but not in inline tables.

There are also many syntax decisions and feature interactions that make parsing the format much harder than it seems at a glance. For example, arrays of tables use [[name]] syntax for their headers, but the specification does not say whether [ [ name ] ] is also valid syntax or [[ should be treated as a single token. The issue is moot though, because foo = [[]] is definitely a valid nested array and context tracking is required to tell it from a table array header.

The lack of restrictions on key syntax is another example. Dotted keys that look exactly like floats are valid: 2.1 = 1.2. Likewise, the special floating-point values nan and inf are not reserved keywords and can be used as keys, as in inf = nan, and the same applies to boolean values: true = false is valid TOML.

These facts make the language context-sensitive over the set of characters, which prevents parsing it directly with tools designed for context-free languages. OTOML solves that problem by tracking the context in the lexer and feeding the Menhir LR parser a set of abstract tokens that make it look context-free.
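The general idea of a context-tracking lexer can be sketched in a few lines. The code below is not OTOML's implementation: it is a hypothetical Python illustration that handles only single-line pairs with bare keys and bare scalar values, and the token names (KEY, DOT, EQUALS, BOOL, FLOAT, INTEGER) are invented for this example. The lexer keeps one bit of state, whether it has passed the equals sign, and uses it to decide how to classify the same characters.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (abstract token kind, source text)

# In key context, any run of bare-key characters is a KEY, whatever it looks like.
KEY_PART = re.compile(r"[A-Za-z0-9_-]+")

# In value context, the same characters are classified by their meaning.
VALUE = re.compile(
    r"(?P<BOOL>true|false)"
    r"|(?P<FLOAT>[+-]?(?:inf|nan|\d+\.\d+))"
    r"|(?P<INTEGER>[+-]?\d+)"
)


def lex_line(line: str) -> List[Token]:
    """Tokenize one 'key = value' line, tracking key/value context."""
    tokens: List[Token] = []
    in_value = False  # lexer state: False = key context, True = value context
    pos = 0
    while pos < len(line):
        ch = line[pos]
        if ch in " \t":
            pos += 1
        elif ch == "=" and not in_value:
            tokens.append(("EQUALS", ch))
            in_value = True  # everything after "=" is lexed as a value
            pos += 1
        elif ch == "." and not in_value:
            tokens.append(("DOT", ch))
            pos += 1
        else:
            match = (VALUE if in_value else KEY_PART).match(line, pos)
            if match is None:
                raise ValueError(f"unexpected character {ch!r} at column {pos}")
            kind = match.lastgroup if in_value else "KEY"
            tokens.append((kind, match.group()))
            pos = match.end()
    return tokens


for line in ("2.1 = 1.2", "inf = nan", "true = false"):
    print(lex_line(line))
```

With the context resolved in the lexer, a grammar over these abstract tokens no longer needs to know that a KEY and a FLOAT can consist of the same characters.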

Many libraries are non-compliant

The hidden complexity of the specification leads to many non-compliant libraries. Those incompatibilities can persist for a long time because most users only use a small (indeed, minimal and obvious) subset of the language.

The TOML project wiki features a list of implementations, but standard compliance is self-reported, and being listed as “v. X.Y compliant” does not guarantee anything.

An additional obstacle to interoperability between libraries is the slow uptake of new versions. Implementations that report 1.0.0 compliance are a minority, some programming languages still don’t have a 1.0.0-compliant implementation available, and many libraries appear never to have been updated beyond 0.5.0, 0.4.0, or even earlier specification versions.

Testing is hard

There is no official TOML test suite. The de facto standard test suite developed by community members strives to close that gap, but testing existing libraries with it is a non-trivial problem.

The main issue is that most libraries were designed to map TOML types directly to native types of the implementation language, and that conversion is not always lossless. For a simple example, TOML has distinct integer and float types, while some programming languages (like Lua, Perl, or JavaScript) don’t, and TOML libraries for them simply map both integer and float values to the native numeric type. The test suite expects the implementation under test to report both the value and the original type, so such libraries would need to be modified to expose the AST to the test suite interface.
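A type-tagged output of this kind might look like the following sketch. The tag names and the JSON layout used here are assumptions made for illustration and should be checked against the test suite's own documentation; the function name tag is hypothetical, and date and time types are omitted. Python is used because it keeps integers and floats apart, so the original TOML type can be recovered from a decoded value.

```python
import json
from typing import Any


def tag(value: Any) -> Any:
    """Convert a decoded TOML value to a type-tagged, JSON-serializable form."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return {"type": "bool", "value": "true" if value else "false"}
    if isinstance(value, int):
        return {"type": "integer", "value": str(value)}
    if isinstance(value, float):
        return {"type": "float", "value": repr(value)}
    if isinstance(value, str):
        return {"type": "string", "value": value}
    if isinstance(value, list):
        return [tag(item) for item in value]
    if isinstance(value, dict):
        return {key: tag(item) for key, item in value.items()}
    raise TypeError(f"unsupported type: {type(value).__name__}")


# The integer 8080 and the float 1.0 keep their distinct TOML types.
print(json.dumps(tag({"port": 8080, "ratio": 1.0})))
```

In a language with a single numeric type, 1 and 1.0 decode to the same native value, so a function like this cannot be written on top of the decoded data, and the library has to expose the type information itself.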

Thus, there is no easy way to build a true picture of library compliance statistics.

Conclusion

While TOML succeeded at giving users a friendly configuration language, its specification and ecosystem are not without problems, and there are many lessons that future format designers and implementers can learn from it.

In particular, a comprehensive test suite should be developed in lock-step with the specification to avoid later problems with applying it to existing code. In addition, the language itself should be carefully designed to account for unintended edge cases, feature interactions, and unfortunate implications for the language complexity class.

Abstract licensed under Creative Commons Attribution-ShareAlike 3.0 license