I’ve been working on issue #220 these past weeks, which aims to provide a way to specify what is valid content for a given node at a finer granularity than the current system of ‘kinds’. So I thought I’d write something on what I’m doing, and ask for feedback.
My current approach is a regex- or CFG-like notation, where nodes specify strings like
"inline*"
"heading paragraph+"
"(image | blockquote){1, 4}"
The expression "foo bar" means ‘first a foo node and then a bar node’. "foo*" means zero or more foo nodes, "foo+" one or more, "foo?" zero or one, and the braces work much the way they do in regexps (so "{1, 4}" means between one and four).
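To make the notation concrete, here is a minimal sketch of a parser for such strings. It is only an illustration under my own assumptions, not the actual implementation: the names `Element`, `ELEMENT`, `parse_quantifier` and `parse_content_expr` are hypothetical, names are assumed to be plain `\w+` words, and the quantifier is assumed to follow its operand without whitespace.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Element:
    names: tuple                # one word, or the operands of a parenthesized |
    min_count: int
    max_count: Optional[int]    # None means "no upper bound"


# Assumed grammar: a word or "(a | b | ...)", then an optional quantifier.
ELEMENT = re.compile(
    r"\s*(?:(\w+)|\(\s*(\w+(?:\s*\|\s*\w+)*)\s*\))"   # word, or flat alternatives
    r"(\*|\+|\?|\{\s*\d+\s*(?:,\s*\d*\s*)?\})?"        # *, +, ?, {n}, {n,}, {n, m}
)


def parse_quantifier(q):
    """Translate a quantifier string into a (min, max) pair."""
    if q is None:
        return 1, 1
    if q == "*":
        return 0, None
    if q == "+":
        return 1, None
    if q == "?":
        return 0, 1
    body = q[1:-1]                      # strip the braces
    if "," not in body:
        n = int(body)
        return n, n
    lo, hi = body.split(",")
    return int(lo), (int(hi) if hi.strip() else None)


def parse_content_expr(expr):
    """Parse a content expression into a list of Element records."""
    elements = []
    pos = 0
    while expr[pos:].strip():
        m = ELEMENT.match(expr, pos)
        if not m:
            raise ValueError(f"unexpected syntax at offset {pos} in {expr!r}")
        word, alternatives = m.group(1), m.group(2)
        if word:
            names = (word,)
        else:
            names = tuple(n.strip() for n in alternatives.split("|"))
        lo, hi = parse_quantifier(m.group(3))
        elements.append(Element(names, lo, hi))
        pos = m.end()
    return elements
```

With this sketch, `parse_content_expr("(image | blockquote){1, 4}")` yields a single element with the names `image` and `blockquote` and the bounds 1 and 4, while anything the assumed grammar does not cover raises a `ValueError`.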
The words in these expressions can either directly refer to node type names, or to separately declared groups of nodes. So you could have a group block which contains, for example, paragraphs, blockquotes, and lists, and then refer to those with a single word. Note that the idea of sub- and super- kinds is gone – groups are simple flat collections of node types. You can use an or expression – (group1 | group2) – if you have a place where multiple groups can appear.
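As a rough sketch of how flat groups could be resolved (the data layout, the example node type names and the helpers `resolve` and `resolve_alternatives` are all assumptions made for illustration), each word maps either to a group's members or to a single node type, and an or expression is simply the union:

```python
# Hypothetical schema data: node type names and flat groups of them.
NODE_TYPES = {"paragraph", "blockquote", "bullet_list", "heading", "image", "text"}
GROUPS = {
    "block": {"paragraph", "blockquote", "bullet_list"},
    "inline": {"text", "image"},
}


def resolve(word, node_types=NODE_TYPES, groups=GROUPS):
    """Return the set of node types a single word stands for."""
    if word in groups:
        return frozenset(groups[word])       # a group: all of its members
    if word in node_types:
        return frozenset({word})             # a plain node type name
    raise KeyError(f"unknown node type or group: {word!r}")


def resolve_alternatives(words, node_types=NODE_TYPES, groups=GROUPS):
    """Resolve the operands of an or expression such as (group1 | group2)."""
    result = set()
    for word in words:
        result |= resolve(word, node_types, groups)
    return frozenset(result)
```

Because groups are flat, resolution never has to walk a hierarchy: one dictionary lookup per word is enough.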
To make it possible (and efficient) to reason about these, content expressions must be flat: the only thing you can parenthesize is the | operator, which only takes plain words as operands. So "((foo | bar+)* | baz{2})" is not allowed.
Ambiguous matches are also not allowed – adjacent subexpressions must not overlap, so that there’s always a single way to match a sequence of nodes. An example of an expression that violates this would be "paragraph* block+", where it is unclear whether a starting paragraph must be matched to the paragraph or to the block part (and backtracking would be required to find matches).
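The overlap rule can be sketched as follows. This is a literal reading of "adjacent subexpressions must not overlap", with elements already resolved to sets of node types; the tuple layout and the names `overlapping_pairs` and `matches_greedy` are hypothetical, and a real check may need to be stricter (for example when an optional element sits between two overlapping ones).

```python
def overlapping_pairs(elements):
    """Return (index, shared types) for each adjacent pair that overlaps.

    Each element is a (node_types, min_count, max_count) tuple, where
    max_count is None for "unbounded".
    """
    problems = []
    for i in range(len(elements) - 1):
        shared = elements[i][0] & elements[i + 1][0]
        if shared:
            problems.append((i, shared))
    return problems


def matches_greedy(elements, nodes):
    """Single-pass greedy match without backtracking.

    Sketch only: it can give wrong answers when elements overlap, which is
    the situation the overlap rule is meant to exclude.
    """
    pos = 0
    for types, lo, hi in elements:
        count = 0
        while pos < len(nodes) and nodes[pos] in types and (hi is None or count < hi):
            pos += 1
            count += 1
        if count < lo:
            return False
    return pos == len(nodes)


# Assumed group membership, following the example in the text.
BLOCK = frozenset({"paragraph", "blockquote", "bullet_list"})

# "paragraph* block+" overlaps on paragraph; "heading paragraph+" does not.
AMBIGUOUS = [(frozenset({"paragraph"}), 0, None), (BLOCK, 1, None)]
UNAMBIGUOUS = [(frozenset({"heading"}), 1, 1), (frozenset({"paragraph"}), 1, None)]
```

Here `overlapping_pairs(AMBIGUOUS)` reports the first pair, and `matches_greedy(AMBIGUOUS, ["paragraph"])` returns `False` even though treating the paragraph as a block would be a valid match. That is the kind of case that would need backtracking.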
This seems like an easy-to-learn notation that is still powerful enough to express more complicated schemas. Do reply to this thread if you see a problem, or have a type of node that couldn’t be expressed in this way.
The main difficulty in implementing this has been the way it introduces failure cases in many spots that didn’t have them before – splitting, joining, or partially replacing a node can now result in invalid content, so we have to do more checking before trying to perform operations, and introduce some degree of automatic structure fixing in order to make, for example, cutting and pasting with constrained content possible.
My current approach is to make split and join fail hard when they would create an invalid node, and require checking in advance there, whereas replace (which is already a somewhat magic process) does its best to create the extra nodes needed to make the content fit together, and soft-fails (discards some content) when it can’t create valid content.
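A minimal sketch of the "check in advance, fail hard" half of this, assuming the content rule is available as a plain predicate over a list of child type names (the names `can_split`, `split_children` and `one_or_more_paragraphs` are made up for illustration, and the replace side is not sketched here):

```python
def can_split(children, index, is_valid):
    """Check in advance whether both halves would still be valid content."""
    return is_valid(children[:index]) and is_valid(children[index:])


def split_children(children, index, is_valid):
    """Split at index, failing hard when either half would be invalid."""
    if not can_split(children, index, is_valid):
        raise ValueError(f"splitting at {index} would create invalid content")
    return children[:index], children[index:]


def one_or_more_paragraphs(nodes):
    """Stand-in validity check for a node whose content is "paragraph+"."""
    return len(nodes) >= 1 and all(n == "paragraph" for n in nodes)


children = ["paragraph", "paragraph", "paragraph"]
left, right = split_children(children, 1, one_or_more_paragraphs)
```

With "paragraph+" content, splitting in the middle works, but splitting at offset 0 or at the very end would leave one half empty, so callers are expected to ask `can_split` first.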
One idea would be to require every node type that the schema requires to appear at least once in some parent node to be ‘synthesizable’. That means that all its attributes have a default value and all its required content is also synthesizable, so that we can create such nodes when they are needed to produce fitting content. This would automatically solve some problematic failure cases, such as when the replace algorithm, after having decided on a suitable parent node for some content, finds out that it’s missing some required node at the end of the content. Due to the way the replace algorithm works, this is a rather painful case, and there are a bunch of other situations where this property would be nice, such as being able to provide a ‘force split’ operation that always succeeds.
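The definition is recursive, which a small sketch may make clearer. Everything here is an assumption for illustration: the `NodeType` record, the `REQUIRED` marker for attributes without a default, the toy schema (including a `figure` type that requires an image), and the choice to treat cyclic requirements as not synthesizable.

```python
from dataclasses import dataclass, field

REQUIRED = object()  # marker for an attribute that has no default value


@dataclass
class NodeType:
    attrs: dict = field(default_factory=dict)     # attribute name -> default or REQUIRED
    required: list = field(default_factory=list)  # node types its content must contain


SCHEMA = {
    "paragraph": NodeType(),
    "list_item": NodeType(required=["paragraph"]),
    "bullet_list": NodeType(required=["list_item"]),
    "image": NodeType(attrs={"src": REQUIRED, "alt": ""}),
    "figure": NodeType(required=["image"]),
}


def synthesizable(name, schema, _visiting=frozenset()):
    """True if the node type can be created without any outside input."""
    if name in _visiting:
        return False                     # a cycle of requirements can never be filled
    node_type = schema[name]
    if any(default is REQUIRED for default in node_type.attrs.values()):
        return False                     # some attribute lacks a default
    visiting = _visiting | {name}
    return all(synthesizable(child, schema, visiting) for child in node_type.required)


def synthesize(name, schema):
    """Build a minimal node (as a plain dict) from defaults and required content."""
    if not synthesizable(name, schema):
        raise ValueError(f"{name!r} is not synthesizable")
    node_type = schema[name]
    return {
        "type": name,
        "attrs": dict(node_type.attrs),
        "content": [synthesize(child, schema) for child in node_type.required],
    }
```

In this toy schema, `bullet_list` is synthesizable (a list holding one item holding one empty paragraph), while `image` and therefore `figure` are not, unless `src` gets a default such as a placeholder.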
For many node types, providing default attributes is easy. For some, such as images, it would require introducing the concept of a placeholder image. Does this restriction (all required nodes must be synthesizable) sound reasonable? It wouldn’t have much impact on the default schema, where only list items and blocks (defaulting to paragraph) are ever required, and inline content is always optional.