RFC: Making content expressions a regular language

(This builds on the RFC to simplify content expressions)

I’ve had to explain several times that, though content expressions look like regular expressions, they aren’t – you can’t nest them, since they must be a flat sequence of node type sets with an optional repeat operator applied to them, with adjacent sets not overlapping.

The reason for these constraints is that it has to be cheap to match these, and to “make up” a piece of content that fills up the gap between existing fragments in a way that makes the content valid. Especially the latter is hard to do for arbitrary expressions.

But it recently occurred to me that if we make this a regular language and compile the expression down to a deterministic finite automaton (as explained here), matching can still be very fast, and filling-in becomes at least doable — it’d still be a branching search problem, rather than the linear one it currently is, but since I expect most content expressions to remain quite simple and linear, you’d only be paying for that if you use it, and having the automaton as a data structure makes the work that has to be done during this searching relatively cheap.

So in the interest of predictability and power, I’m thinking it might be a good idea to extend what is allowed in a content expression. The downside would be that the library would have to include the complexity that this involves. I managed to write a DSA builder in about 100 lines, but that excludes the additional complexity in the parser (which would have to become recursive). On the other hand, it would probably make some of the other code in model/src/content.js simpler, so the total cost wouldn’t be huge.

This change could be backwards-compatible, at least as far as schema definitions are concerned, since the new system would accept strictly more expressions than the old system did.

Does anyone have an opinion on whether this would be a good idea?

I like that idea very much ! Maybe then, use cases as this would be possible.

What kinds of expressions would be allowed in the new content expressions superset, which aren’t allowed currently?

But it recently occurred to me that if we make this a regular language and compile the expression down to a deterministic finite automaton

Would the “parser” still use regexes under the hood (as it does today), or something else? Reading that Russ Cox post has made me more sensitive to regex subtleties :slight_smile:

It’d be a nestable expression language allowing sequence, choice, and repeat operators. So you could say “paragraph+ (figure paragraph+)* figure?”, say, to enforce that there have to be paragraphs between figures.

Probably, yes, that’s the easiest way to tokenize in JavaScript.

How complex do you think the interface between the content validator, content generator, and rest of ProseMirror be? Do you think it’d be viable to have a fairly simple abstraction API, something like:

function isValid(node: Node): boolean;
function fill(node: Node, pos: ResolvedPos): Node;

What do you imagine the operations it needs to perform are?

I expect that removing the need to know about attributes (via the other RFC) will definitely introduce some opportunities for simplification, but nothing huge – the overall structure will remain similar.

I’ve also been considering whether this could be pluggable – if the interface is simple enough, people could provide their own content-validation implementations in cases where the default isn’t powerful enough. I could probably define a smaller minimal interface that an implementation has to implement, but some of the extra methods provide useful optimizations (for example, to get a match after a known-to-match fragment, if the content expression has only one state, we don’t have to play through it, we can just immediately return that state).

Something like the current ContentMatch class – a value encapsulating a match state – probably remains necessary. The interface you propose doesn’t really express what needs to be expressed – adding a node to match creates a new match state (or a failure to match), rather than a boolean, and filling works in terms of fragments, not positions in a document.

1 Like

I’ve pushed a number of patches that implement this, and I’m quite happy with the result. src/content.js became 100 lines smaller, rather than bigger (to be fair, that also had to do with removing the attribute constraints), and the overall system feels cleaner now.

It did lead to some breaking changes, though apart from the schema definition changes (can’t use attribute constraints anymore, have to specify mark sets in a separate property), they are mostly to rather obscure methods that you’re probably not using.