Design for constrained node content

marijn · May 3, 2016, 7:42am

I’ve been working on issue #220 these past weeks, which aims to provide a way to specify what is valid content for a given node at a finer granularity than the current system of ‘kinds’. So I thought I’d write something on what I’m doing, and ask for feedback.

My current approach is a regex- or CFG-like notation, where nodes specify strings like

  "inline*"
  "heading paragraph+"
  "(image | blockquote){1, 4}"

The expression "foo bar" means ‘first a foo node and then a bar node’. "foo*" means zero or more foo nodes, "foo+" one or more, "foo?" zero or one, and the braces work much as they do in regexps.

The words in these expressions can either directly refer to node type names, or to separately declared groups of nodes. So you could have a group block which contains, for example, paragraphs, blockquotes, and lists, and then refer to those with a single word. Note that the idea of sub- and super- kinds is gone – groups are simple flat collections of node types. You can use an or expression – (group1 | group2) – if you have a place where multiple groups can appear.

To make it possible (and efficient) to reason about these, content expressions must be flat (the only thing you parenthesize is the | operator, which only takes plain words as operands). So "((foo | bar+)* | baz{2})" is not allowed.
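A minimal sketch of how such a flat expression could be parsed, covering only the operators described above. The function names and the {names, min, max} result shape are hypothetical, chosen for illustration, and reading "{n,}" as ‘n or more’ is an assumption borrowed from regexps. This is not ProseMirror's actual parser.

```javascript
// Hypothetical helper, not ProseMirror's own parser. It accepts only flat
// expressions: plain words, or a parenthesized list of words joined by "|",
// each optionally followed by *, +, ?, or a {n}, {n,} or {n, m} count.
const ELEMENT = /\s*(?:\(([^()]*)\)|(\w+))\s*(\*|\+|\?|\{\s*\d+\s*(?:,\s*\d*\s*)?\})?\s*/y

// Translate a quantifier string into a {min, max} repeat count.
function parseCount(quantifier) {
  if (!quantifier) return {min: 1, max: 1}
  if (quantifier === "*") return {min: 0, max: Infinity}
  if (quantifier === "+") return {min: 1, max: Infinity}
  if (quantifier === "?") return {min: 0, max: 1}
  const [, low, comma, high] = /\{\s*(\d+)\s*(,)?\s*(\d*)\s*\}/.exec(quantifier)
  const min = Number(low)
  return {min, max: comma ? (high ? Number(high) : Infinity) : min}
}

// Parse an expression into a list of {names, min, max} elements.
// Throws a SyntaxError for anything that is not flat.
function parseContentExpr(expr) {
  const elements = []
  let pos = 0
  while (pos < expr.length) {
    ELEMENT.lastIndex = pos
    const match = ELEMENT.exec(expr)
    if (!match) throw new SyntaxError(`Not a flat content expression: "${expr}"`)
    const raw = match[1] !== undefined ? match[1].split("|") : [match[2]]
    const names = raw.map(name => name.trim())
    // The operands of | must be plain words, never nested expressions.
    if (!names.every(name => /^\w+$/.test(name)))
      throw new SyntaxError(`Only plain words may be combined with |: "${expr}"`)
    elements.push({names, ...parseCount(match[3])})
    pos = ELEMENT.lastIndex
  }
  return elements
}
```

With this sketch, parseContentExpr("heading paragraph+") yields two elements, while the nested example "((foo | bar+)* | baz{2})" is rejected with a SyntaxError.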

Ambiguous matches are also not allowed – adjacent subexpressions must not overlap, so that there’s always a single way to match a sequence of nodes. An example of an expression that violates this would be "paragraph* block+", where it is unclear whether a starting paragraph must be matched to the paragraph or to the block part (and backtracking would be required to find matches).
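To illustrate why the no-overlap rule matters, here is a rough sketch of an overlap check and of the single-pass, no-backtracking matching that the rule makes possible. The data shapes (a groups object and a list of {names, min, max} elements) and the function names are assumptions made for this example, and the check is one conservative reading of ‘adjacent subexpressions must not overlap’, not necessarily the rule ProseMirror implements.

```javascript
// Hypothetical group table: a flat mapping from group name to node types.
const groups = {block: ["paragraph", "blockquote", "bullet_list"]}

// Expand a list of words (type names or group names) into a set of types.
function expand(names, groups) {
  const types = new Set()
  for (const name of names) {
    const isGroup = Object.prototype.hasOwnProperty.call(groups, name)
    for (const type of isGroup ? groups[name] : [name]) types.add(type)
  }
  return types
}

// A variable-length element is ambiguous if a node it accepts could also be
// claimed by an element that may directly follow it.
function isAmbiguous(elements, groups) {
  for (let i = 0; i < elements.length; i++) {
    if (elements[i].max === elements[i].min) continue // fixed count: no choice
    const here = expand(elements[i].names, groups)
    for (let j = i + 1; j < elements.length; j++) {
      for (const type of expand(elements[j].names, groups))
        if (here.has(type)) return true
      if (elements[j].min > 0) break // a required element ends the lookahead
    }
  }
  return false
}

// Greedy matching without backtracking; only meaningful for expressions
// that passed the ambiguity check above.
function matches(elements, groups, nodeTypes) {
  let i = 0
  for (const {names, min, max} of elements) {
    const allowed = expand(names, groups)
    let count = 0
    while (count < max && i < nodeTypes.length && allowed.has(nodeTypes[i])) {
      count++
      i++
    }
    if (count < min) return false
  }
  return i === nodeTypes.length
}

// "paragraph* block+" overlaps, because block contains paragraph.
const overlapping = [
  {names: ["paragraph"], min: 0, max: Infinity},
  {names: ["block"], min: 1, max: Infinity}
]
// "heading paragraph+" does not overlap.
const headingThenParagraphs = [
  {names: ["heading"], min: 1, max: 1},
  {names: ["paragraph"], min: 1, max: Infinity}
]
```

Here isAmbiguous(overlapping, groups) reports the problem described above, while headingThenParagraphs can be matched in a single left-to-right pass.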

This seems like an easy-to-learn notation that is still powerful enough to express more complicated schemas. Do reply to this thread if you see a problem, or have a type of node that couldn’t be expressed in this way.

The main difficulty in implementing this has been the way it introduces failure cases in many spots that didn’t have them before – splitting, joining, or partially replacing a node can now result in invalid content, so we have to do more checking before trying to perform operations, and introduce some degree of automatic structure fixing in order to make, for example, cutting and pasting with constrained content possible.

My current approach is to make split and join fail hard when they would create an invalid node, and require checking in advance there, whereas replace (which is already a somewhat magic process) does its best to create the extra nodes needed to make the content fit together, and soft-fails (discards some content) when it can’t create valid content.

One idea would be to require that every node type that the schema requires to appear at least once in a parent node be ‘synthesizable’. That means all of its attributes have a default value and all of its required content is also synthesizable, so that we can create such nodes when they are needed to produce fitting content. This would automatically solve some problematic failure cases, such as when the replace algorithm, after having decided on a suitable parent node for some content, finds out that it’s missing some required node at the end of the content. Due to the way the replace algorithm works, this is a rather painful case, and there are a bunch of other situations where this property would be nice, such as being able to provide a ‘force split’ operation that always succeeds.

For many node types, providing default attributes is easy. For some, such as images, it would require introducing the concept of a placeholder image. Does this restriction (all required nodes must be synthesizable) sound reasonable? It wouldn’t have much impact on the default schema, where only list items and blocks (defaulting to paragraph) are ever required, and inline content is always optional.
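The ‘synthesizable’ property can be sketched as a recursive check. The type table below (attrs with optional default values, plus a required list of child types) is a hypothetical shape invented for this example, not ProseMirror's schema format, and treating a cycle of required types as non-synthesizable is an assumption.

```javascript
// Hypothetical description of node types: which attributes they have and
// which child types they require at least once.
const types = {
  paragraph: {attrs: {}, required: []},
  list_item: {attrs: {}, required: ["paragraph"]},
  bullet_list: {attrs: {}, required: ["list_item"]},
  image: {attrs: {src: {}}, required: []}, // src has no default value
  placeholder_image: {attrs: {src: {default: "placeholder.png"}}, required: []}
}

// A type is synthesizable when every attribute has a default and every
// required child type is itself synthesizable.
function isSynthesizable(name, types, visiting = new Set()) {
  if (visiting.has(name)) return false // required cycle: no finite node exists
  const type = types[name]
  if (!type) return false
  const attrs = type.attrs || {}
  for (const attr of Object.keys(attrs))
    if (!("default" in attrs[attr])) return false
  visiting.add(name)
  const ok = (type.required || []).every(child => isSynthesizable(child, types, visiting))
  visiting.delete(name)
  return ok
}
```

In this table, bullet_list is synthesizable (a list item holding an empty paragraph can be generated), image is not, and placeholder_image is, because its only attribute has a default.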

johanneswilm · May 3, 2016, 9:24am

I think we were among those asking for this feature initially. Since then we have been able to cover our main use case by adding a filter to transforms. However, for new features this could possibly still be something useful, also for us. For that we will likely only use the simplest vocabulary, and restrictions will be limited to things like “only allow keyword nodes inside this block node” or “only allow 0-3 child nodes”. Adding defaults to all attributes should not be a problem, but hopefully the real-world usage of auto-generated nodes like that will be minimal. I have a hard time thinking of a situation where it would feel “natural” for the editor to suddenly add a default image in the content body.

johanneswilm · May 3, 2016, 9:11pm

@marijn Btw, will this be important for how you plan on supporting tables?

marijn · May 4, 2016, 7:39am

Yes, it’s one of the pieces that’s required for tables.

marijn · May 4, 2016, 10:07am

One situation would be if you have a figure node which requires caption and figure_image children, and you select from inside the caption to below the image and press delete. We could delete the whole figure node, but it seems nicer to leave the part of the caption that was not selected intact, and insert a dummy image to keep the figure node valid.

johanneswilm · May 4, 2016, 10:11am

I see. Yes, that would work. Deleting the entire figure or only deleting the selected part of the caption would be ok as well.

kiejo · May 4, 2016, 1:00pm

I like the proposed approach and have a few questions:

I assume that the constraints will be enforced on every step. I think there are cases where being able to explicitly skip a constraint check would be helpful. One example would be that I want to define a constraint which disallows a list_item to contain a list as its first child to prevent a structure like this:

I would use something like "paragraph (bullet_list | ordered_list)?" to enforce this. The problem is that the list_item:lift and list_item:sink commands sometimes need to create a sequence of steps which use a nesting like this as intermediary steps. Being able to skip the constraint check for these steps would allow us to still perform these normally “disallowed” steps.

What would be the best way to define a rule which ensures that headings can only appear at the top level? Is it to explicitly not include heading as a possible child for all block nodes that can contain children (except the root doc type of course)? My feeling is that whitelisting allowed nodes will be the way to go with this system, so this might be a straightforward way to define this rule.
Alternatively, being able to blacklist nodes using, for example, "!heading" could be helpful, but I’m not sure if this is necessary if explicit whitelisting is the preferred way in general.

This does not seem to be a problem for the use cases I have thought about so far.

marijn · May 4, 2016, 1:17pm

Yes, and I’m making some changes to the ways some commands work to make sure this is possible. As you observed, the current split-and-then-ancestor-and-then-join approach to moving nodes around will cause problems if you have constraints on the involved nodes. So in my branch I have a ‘shift’ step, which moves content around by inserting and removing opening and closing ‘tokens’ before and after it. That allows Transform.lift, Transform.wrap, and list_item:lift/sink to be expressed as a single step. (I’m currently working on solving a similar problem in the way Transform.replace will move text directly after a replace towards the end of the replaced range.)

Definitely whitelisting. Don’t make heading part of your block group, and explicitly allow (block | heading) in places where headings are allowed.
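As a rough illustration of the whitelisting approach, the fragment below keeps heading out of the block group and names it explicitly only in the doc node. The object layout (nodes with a content string, plus flat groups) and the allowedChildren helper are hypothetical and only handle the plain notation from the first post.

```javascript
// Hypothetical schema-like fragment: heading is not part of the block group.
const spec = {
  nodes: {
    doc: {content: "(block | heading)+"}, // headings allowed at the top level
    blockquote: {content: "block+"},      // but not inside a blockquote
    heading: {content: "inline*"},
    paragraph: {content: "inline*"},
    text: {}
  },
  groups: {
    block: ["paragraph", "blockquote"],
    inline: ["text"]
  }
}

// Collect the node types that a node's content expression mentions,
// expanding group names into their members.
function allowedChildren(spec, nodeName) {
  const words = (spec.nodes[nodeName].content || "").match(/[A-Za-z]\w*/g) || []
  const allowed = new Set()
  for (const word of words) {
    const isGroup = Object.prototype.hasOwnProperty.call(spec.groups, word)
    for (const type of isGroup ? spec.groups[word] : [word]) allowed.add(type)
  }
  return allowed
}
```

With this fragment, allowedChildren(spec, "doc") contains heading, while allowedChildren(spec, "blockquote") does not.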

marijn · May 11, 2016, 9:05pm

I’m landing this on the master branch now. Here’s what you need to know to port your code:

Schema definition is done differently now. The SchemaSpec class is gone, and you now pass a plain object to the Schema constructor, something like this:

const mySchema = new Schema({
  nodes: { // Node types in the schema
    doc: {type: Doc, content: "block+"},
    paragraph: {type: Paragraph, content: "inline[_]*"},
    text: {type: Text},
    /* ... and so on */
  },
  groups: { // Groups referred to in content expressions
    block: ["paragraph", /* ... */],
    inline: ["text", /* ... */]
  },
  marks: {
    em: EmMark
    /* ... */
  }
})

(Content expressions are now roughly documented in a new guide.)

A bunch of things that were previously properties on node type classes are now expressed in these expressions in the schema definition. Nodes no longer have kind, contains, canBeEmpty, and containsMarks properties. The various canContain... predicates are replaced by a few finer-grained methods on nodes: canReplace, canAppend, and a contentMatchAt method for a lower-level interface for reasoning about its content. (But you’ll probably only need those when writing generic commands.)

(Moving some things off node type classes and into the schema definition is the first step in an effort to make those less magical. There’ll be more changes like that after 0.7.0.)

As a general rule, you now have to be more careful when modifying the document, since the more powerful constraints are also easier to violate. Whereas splitting nodes, for example, was almost always possible in the old model, that is no longer the case, and a new predicate canSplit was introduced in the transform package, allowing you to check in advance whether a split is safe.

Replace transformations have become more clever (this was the biggest challenge in implementing all this), and will preserve content constraints when necessary by inserting extra nodes on the edges of the replaced content. You should be able to use them without worrying too much about their inner working, and just rely on the fact that they’ll give you a valid document with the given content replacing the old range. (Of course, when you give them content that trivially fits, no magic will happen, and the content will be inserted exactly as expected.)

Another change that landed is that steps are represented differently. Again, you probably only need this for specialized code. Instead of all steps having the same fields, they are now classes in control of their own serialization, deserialization, and mapping. The number of different step types has been reduced to 4, and changes that were previously expressed as an awkward series of split, join, and ancestor steps can now be done in a single ReplaceWrapStep, which replaces pieces of the document on both sides of a piece of content, allowing ‘motion’ of content between adjacent nodes, and wrapping/unwrapping of content, in a single step. This is important because the previous approach of using intermediate steps was likely to temporarily violate content constraints, making the steps impossible even though the end result was valid.

Let me know how this works for you. I plan to give you a few days to spot the most horrible problems, after which I’ll release 0.7.0.

sacha · March 10, 2017, 11:11pm

I’ve added rules in the content field, but it seems to have no effect on the editor behavior. When are the rules defined in content supposed to be enforced?

An example of what I’ve tried:

mySchema.nodes.doc.content = 'heading[level=1] paragraph*'

marijn · March 12, 2017, 8:11pm

You aren’t allowed/expected to mutate an existing schema. Create a new one with the proper content property in your doc node’s spec.
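One way to do that without touching the existing objects is to copy the node specs and override only the entry that changes. The withContent helper and the baseNodes object below are hypothetical stand-ins; in real code the resulting object would be handed to the Schema constructor together with the marks.

```javascript
// Hypothetical stand-in for the node specs an existing schema was built from.
const baseNodes = {
  doc: {content: "block+"},
  paragraph: {content: "inline*", group: "block"},
  heading: {content: "inline*", group: "block"},
  text: {group: "inline"}
}

// Return a new specs object in which one node has a different content
// expression. Neither the original object nor its entries are modified.
function withContent(nodes, name, content) {
  return {...nodes, [name]: {...nodes[name], content}}
}

const myNodes = withContent(baseNodes, "doc", "heading[level=1] paragraph*")
// Assumed next step: build a fresh schema from myNodes instead of
// assigning to mySchema.nodes.doc.content.
```

Because the copy is shallow, untouched entries such as paragraph are shared between the old and new objects, which is fine as long as nothing mutates them.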

sacha · March 13, 2017, 12:22am

Thanks a lot, that did it! I have to say it works amazingly well.

Now I would like to add a new node type to serve for rule definition only:

const myNodes = {
  doc: {
    content: "subsec+"
  },
  subsec: {
    content: 'heading block*'
  },
  ...
}

This new node type shouldn’t have any existence in the DOM. But ProseMirror complains that it has no toDOM() method. I have tried to provide a minimal toDOM method, without success.

What would be a minimal node type implementation?

marijn · March 13, 2017, 9:02am

That doesn’t work. Nodes must be able to be represented in (and parsed from) the DOM.

sacha · March 13, 2017, 9:15am

Ok, thanks, I will experiment with this.

I don’t know if it is related, but something strange happened in the past hours with the GitHub codebase: this file contains the following broken code:

const nodes = {
  ...
  text: {
    group: "inline"
  },
  ...
}

But when using const { nodes } = require("prosemirror-schema-basic"), I got this correct code:

const nodes = {
  ...
  text: {
    group: "inline",
    toDOM(node) { return node.text }
  },
  ...
}

marijn · March 13, 2017, 10:01am

That code isn’t broken, it just doesn’t work with the current release (and it, itself, isn’t released yet).