Schema for preserving unknown HTML content and structure as-is

I’m replacing our old CKE4 implementation with a new ProseMirror-based editor, and I need to avoid unintended data loss during the migration. We have existing HTML in the database that we must read, and it has to survive a round trip through the ProseMirror implementation.

The main idea is to have a catch-all node for unknown content that is atomic, saves the element’s outerHTML as an attribute, renders it through a node view, and restores it when serializing to the DOM.

That part is no issue. The issue comes when I need to split the case in two, because block and inline nodes are not allowed to be mixed in the content of a node in the schema. This leads to having both unknown_inline and unknown_block with tag: '*'. This is where I’m having trouble separating the two cases consistently without triggering the automatic node wrapping or closing the parent node.

I’ve looked into using context: 'doc/' for unknown_block to ensure it only triggers at the root level, but that fails during automatic node wrapping, because an open paragraph node may already have been generated from preceding unwrapped text. For example, I can’t get this case to parse consistently using only context: 'doc/':

<div>abc</div>
123
<div>xyz</div>

The second div is ignored because 123 opens a new paragraph for wrapping, so the second div then fails the context check.

I’ve tried using getAttrs as a filter that checks the parentElement for a class marker added to the root container passed to the DOMParser. That works fine when I control how the parsing is handled, but I don’t know how well it scales to pasting and other functionality that does its own parsing, where the marker is not guaranteed to exist.
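For reference, the getAttrs filter described above could look roughly like this. It is only a sketch: the marker class name `pm-parse-root` and the function name are made up for illustration, and the small `ElementLike` type is there only so the snippet does not depend on DOM typings.

```typescript
// Minimal structural type so the sketch does not depend on DOM typings;
// a real HTMLElement satisfies it.
interface ElementLike {
  outerHTML: string
  parentElement: { classList: { contains(token: string): boolean } } | null
}

// Hypothetical marker class added to the container passed to DOMParser.
const ROOT_MARKER = 'pm-parse-root'

// Returning false from getAttrs tells the parser that the rule does not
// match, so other rules get a chance at the element.
function unknownBlockAttrs(element: ElementLike): { html: string } | false {
  const parent = element.parentElement
  if (!parent || !parent.classList.contains(ROOT_MARKER)) return false
  return { html: element.outerHTML }
}
```

It would be plugged in as `getAttrs: unknownBlockAttrs` on the unknown_block rule. The weakness stays the same: any code path that parses without adding the marker falls through to the other rule.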

Another approach I’ve tried is tag: 'p *' as the selector for unknown_inline, but that only works if I can be sure about the ancestors. I don’t know whether content copied and pasted from outside sources might include a stray P tag at the root level, which would turn all unknown content into unknown_inline.

So which approach would best preserve the structure of unknown HTML content? Here is a basic test case for when only paragraphs and text are defined as known nodes in the schema. Everything else should be preserved as-is and parsed as unknown nodes, without changing the ancestors of the unknown content:

<div>abc</div>
123
<div>xyz</div>

<p>
  456
  <span>def</span>
</p>

<div>ghi</div>
<p>789</p>

Unless there are a lot of custom elements in your content, you could try just enumerating the block (or inline) elements defined by the HTML standard, and use tag names to distinguish them.
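A rough sketch of that enumeration idea follows. The tag list is illustrative and deliberately incomplete, and the helper names are made up; anything not listed is treated as inline here, which is an assumption rather than a rule from the HTML standard.

```typescript
// Illustrative, non-exhaustive list of elements that normally render as blocks.
const BLOCK_TAGS: readonly string[] = [
  'address', 'article', 'aside', 'blockquote', 'details', 'div', 'dl',
  'fieldset', 'figure', 'footer', 'form', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
  'header', 'hr', 'main', 'nav', 'ol', 'pre', 'section', 'table', 'ul'
]

const BLOCK_TAG_SET = new Set(BLOCK_TAGS)

// Classify a tag name; everything not listed falls back to 'inline'.
function unknownKind(tagName: string): 'block' | 'inline' {
  return BLOCK_TAG_SET.has(tagName.toLowerCase()) ? 'block' : 'inline'
}

// One parse-rule stub per block tag, to be merged with getAttrs and
// spread into the parseDOM array of the block fallback node.
const blockTagRules = BLOCK_TAGS.map((tag) => ({ tag }))
```

With rules like these on the block fallback, a lower-priority tag: '*' rule on the inline fallback would pick up the rest.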

I thought about that, but we have very old plugins, and in our case we really can’t assume correctly structured HTML, so a literal * selector is all we can go for. We even have some metadata embedded in HTML comments. Luckily, it is wrapped in an actual HTML tag, so at least we don’t have to worry about parsing comment nodes.

In essence, what I’m aiming to express is that anything that can’t fit into the schema should be preserved exactly as-is in unknown fallback nodes. Given the alternatives I explored yesterday, it seems that using tag: 'p *' for the inline content is the best way forward for us.

Could you clarify the intended behavior of context? It was a bit ambiguous to me whether the context is supposed to be matched against the raw DOM or against the currently open node in the processed state (for example, when additional wrapping nodes are generated). From what I’ve tested, the latter is the current behavior. I’m just wondering whether it is intentional that the second div in this example does not get parsed with context: 'doc/', because the text is wrapped in an open paragraph during processing:

<div>abc</div>
123
<div>xyz</div>

The docs for context seem pretty clear. If you enumerate the textblock nodes in your schema there, you should be able to force anything under those to be parsed as an inline node, and then have a lower-precedence rule that catches others and parses them as block nodes.
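As a small illustration of that suggestion, the context string for the inline rule could be built from a hand-maintained list of textblock node names, joined with the pipe separator that the context docs describe. The helper name and the 'heading' node in the usage line are hypothetical; the schema in this thread only has paragraph.

```typescript
// Build a context expression that matches anywhere inside any of the given
// textblock node types, e.g. 'paragraph//|heading//'.
function inlineContext(textblockNames: readonly string[]): string {
  return textblockNames.map((name) => `${name}//`).join('|')
}

// Usage sketch: 'heading' is a hypothetical extra textblock node.
const unknownInlineContext = inlineContext(['paragraph', 'heading'])
```

The resulting string would go into the context field of the higher-priority inline rule, with the block rule left as the lower-priority catch-all.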

I’ve tried that approach, but as I mentioned, I keep running into issues caused by the automatic node wrapping. Here is the schema for that approach:

import { Schema } from 'prosemirror-model'

export const schema = new Schema({
  nodes: {
    doc: {
      content: 'block+'
    },

    paragraph: {
      group: 'block',
      content: 'inline*',
      parseDOM: [{ tag: 'p' }],
      toDOM() {
        return ['p', 0]
      }
    },

    unknown_block: {
      group: 'block',
      atom: true,
      draggable: true,
      attrs: {
        html: {}
      },
      parseDOM: [
        {
          tag: '*',
          getAttrs: (element) => ({
            html: element.outerHTML
          }),
          priority: 50
        }
      ],
      toDOM: (node) => {
        const $parser = document.createElement('div')
        $parser.innerHTML = node.attrs.html
        return $parser.firstElementChild!
      }
    },

    text: {
      group: 'inline'
    },

    hard_break: {
      group: 'inline',
      inline: true,
      selectable: false,
      parseDOM: [{ tag: 'br' }],
      toDOM() {
        return ['br']
      }
    },

    unknown_inline: {
      group: 'inline',
      inline: true,
      atom: true,
      draggable: true,
      attrs: {
        html: {}
      },
      parseDOM: [
        {
          tag: '*',
          context: 'paragraph//',
          getAttrs: (element) => ({
            html: element.outerHTML
          }),
          priority: 100
        }
      ],
      toDOM: (node) => {
        const $parser = document.createElement('div')
        $parser.innerHTML = node.attrs.html
        return $parser.firstElementChild!
      }
    }
  }
})

And it works fine as intended when HTML supplied follows that structure:

However, as soon as I add unwrapped text, it creates an open paragraph that sucks everything into it, because the context is matched against the current internal state, not the raw DOM:

If I reverse it and check context: 'doc/' instead, the result is less drastic, but there are still issues with the unwrapped text:

export const schema = new Schema({
  nodes: {
    doc: {
      content: 'block+'
    },

    paragraph: {
      group: 'block',
      content: 'inline*',
      parseDOM: [{ tag: 'p' }],
      toDOM() {
        return ['p', 0]
      }
    },

    unknown_block: {
      group: 'block',
      atom: true,
      draggable: true,
      attrs: {
        html: {}
      },
      parseDOM: [
        {
          tag: '*',
          context: 'doc/',
          getAttrs: (element) => ({
            html: element.outerHTML
          })
        }
      ],
      toDOM: (node) => {
        const $parser = document.createElement('div')
        $parser.innerHTML = node.attrs.html
        return $parser.firstElementChild!
      }
    },

    text: {
      group: 'inline'
    },

    hard_break: {
      group: 'inline',
      inline: true,
      selectable: false,
      parseDOM: [{ tag: 'br' }],
      toDOM() {
        return ['br']
      }
    },

    unknown_inline: {
      group: 'inline',
      inline: true,
      atom: true,
      draggable: true,
      attrs: {
        html: {}
      },
      parseDOM: [
        {
          tag: '*',
          getAttrs: (element) => ({
            html: element.outerHTML
          })
        }
      ],
      toDOM: (node) => {
        const $parser = document.createElement('div')
        $parser.innerHTML = node.attrs.html
        return $parser.firstElementChild!
      }
    }
  }
})

The div xyz gets parsed as an unknown inline instead of a block:

Right, if you have inline stuff at the top level (or in nodes you don’t parse), it’ll do that. Your best bet might be to pre-process the HTML and do the tagging of foreign inline/block nodes before giving it to ProseMirror’s parser.
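One possible shape for such a pre-processing pass is sketched below. It is not an established recipe: the `data-unknown` attribute name, the function name, and the two tag sets are assumptions based on the schema in this thread (paragraph and hard_break as the only known elements). The `ElementLike` type only exists so the sketch runs without DOM typings.

```typescript
// Minimal structural type so the sketch runs without DOM typings;
// a real Element satisfies it.
interface ElementLike {
  tagName: string
  children: ArrayLike<ElementLike>
  setAttribute(name: string, value: string): void
}

// Assumption: tags the schema knows about (paragraph and hard_break).
const KNOWN_TAGS = new Set(['P', 'BR'])
// Assumption: known tags that parse to textblock nodes.
const TEXTBLOCK_TAGS = new Set(['P'])
// Hypothetical attribute name, not something ProseMirror defines.
const MARKER = 'data-unknown'

// Walk the raw DOM and mark each outermost unknown element as 'inline' when
// it sits inside a known textblock element, and as 'block' otherwise.
function tagUnknownElements(parent: ElementLike, insideTextblock = false): void {
  for (const child of Array.from(parent.children)) {
    const tag = child.tagName.toUpperCase()
    if (KNOWN_TAGS.has(tag)) {
      tagUnknownElements(child, insideTextblock || TEXTBLOCK_TAGS.has(tag))
    } else {
      // Unknown element: mark it and do not descend, since it is kept whole.
      child.setAttribute(MARKER, insideTextblock ? 'inline' : 'block')
    }
  }
}
```

The parse rules could then use tag: '[data-unknown="block"]' and tag: '[data-unknown="inline"]' instead of relying on context, so the decision is made against the raw DOM rather than the parser's open nodes. One caveat: outerHTML would include the marker attribute unless getAttrs removes it first. For pasted content, the transformPastedHTML editor prop may be a place to run the same pass, though I have not verified that here.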