How to deal with nodes being dropped during parsing

Let’s say I am pasting HTML that looks like this:

<p>Before Image<img src="someUrl" />After Image</p>

In the schema, paragraphs and images both belong to the block group, and blocks can only contain inline items. After parsing, I expect a structure that looks like this:

  <p>Before Image</p>
  <img src="someUrl" />
  <p>After Image</p>

However, the img element is removed when parsing. I dug into the internals of prosemirror-model and it looks like when “findPlace” is called, the img tag can’t be wrapped in anything to be valid inside of a paragraph and it is then dropped. If you force “solid” to be false, it seems to work since it resolves it to the topLevel node (the doc in this case). Ideally, we don’t want any unexpected data loss when parsing HTML.

The only ways we can get around this right now when parsing are to:

  • a) Normalize the HTML/Dom Node before parsing
  • b) Prevent the parsing of the paragraph and let the text nodes wrap themselves in their own paragraph when “findWrapping” is called.
  • c) Overwrite PM internals

“a” is not ideal because there could be many corner cases that we could miss. “b” is not ideal because we could lose attributes that were part of the paragraph tag. “c” is not ideal because it’s easy to break stuff, especially when we upgrade.
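To make option “a” concrete, here is a minimal sketch of the kind of normalization we mean. It works on a made-up tree type (`TreeNode`, `ElementNode`) rather than real DOM nodes so that it runs on its own, and `BLOCK_ONLY_TAGS` and `splitParagraph` are hypothetical names, not part of ProseMirror. It only looks at direct children of the paragraph, which is exactly the sort of corner case that worries us.

```typescript
// Hypothetical minimal tree model; a real version would walk DOM nodes instead.
type TextNode = { kind: "text"; text: string };
type ElementNode = {
  kind: "element";
  tag: string;
  attrs: Record<string, string>;
  children: TreeNode[];
};
type TreeNode = TextNode | ElementNode;

// Assumption: tags that the schema treats as blocks and that cannot sit inside a paragraph.
const BLOCK_ONLY_TAGS: string[] = ["img"];

function isBlockOnly(node: TreeNode): node is ElementNode {
  return node.kind === "element" && BLOCK_ONLY_TAGS.indexOf(node.tag) !== -1;
}

// Split a paragraph around block-only children, copying the paragraph's
// attributes onto every piece so they are not lost.
function splitParagraph(p: ElementNode): ElementNode[] {
  // Nothing to split: return the paragraph untouched (this also keeps empty paragraphs).
  if (!p.children.some(isBlockOnly)) return [p];

  const result: ElementNode[] = [];
  let run: TreeNode[] = [];

  // Emit the inline content collected so far as its own paragraph, if there is any.
  const flush = (): void => {
    if (run.length > 0) {
      result.push({ kind: "element", tag: p.tag, attrs: { ...p.attrs }, children: run });
      run = [];
    }
  };

  for (const child of p.children) {
    if (isBlockOnly(child)) {
      flush();
      result.push(child); // the block-only node becomes a sibling of the pieces
    } else {
      run.push(child);
    }
  }
  flush();
  return result;
}
```

An image nested deeper (for example inside a `<strong>` inside the paragraph) is not handled here and would still be dropped by the parser.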

If we could somehow overwrite the “solid” attribute in the NodeContext when parsing the slice, that could solve our problems, but I’m not sure if that would cause any other issues.

Does anyone have any thoughts?

It should be possible to make the DOM parser smarter about this case, I guess. If a node can only be placed ‘above’ the current context, the parser could temporarily exit that context and then, if there’s more content that fits after the node, restore it. But that code currently doesn’t exist, and the parser is already somewhat subtle and complicated as it is, so adding that functionality wouldn’t be trivial.

“Solid” currently means ‘this context corresponds to an actual parsed node’ (as opposed to a made-up parent node to make something fit). Even if you somehow get it to be false in that situation, you’d still have an issue with returning to the original context after the image node—paragraphs are likely your default textblock, but if you had, say, a code block or header around the image, the text after it would end up in a plain paragraph. The current code leans towards preserving successfully parsed context vs forcing a place for a new node.
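The difference between the two behaviours can be illustrated with a toy model. This is only a sketch, not how prosemirror-model is implemented: `Block`, `Leaf`, `Item`, `placeItems`, and `DEFAULT_TEXTBLOCK` are all hypothetical names, and the real parser tracks a stack of contexts rather than a single one. With `restoreContext` set to `false`, text after the image falls back to the default textblock, roughly what forcing “solid” to false would give you. With `true`, the original node type and attributes are reopened, and only when more content actually follows.

```typescript
// Toy output model; these are not prosemirror-model types.
type Attrs = Record<string, string>;
type Block = { type: string; attrs: Attrs; text: string };
type Leaf = { type: string; attrs: Attrs };
type Item = { text: string } | { leaf: Leaf };

// Hypothetical stand-in for the schema's default textblock.
const DEFAULT_TEXTBLOCK = "paragraph";

function placeItems(
  context: { type: string; attrs: Attrs },
  items: Item[],
  restoreContext: boolean
): (Block | Leaf)[] {
  const out: (Block | Leaf)[] = [];
  let open: Block | null = { type: context.type, attrs: { ...context.attrs }, text: "" };
  let exited = false;

  for (const item of items) {
    if ("leaf" in item) {
      // The leaf only fits above the current context: close the context
      // (skipping it if it is still empty) and place the leaf one level up.
      if (open && open.text !== "") out.push(open);
      open = null;
      exited = true;
      out.push(item.leaf);
    } else {
      // Inline content: lazily reopen a context only now that it is needed.
      if (!open) {
        open = restoreContext
          ? { type: context.type, attrs: { ...context.attrs }, text: "" }
          : { type: DEFAULT_TEXTBLOCK, attrs: {}, text: "" };
      }
      open.text += item.text;
    }
  }

  // Keep an untouched (possibly empty) original context, but never emit an
  // empty block that would exist only because the context was exited.
  if (open && (open.text !== "" || !exited)) out.push(open);
  return out;
}
```

For a level 1 heading containing “Before”, an image, and “After”, the restoring variant yields heading, image, heading, while the fallback variant yields heading, image, paragraph.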

Thank you for the explanation! For now we opted to not parse the wrapping paragraph if it has an image. The text nodes are then successfully wrapped in paragraphs since that is our default text block. The problem is that we could lose important attributes on the paragraph. The resulting wrapper paragraphs will have to be default paragraphs. If you have additional ideas, we could try to take a stab at adding parsing functionality that tries to prevent that type of data loss. It seems like a “lift” step could be added and an additional option could be passed in to make sure the parse functionality was backwards compatible.
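For anyone landing here later, the workaround looks roughly like the sketch below. It relies on the documented behaviour that a parse rule's `getAttrs` returning `false` means the rule does not match. `ElementLike` is a stand-in for the bits of the DOM `Element` API used, so the snippet runs without a browser. `paragraphGetAttrs` is our own name, and `align` is a hypothetical paragraph attribute, used only to show what gets lost.

```typescript
// Structural stand-in for the part of the DOM Element API used here.
interface ElementLike {
  querySelector(selector: string): unknown;
  getAttribute(name: string): string | null;
}

// Meant to be used as the `getAttrs` of the parse rule for "p".
// Returning false makes the rule not match, so the parser descends into the
// children and wraps loose text in the default textblock instead.
function paragraphGetAttrs(dom: ElementLike): { align: string | null } | false {
  // Any <img> descendant means we skip parsing this paragraph as a paragraph.
  if (dom.querySelector("img") !== null) return false;
  // `align` is a hypothetical attribute; it is lost whenever the rule is skipped.
  return { align: dom.getAttribute("align") };
}

// Shape of the rule as it might appear in the paragraph node spec's parseDOM array.
const paragraphParseRule = { tag: "p", getAttrs: paragraphGetAttrs };
```

The downside is the one described above: once the rule is skipped, the paragraphs created around the image are plain defaults, so anything `getAttrs` would have read from the original tag is gone.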

Making sense of random, arbitrary HTML is difficult :pensive:

A pull request that implements temporarily leaving the current context to place a node and then returning to it (if necessary—shouldn’t create an empty node when there’s no more parsable content in the node) would be welcome, if you have time to work on it.