Investigation on Nested Marks in ProseMirror

Bistard · October 23, 2024, 11:56pm

Introduction

I’m using marked as a tokenizer to convert raw Markdown into tokens, and I’ve developed my own parser (based on ProseMirror’s official default Markdown parser) to transform these tokens into ProseMirror-compatible document nodes.

Background

My parser is closely modeled on the code provided by prosemirror-markdown. For example, similar to openMark() and closeMark() from prosemirror-markdown, I use the following methods to handle mark tokens:

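// Adds `mark` to the marks applied to inline nodes created from now on.
// A mark of the same type that is already in the set is not duplicated.
public activateMark(mark: ProseMark): void {
    const active = this.__getActive();
    active.marks = mark.addToSet(active.marks);
}

// Removes every mark of the given type from the active node's mark set.
public deactivateMark(mark: ProseMarkType): void {
    const active = this.__getActive();
    active.marks = mark.removeFromSet(active.marks);
}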

The Issue (Corner Case)

Consider this Markdown input: *This is *italic* text*.

The tokenized result from marked looks like this:

Paragraph [block]
1. Em [inline]
  1. Text: “This is” [inline]
  2. Em: “italic” [inline]
    1. Text: “italic” [inline]
  3. Text: “text” [inline]

In the parser:

Each time a mark is activated (activateMark), it adds a mark to the active node’s markSet (in this case, the active node is the paragraph).
Each time a mark is closed (deactivateMark), the corresponding mark is removed from the markSet of the active node.

The Problem

When activating the same type of mark consecutively (like two em marks in this case), only one instance of the mark is added to the markSet. As a result, two activateMark calls will still leave just one em mark in the markSet.

However, deactivateMark is still called twice (once for each nested em). The first call removes the single em from the markSet, so the second call tries to remove an em mark that no longer exists and does nothing.

Analysis of the Corner Case

In the nested case *This is *italic* text*, here’s how the bug manifests:

1. The first activateMark for *This is *italic* text* adds an em mark to the markSet.
2. The second activateMark for *italic* doesn’t add a second em mark because the markSet can only hold one instance of the same mark type.
3. When the first deactivateMark for *italic* is called, it removes the em mark from the paragraph.
4. When creating a text node with the text “text”, the parser obtains the markSet from the current active node (the paragraph), but the first deactivateMark has already removed the em. As a result, the “text” part of the string no longer has an em mark, even though it should.
5. When the second deactivateMark for *This is *italic* text* is called, it tries to remove an em mark from the paragraph’s markSet, but there is nothing there anymore due to step 3.
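The sequence above can be replayed in a small self-contained sketch. It does not use ProseMirror: the mark set is modeled as a plain array of mark-type names, and addToSet, removeFromSet, TextRun and replayNestedEm are hypothetical stand-ins that only mimic the "one instance per mark type" behavior described in this post.

```typescript
// Hypothetical model: a mark set is just a list of mark-type names.
type MarkSet = readonly string[];

// Adding a type that is already present leaves the set unchanged.
function addToSet(set: MarkSet, mark: string): MarkSet {
    return set.indexOf(mark) !== -1 ? set : [...set, mark];
}

// Removing a type drops it entirely; removing an absent type is a no-op.
function removeFromSet(set: MarkSet, mark: string): MarkSet {
    return set.filter(m => m !== mark);
}

interface TextRun {
    text: string;
    marks: MarkSet;
}

// Replays the token order for *This is *italic* text*.
function replayNestedEm(): TextRun[] {
    let marks: MarkSet = [];
    const out: TextRun[] = [];

    marks = addToSet(marks, "em");       // outer em opens
    out.push({ text: "This is ", marks });

    marks = addToSet(marks, "em");       // inner em opens: set unchanged
    out.push({ text: "italic", marks });

    marks = removeFromSet(marks, "em");  // inner em closes: the only em is gone
    out.push({ text: " text", marks });  // created without em

    marks = removeFromSet(marks, "em");  // outer em closes: nothing to remove
    return out;
}

const runs = replayNestedEm();
```

In this model the first two runs carry em and the last one carries no marks at all, which matches the behavior I see in my parser.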

Conclusion

Due to this issue, the final part "text" in the parsed document incorrectly lacks the em mark, even though the tokenized result places it inside the outer em. The problem arises because the parser does not handle nested activation and deactivation of the same mark type correctly.

Request and Question

I am aware that there are several ways to solve this issue without modifying ProseMirror’s core code, but they all feel inelegant and a little messy to me.

Therefore, I would like to ask: is it possible to provide a new feature that would allow the client to decide whether to enable nested marks when activating them? This could provide a more elegant and configurable solution to handle such cases internally. Thank you.

marijn · October 24, 2024, 8:04am

I don’t think so. Marks are explicitly defined to not nest, and changing that would cause a whole bunch of new complications and issues.

This problem sounds like it should be solved on the level of the Markdown parser, by collapsing nested marks into a single mark with the range of the outer one. (And no, you cannot losslessly represent all Markdown documents as a ProseMirror document. That is not something the library tries to provide.)
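One possible way to do that collapsing in a custom parser is to count how deeply each mark type is currently nested and only touch the mark set on the outermost open and close. The sketch below is an illustration under that assumption, not code from prosemirror-markdown; MarkTracker and its methods are hypothetical names, and marks are modeled as plain strings rather than ProseMirror Mark objects.

```typescript
// Hypothetical helper: tracks nesting depth per mark type so that nested
// marks of the same type collapse into the range of the outermost one.
class MarkTracker {
    private marks: string[] = [];
    private depth = new Map<string, number>();

    // Only the outermost activation actually adds the mark.
    activate(mark: string): void {
        const d = this.depth.get(mark) ?? 0;
        this.depth.set(mark, d + 1);
        if (d === 0) this.marks = [...this.marks, mark];
    }

    // Only the matching outermost deactivation removes the mark.
    deactivate(mark: string): void {
        const d = this.depth.get(mark) ?? 0;
        if (d === 0) return; // unbalanced close: ignore
        this.depth.set(mark, d - 1);
        if (d === 1) this.marks = this.marks.filter(m => m !== mark);
    }

    // Returns a copy of the marks that apply to the next text node.
    current(): string[] {
        return [...this.marks];
    }
}

// Replaying *This is *italic* text* with the tracker:
const tracker = new MarkTracker();
tracker.activate("em");                     // outer em opens
tracker.activate("em");                     // inner em opens, depth 2
tracker.deactivate("em");                   // inner em closes, depth 1
const marksForTrailingText = tracker.current();
tracker.deactivate("em");                   // outer em closes, depth 0
const marksAfterParagraph = tracker.current();
```

With this bookkeeping the trailing "text" run still carries em, and the set is empty only after the outer em closes. The same counting could live inside activateMark and deactivateMark, keyed by mark type.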