Introduction
I’m using marked as a tokenizer to convert raw Markdown into tokens, and I’ve developed my own parser (based on ProseMirror’s official default Markdown parser) to transform these tokens into ProseMirror-compatible document nodes.
Background
My parser is closely modeled on the code provided by prosemirror-markdown. For example, similar to openMark() and closeMark() from prosemirror-markdown, I use the following methods to handle mark tokens:
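// Adds the mark to the mark set of the currently active node.
public activateMark(mark: ProseMark): void {
    const active = this.__getActive();
    active.marks = mark.addToSet(active.marks);
}

// Removes marks of the given type from the mark set of the currently active node.
public deactivateMark(mark: ProseMarkType): void {
    const active = this.__getActive();
    active.marks = mark.removeFromSet(active.marks);
}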
The Issue (Corner Case)
Consider this Markdown input: *This is *italic* text*.
The tokenized result from marked looks like this:
- Paragraph [block]
  - Em [inline]
    - Text: “This is” [inline]
    - Em: “italic” [inline]
      - Text: “italic” [inline]
    - Text: “text” [inline]
- Em [inline]
In the parser:
- Each time a mark is activated (activateMark), it adds a mark to the active node’s markSet (in this case, the active node is the paragraph).
- Each time a mark is closed (deactivateMark), the corresponding mark is removed from the markSet of the active node.
The Problem
When the same type of mark is activated twice in a row (like the two em marks in this case), only one instance of the mark is added to the markSet. As a result, two activateMark calls still leave just one em mark in the markSet.
However, deactivateMark is also called twice (once for each nested em). The first call removes the single em from the markSet, so the second call has nothing left to remove.
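The asymmetry can be reproduced with a small stand-in for the mark set. This is only a sketch: addToSet and removeFromSet below are simplified helpers over plain strings that mimic the behaviour described above, not the prosemirror-model API.

```typescript
// Simplified stand-in for a mark set (NOT the prosemirror-model API):
// marks are plain strings and the set is an array without duplicates.
function addToSet(set: string[], mark: string): string[] {
  // Set semantics: an already-present mark is not added again.
  return set.indexOf(mark) !== -1 ? set : set.concat(mark);
}

function removeFromSet(set: string[], mark: string): string[] {
  // Removes the mark if present; otherwise returns an equal copy.
  return set.filter((m) => m !== mark);
}

let markSet: string[] = [];
markSet = addToSet(markSet, "em"); // outer em opens
markSet = addToSet(markSet, "em"); // inner em opens: nothing changes
const afterTwoOpens = markSet.length; // 1

markSet = removeFromSet(markSet, "em"); // inner em closes
const afterFirstClose = markSet.length; // 0, although the outer em is still open

markSet = removeFromSet(markSet, "em"); // outer em closes: nothing left to remove
const afterSecondClose = markSet.length; // 0
```

Two opens followed by one close already empty the set, which is the root of the corner case.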
Analysis of the Corner Case
In the nested case *This is *italic* text*, here’s how the bug manifests:
1. The first activateMark, for *This is *italic* text*, adds an em mark to the markSet.
2. The second activateMark, for *italic*, doesn’t add a second em mark because the markSet can only hold one instance of the same mark type.
3. When the first deactivateMark, for *italic*, is called, it removes the em mark from the paragraph.
4. When creating a text node with the text “text”, the parser obtains the markSet from the current active node (the paragraph), but the first deactivateMark has already removed the em. As a result, the “text” part of the string no longer has an em mark, even though it should.
5. When the second deactivateMark, for *This is *italic* text*, is called, it tries to remove an em mark from the paragraph’s markSet, but there is nothing there anymore due to step 3.
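The sequence above can be traced with a toy model of the parser state. The names ParseEvent, EmittedText and runNaive are hypothetical and exist only for this illustration; the model keeps one flat mark set for the active node, as described, and is not the real parser.

```typescript
// Toy model of the walkthrough above; all names are illustrative only.
type ParseEvent =
  | { kind: "open"; mark: string }
  | { kind: "close"; mark: string }
  | { kind: "text"; text: string };

interface EmittedText {
  text: string;
  marks: string[];
}

function runNaive(events: ParseEvent[]): EmittedText[] {
  let active: string[] = []; // mark set of the active node (the paragraph)
  const out: EmittedText[] = [];
  for (const ev of events) {
    if (ev.kind === "open") {
      // Same-type marks collapse into a single entry.
      if (active.indexOf(ev.mark) === -1) active = active.concat(ev.mark);
    } else if (ev.kind === "close") {
      // The first close already removes the only entry.
      active = active.filter((m) => m !== ev.mark);
    } else {
      // Text nodes copy whatever marks are active right now.
      out.push({ text: ev.text, marks: active.slice() });
    }
  }
  return out;
}

// Event stream for: *This is *italic* text*
const traced = runNaive([
  { kind: "open", mark: "em" },
  { kind: "text", text: "This is " },
  { kind: "open", mark: "em" },
  { kind: "text", text: "italic" },
  { kind: "close", mark: "em" },
  { kind: "text", text: " text" },
  { kind: "close", mark: "em" },
]);
// traced[0].marks is ["em"], traced[1].marks is ["em"], traced[2].marks is []
```

The last text node comes out without any marks, matching step 4.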
Conclusion
Due to this issue, the final text node "text" in the parsed document incorrectly lacks the em mark. The problem arises because the parser does not handle nested activation and deactivation of the same mark type correctly.
Request and Question
I am aware that there are several ways to solve this issue without modifying ProseMirror’s core code, but I find them less elegant and somewhat messy in terms of coding style.
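As an illustration of that kind of client-side workaround, one option is to count how many times each mark type has been opened and only drop the mark when the count returns to zero. The sketch below uses a hypothetical NestedMarkTracker class over plain strings; it is an assumption about what such a workaround could look like, not code from prosemirror-markdown.

```typescript
// Hypothetical depth-counting workaround; marks are modeled as plain strings.
class NestedMarkTracker {
  private depth: Record<string, number> = {};
  private marks: string[] = [];

  open(mark: string): void {
    const n = this.depth[mark] || 0;
    this.depth[mark] = n + 1;
    // Only the outermost open actually adds the mark.
    if (n === 0) this.marks = this.marks.concat(mark);
  }

  close(mark: string): void {
    const n = this.depth[mark] || 0;
    if (n === 0) return; // unmatched close: ignore
    this.depth[mark] = n - 1;
    // Only the outermost close actually removes the mark.
    if (n === 1) this.marks = this.marks.filter((m) => m !== mark);
  }

  current(): string[] {
    return this.marks.slice();
  }
}

const tracker = new NestedMarkTracker();
tracker.open("em");  // outer em
tracker.open("em");  // inner em
tracker.close("em"); // inner em closes
const marksForTrailingText = tracker.current(); // ["em"]: outer em still applies
tracker.close("em"); // outer em closes
const marksAtEnd = tracker.current(); // []
```

This works, but it means carrying extra bookkeeping next to the mark set, which is the kind of clutter I would like to avoid.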
Therefore, I would like to ask: is it possible to provide a new feature that would allow the client to decide whether to enable nested marks when activating them? This could provide a more elegant and configurable solution to handle such cases internally. Thank you.