How to map ProseMirror Transactions to offset-based changes in a plain-text

Bistard · January 17, 2025, 12:26am

Hello everyone,

I’m currently developing a cross-platform Markdown note-taking application using Electron, and I’m facing a serious architectural issue with my editor. My editor is encapsulated in an EditorWidget class, which is composed of two parts:

EditorModel – Responsible for reading source files from disk, tokenizing them, and converting them into a ProseMirror document tree for rendering. (This step is what I call “deserialization.”)
EditorView – Holds a ProseMirror instance that renders the editor content.

Up to now, my approach has been to re-serialize the entire ProseMirror document tree back into plain text every time ProseMirror detects a change (via Transactions), and then I overwrite the file on disk. However, I’ve run into a critical limitation: in Markdown, there are often multiple ways to represent the same formatting. For example, code blocks can be delimited by either triple backticks (```) or triple tildes (~~~). My current approach unifies all such blocks into triple backticks upon serialization, losing the original syntax choice.

I want to avoid that loss of information. Ideally, I’d like my software to only modify the parts of the source file that actually changed, keeping all other syntax details (like ~~~ vs. ``` for code blocks) intact. To that end, I’ve introduced a piece-table–based TextBuffer in my EditorModel as the single source of truth for the text. The buffer provides fine-grained APIs to modify the original file based on character offsets.

My question: How can I map ProseMirror’s Transactions (which describe structural changes to the document) back to offset-based changes in my piece table, in a way that preserves the original Markdown source text and modifies only the parts that changed? I’m hoping to find people who have faced a similar problem and to get pointed in the right direction. Happy to discuss anything here!

Additional Context:

I found out there is a library called prosemirror-changeset but I’m not sure this is exactly what I need.
I talked with gpt-o1 for a while; its suggestion was to maintain a mapping table that maps ProseMirror offsets to plain-text offsets.

marijn · January 17, 2025, 7:46am

This is going to be really difficult. Doing this for plain text is more or less doable, since you can count characters and line breaks in a stretch of ProseMirror content to get a text offset. Doing it for Markdown, which has its own complicated rules for where markup characters are inserted, sounds like it’ll require a bunch of very complicated logic, or probably an entire custom serializer, to make sure your serialization output agrees with your character-position counting logic.
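
To make the plain-text case concrete, here is a minimal sketch of the counting idea. It does not use ProseMirror's API: the document is modeled as a flat list of text-only blocks, and `TextBlocks`, `OffsetChange`, `posToTextOffset`, and `toOffsetChange` are hypothetical names made up for illustration. It assumes each block costs one position for its opening token and one for its closing token, and that blocks are joined by a single "\n" in the text. Nested nodes, marks, and any Markdown markup are out of scope.

```typescript
// Simplified stand-in for a document: a flat list of textblocks that hold
// only plain text. This is an assumption, not ProseMirror's data model.
type TextBlocks = string[];

// Hypothetical shape of an offset-based edit for a text buffer.
interface OffsetChange {
  offset: number;       // where the edit starts in the plain text
  deleteLength: number; // how many characters to remove
  insertText: string;   // text to insert at `offset`
}

// Convert a document position into a plain-text offset by counting
// characters and the "\n" separators between blocks.
function posToTextOffset(blocks: TextBlocks, pos: number): number {
  let nodeStart = 0; // document position just before the block's opening token
  let textStart = 0; // plain-text offset of the block's first character
  for (const text of blocks) {
    const contentStart = nodeStart + 1;
    const contentEnd = contentStart + text.length;
    // Positions on the block boundary map to the start of this block's text.
    if (pos <= contentStart) return textStart;
    if (pos <= contentEnd) return textStart + (pos - contentStart);
    nodeStart = contentEnd + 1;   // step over the closing token
    textStart += text.length + 1; // step over the "\n" separator
  }
  // Past the last block: clamp to the end of the text.
  return Math.max(0, textStart - 1);
}

// Turn a replacement of document range [from, to) into an offset-based edit.
// Both positions refer to the document as it was before the change.
function toOffsetChange(
  blocks: TextBlocks,
  from: number,
  to: number,
  insertText: string
): OffsetChange {
  const offset = posToTextOffset(blocks, from);
  const deleteLength = posToTextOffset(blocks, to) - offset;
  return { offset, deleteLength, insertText };
}
```

For the blocks `["Hello", "World"]`, the text is "Hello\nWorld", and a replacement spanning positions 3 to 10 covers "llo", the paragraph break, and "Wo", so the resulting edit also deletes the "\n". Once Markdown markup enters the picture, this simple counting no longer matches the source text, which is the hard part described above.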

Doing this per-block rather than per-character might make it slightly easier, but you’ll still need some way to locate a given ProseMirror block in the markdown text.
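
One way the per-block idea could look, as a rough sketch: record the source range of every top-level block when the Markdown is parsed, and when a block changes, re-serialize only that block and splice it into its recorded range. Everything here is an assumption for illustration (`BlockRange` and `replaceBlockSource` are hypothetical names, and a plain string stands in for the piece table). How the ranges are obtained from the tokenizer, and how a changed ProseMirror block is matched to its range, is left open.

```typescript
// Source range of one top-level block in the Markdown text, as half-open
// [start, end) character offsets. Assumed to be recorded at parse time.
interface BlockRange {
  start: number;
  end: number;
}

// Replace the source text of a single block and shift the ranges of all
// later blocks by the resulting length difference. Blocks before `index`
// keep both their text and their ranges, so their original syntax survives.
function replaceBlockSource(
  source: string,
  ranges: BlockRange[],
  index: number,
  newText: string
): { source: string; ranges: BlockRange[] } {
  const target = ranges[index];
  if (!target) throw new RangeError(`no block at index ${index}`);

  const delta = newText.length - (target.end - target.start);
  const nextSource =
    source.slice(0, target.start) + newText + source.slice(target.end);

  const nextRanges = ranges.map((r, i) => {
    if (i < index) return r; // untouched blocks before the edit
    if (i === index) return { start: r.start, end: r.end + delta };
    return { start: r.start + delta, end: r.end + delta }; // shifted blocks
  });

  return { source: nextSource, ranges: nextRanges };
}
```

With a source of "~~~\ncode\n~~~\n\nHello\n" and ranges for the code block and the paragraph, replacing only the paragraph leaves the `~~~` fence exactly as written, because the code block's text is never re-serialized.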