Thoughts on offsets and positions

I realize this might be a dumb question, but somebody once said there is no such thing, so here goes.

I’ve searched through the posts here, the docs on the site, and even looked around in the source code, but for the life of me, I can’t figure out the right way to deal with positions. Thanks to my latest post, I now get how useful ResolvedPos is, but you first need a context for it, perhaps a selection or a transaction.

I’m sure there’s an “easy” (read: intended) way to do this, but I just can’t seem to reason about it correctly.

To be specific, as I said in my last post, I’m building an editor. I’m currently implementing a custom spellchecking solution, and given the problems highlighted in this forum and on GitHub, I’ve decided to go about it a little differently. So, I get all the misspelled words in the document, but from another process. That means the offsets are off, and so when I try to “decorate” the misspelled words, I get the wrong ones (except for the first paragraph, which makes sense, since I suppose it’s a matter of dealing with paragraph breaks).

I guess my question is: is there a consistently reliable way to map external positions (offsets from the same exact file) to PM positions?

That depends on how those external offsets are counted. If you’re dealing with string offsets, you’re probably going to have a bad time, since those have no real correspondence with ProseMirror document offsets, because the number of ProseMirror tokens between two pieces of text often won’t correspond to the number of newlines your text representation puts between them.

One approach would be to represent your text as a set of (offset, string) pairs, one per textblock, and then add the start offset to the string offset when going back to ProseMirror coordinates. If this is asynchronous (which it sounds like), you should also make sure to map those offsets through any steps that happened since you first recorded them, so that they still align with the current document.
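A rough sketch of that idea, in case it helps. It assumes `doc` is a ProseMirror document node and that every inline node inside a textblock is either text or a leaf; the helper names (`collectTextblocks`, `toDocRange`, `remapRange`) are made up for illustration and are not part of the library.

```javascript
// Sketch only: helper names are hypothetical, `doc` is assumed to be a ProseMirror Node.

// Collect one { start, text } pair per textblock.
function collectTextblocks(doc) {
  const blocks = [];
  doc.descendants((node, pos) => {
    if (!node.isTextblock) return true; // keep descending into wrappers (lists, blockquotes)
    blocks.push({
      // `pos` points just before the node, so its content starts one token later.
      start: pos + 1,
      // "\ufffc" stands in for inline leaf nodes (images, hard breaks) so that
      // each one takes up one character, just as it takes up one position.
      text: node.textBetween(0, node.content.size, undefined, "\ufffc"),
    });
    return false; // nothing more to collect inside a textblock
  });
  return blocks;
}

// Turn a string range inside one block into document positions.
function toDocRange(block, from, to) {
  return { from: block.start + from, to: block.start + to };
}

// Map a recorded range through the transactions applied since it was recorded.
// Returns null when the range has been deleted or collapsed.
function remapRange(range, transactions) {
  let { from, to } = range;
  for (const tr of transactions) {
    from = tr.mapping.map(from, 1); // stay to the right of text inserted at the start
    to = tr.mapping.map(to, -1); // stay to the left of text inserted at the end
  }
  return from < to ? { from, to } : null;
}
```

You would send each block's `text` to the spellchecker, convert what comes back with `toDocRange`, and pass the result through `remapRange` with whatever transactions were applied in the meantime. If the ranges end up as decorations in plugin state, `DecorationSet.map(tr.mapping, tr.doc)` should cover the remapping part for you.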


[quote="marijn, post:2, topic:706"] That depends on how those external offsets are counted. If you’re dealing with string offsets, you’re probably going to have a bad time…[/quote] Yeah, they’re string offsets. Basically, from the UI (Electron), I tell the main process to open the same file, read its contents, spell check the text, and send back a list of errors and their locations to the renderer process, where PM resides.

It is async indeed. The problem is (at least the way I’ve approached things now) that I don’t start from the PM document. Well, I have access to it, that’s not the issue, but rather than send the text, I simply tell the backend to read the file on disk (which is the same content).

Perhaps I could send the document instead, then use that as a source of truth? Can I work with nodes outside of an active PM instance?

Definitely.

Awesome, I thought to myself. And it is. But I’m facing a problem still. Here’s what I did.

I send the file path from my view to my main process. There I read the file and parse the markdown with PM to get a document. I take content.content from the doc to get the paragraph nodes and iterate over them, getting each paragraph’s text with node.textContent. Since I have the node, I get the position at its beginning and run my spellcheck, which returns an array of errors and their offsets. Since they’re all for this paragraph, I map over those offsets and add the paragraph’s start position to each. That gives me a multidimensional array, which I flatten and send back. Still, the positions don’t match the ProseMirror document, which is the same file, parsed with the same parser (defaultMarkdownParser).

I can’t figure out why. Seems I’m not using a specific part of the API the way it was intended, but can’t figure out which part.

I’m new to ProseMirror and I’m not sure I understand your issue. But what caught my eye in your explanation is that you use textContent. My understanding of textContent is that it is a concatenation of the children’s text. So I’m not sure you can use it to compute positions, because then you “forget” about the child nodes. Maybe you should use the text field instead and iterate over the paragraph’s children.
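Something along these lines is what I have in mind. It is only a sketch: it assumes `block` is a ProseMirror textblock node and `blockPos` is the position just before it, and the helper names (`indexTextblock`, `offsetToPos`, `rangeToPos`) are invented for the example.

```javascript
// Sketch only: helper names are hypothetical.

// Concatenate the text children of a textblock and remember where each one
// sits in the document, so string offsets can be translated back later.
function indexTextblock(block, blockPos) {
  let text = "";
  const segments = [];
  block.forEach((child, offset) => {
    // Non-text inline nodes still advance `offset`, but add no characters.
    if (!child.isText) return;
    segments.push({
      strStart: text.length, // where this child starts in `text`
      docStart: blockPos + 1 + offset, // where it starts in the document
      length: child.text.length,
    });
    text += child.text;
  });
  return { text, segments };
}

// Translate the offset of one character in `text` to a document position.
function offsetToPos(segments, strOffset) {
  for (const seg of segments) {
    if (strOffset >= seg.strStart && strOffset < seg.strStart + seg.length) {
      return seg.docStart + (strOffset - seg.strStart);
    }
  }
  return null;
}

// Translate a [from, to) string range to document positions.
function rangeToPos(segments, from, to) {
  const start = offsetToPos(segments, from);
  const last = offsetToPos(segments, to - 1);
  return start === null || last === null ? null : { from: start, to: last + 1 };
}
```

The point is to spellcheck the `text` returned here and go through `segments` on the way back, instead of adding the paragraph start to a raw textContent offset. A hard break or image in the middle of the paragraph then no longer shifts everything after it by one.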

@sacha, you are absolutely correct. I did manage to get it working using textContent, by iterating over all nodes and resolving their position in the document, then using their start position to map it onto the textContent offsets. It is heavy and definitely not the best approach though. Once I’ve found the right approach, I’ll be sure to post it here for anyone facing a similar issue.

So, I’ve had some time to go through this again, here’s what I found out in case someone finds themselves with a similar need. Since, like @marijn pointed out, we can work with a PM document outside of an instantiated editor, I did just that. I send over the path of the file to the “backend”. There I read the file contents, parse the markdown with the PM parser and get the text. Here’s the piece of code for that:

const txt = doc.textBetween(0, doc.content.size, '\n\n');

The trick here, in order to match the string offsets, is to have not one, but two characters for the blockSeparator argument of the textBetween function. Not sure about this, but I think it’s because that’s actually the way the PM serializer serializes separate paragraphs back to markdown, with two line breaks between them.
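Another way to look at the two-character separator is to count tokens. In ProseMirror's indexing, leaving one paragraph and entering the next costs two positions (one closing token, one opening token), so a two-character separator keeps the string and the document in step. Here is a toy calculation for a document made only of top-level paragraphs; it is plain arithmetic rather than ProseMirror code, and `flatParagraphTable` is a name made up for this post.

```javascript
// Toy model: a document that contains nothing but top-level paragraphs.
// Not ProseMirror code; `flatParagraphTable` is a hypothetical helper.
function flatParagraphTable(paragraphs, separatorLength = 2) {
  const rows = [];
  let strStart = 0; // offset of the paragraph's first character in the joined string
  let docStart = 1; // position 0 sits before the first paragraph's opening token
  for (const p of paragraphs) {
    rows.push({ text: p, strStart, docStart, drift: docStart - strStart });
    strStart += p.length + separatorLength; // text plus separator characters
    docStart += p.length + 2; // text plus one closing and one opening token
  }
  return rows;
}

// With two separator characters the drift stays constant at 1 for every row.
console.log(flatParagraphTable(["ab", "cde", "f"]));
// With a single "\n" the drift grows by one per paragraph.
console.log(flatParagraphTable(["ab", "cde", "f"], 1));
```

In this model the document position is always the string offset plus one, since position 0 is before the first paragraph's opening token, so it is worth checking whether your offsets need that adjustment.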

Now you can run your favorite spellchecking solution, knowing that whatever positions you get back for your errors will match your PM document. I personally use the excellent retext module, which has an awesome array of plugins.

There it is. Hope this helps somebody.

That still won’t work when you have any nodes like blockquotes, lists, images, hard breaks, or horizontal rules in your document.
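To make that concrete, here is a small model of the position counting, using plain objects instead of real ProseMirror nodes (the `text`, `leaf`, `node`, and `flatten` helpers are assumptions for illustration only). It follows the documented indexing rules: entering or leaving a non-leaf node counts as one token, a leaf node counts as one, and so does each character. The flattened string only gets a separator between textblocks, so every wrapper or leaf node adds positions that the string knows nothing about.

```javascript
// Plain-object stand-ins for ProseMirror nodes (illustration only, not the real classes).
const text = (value) => ({ type: "text", value });
const leaf = (type) => ({ type, leaf: true }); // e.g. hard_break, image, horizontal_rule
const node = (type, ...children) => ({ type, children });
const TEXTBLOCKS = new Set(["paragraph", "heading"]);

// Flatten the model to a string and record the document position of every
// character. Separator characters get null since they match no position.
function flatten(doc, separator = "\n\n") {
  let out = "";
  const positions = [];
  let pos = 0;
  let seenTextblock = false;
  const visit = (n, isRoot) => {
    if (n.type === "text") {
      for (let i = 0; i < n.value.length; i++) {
        out += n.value[i];
        positions.push(pos++);
      }
      return;
    }
    if (n.leaf) {
      pos += 1; // a leaf takes one position and contributes no text
      return;
    }
    if (TEXTBLOCKS.has(n.type)) {
      if (seenTextblock) {
        for (let i = 0; i < separator.length; i++) {
          out += separator[i];
          positions.push(null);
        }
      }
      seenTextblock = true;
    }
    if (!isRoot) pos += 1; // opening token
    n.children.forEach((c) => visit(c, false));
    if (!isRoot) pos += 1; // closing token
  };
  visit(doc, true);
  return { text: out, positions };
}

const quoted = flatten(
  node("doc", node("paragraph", text("ab")), node("blockquote", node("paragraph", text("cd"))))
);
// "c" is at string offset 4, but quoted.positions[4] is 6, not 4 + 1.
console.log(quoted.text.indexOf("c"), quoted.positions[4]);
```

The same lookup table is also one way out: if you record the position of every character while producing the text, you can translate offsets back no matter how the document is nested.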

Oh, dammit! I thought I had this figured out. I only handle pure text for now, which is why I haven’t run into this issue, but obviously I’d like to handle such cases as well. I will keep thinking about a better way to handle this. If you have any pointers…