Slicing with word boundaries

Mukhammadsaid19 · November 26, 2022, 4:03pm

Hello there! I’m trying to create slices with “complete words” at both ends. I want to detect what has changed using step maps, and then slice the changed part plus a minimum of 10 characters before and after it. However, the slice boundaries sometimes fall inside a word, and the following scenario happens:

Source: I went to the school only to discover it was closed on holidays.

Let’s say the changed word is “the”. Taking a window of 10 characters on each side yields:

I went to the school on

However, I need the slices to start and end with whole words, not partial words. The expected output would then be:

I went to the school only

One possible solution is to use the resolve method (and find the start or the end of the word with a while loop), but is there a cleaner way of doing that? Perhaps there’s a way to linearly traverse the node with words as indices instead of characters? Thanks.

marijn · November 26, 2022, 8:48pm

No, there’s no word-based iteration in the library, you’ll have to implement this by looking at the nodes.

Mukhammadsaid19 · November 27, 2022, 7:00am

Thanks for the response. Could you suggest any ideas on how to start with this?

Mukhammadsaid19 · November 28, 2022, 9:47am

Ok. Seems like I figured it out. Providing the solution in case somebody needs it.


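import { EditorState } from "prosemirror-state";

// Returns how many positions to subtract from `pos` so that the slice starts
// at a word boundary, `tokenCount` whitespace-separated tokens earlier.
// The scan never leaves the block that contains `pos`.
// It walks across adjacent inline nodes, so a word that is half bold and
// half italic (two text nodes) still counts as one word.
function getOffsetOfTokensBefore(
  state: EditorState,
  pos: number,
  tokenCount: number
): number {
  // First position inside the parent block; the scan stops here.
  const blockStart = state.doc.resolve(pos).start();
  let offset = 0;

  while (tokenCount > 0 && pos - offset > blockStart) {
    // For a position inside a text node, nodeBefore is only the part of
    // that node that precedes the position.
    const node = state.doc.resolve(pos - offset).nodeBefore;
    if (!node || !node.isInline) break;

    if (!node.isText) {
      // Non-text inline node (image, hard break, ...): step over it whole.
      offset += node.nodeSize;
      continue;
    }

    const text = node.text ?? "";
    let idx = text.length - 1; // last character before the current position
    while (idx >= 0) {
      // Each whitespace character ends one token; stop right after the
      // last one, so the whitespace itself is not part of the offset.
      if (/\s/.test(text[idx])) {
        tokenCount--;
        if (tokenCount === 0) break;
      }
      offset++;
      idx--;
    }
  }
  return offset;
}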
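
For comparison, the same boundary-snapping idea can be sketched on a plain string, using the sentence from the first post. This is only an illustration: `expandToWordBoundaries` is a hypothetical helper, not a ProseMirror API, and it assumes that whitespace is the only word separator.

```typescript
// Hypothetical helper, not part of ProseMirror: widen [from, to) in a plain
// string until neither end sits in the middle of a word.
function expandToWordBoundaries(
  text: string,
  from: number,
  to: number
): { from: number; to: number } {
  // A position is inside a word when the characters on both sides of it
  // exist and are non-whitespace.
  const insideWord = (p: number): boolean =>
    p > 0 && p < text.length && !/\s/.test(text[p - 1]) && !/\s/.test(text[p]);

  // Clamp the requested range to the string first.
  let start = Math.max(0, Math.min(from, text.length));
  let end = Math.max(start, Math.min(to, text.length));

  while (insideWord(start)) start--; // grow left to the start of the word
  while (insideWord(end)) end++; // grow right to the end of the word
  return { from: start, to: end };
}

const source =
  "I went to the school only to discover it was closed on holidays.";
const changedFrom = source.indexOf("the");
const changedTo = changedFrom + "the".length;
const margin = 10; // minimum number of characters kept on each side

const windowed = expandToWordBoundaries(
  source,
  changedFrom - margin,
  changedTo + margin
);
console.log(source.slice(windowed.from, windowed.to));
// prints "I went to the school only"
```

In a document, the string would come from the text of a single block, and the resulting indices would still need to be mapped back to document positions, which is what the node-walking function above does directly.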