Special handling of UTF8 needed?

I have some short editor content:

"<p>abc😃Bryce 123456</p>"

It contains a 4-byte UTF-8 character. doc.content.size returns 19, but shouldn't it be 18?

I also want to add a mark from position 5 to 10. When I do that, the mark starts in the middle of that 4-byte character, like this:

<p>abc�<span class="tag-highlight" style="background-color: person-tag">�Bryc</span>e 123456</p>

Am I doing something wrong in handling UTF-8? I cannot find anything in the documentation or in the GitHub issues.

All offsets and text lengths in ProseMirror are counted in UTF-16 code units, by design (mostly because it would be painfully expensive to do otherwise). So yes, not hitting the middle of surrogate pairs (or, in most circumstances, multi-codepoint glyphs) when doing something like inserting content is the responsibility of the caller. You’ll usually work with selection or node positions anyway, which should already be in valid places, but if you’re computing positions otherwise you have to be somewhat careful about this.
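As a rough sketch of what "being careful" could look like when positions are computed by hand: the helpers below (`splitsSurrogatePair` and `snapToCodePoint` are hypothetical names, not part of ProseMirror) check whether a text offset falls between the two halves of a surrogate pair and nudge it to the nearest boundary. This assumes the paragraph is the first node in the document, so document position = text offset + 1; position 5 in the example above is then text offset 4, which is the second half of the emoji. It only handles surrogate pairs, not multi-codepoint glyphs.

```javascript
// True if `offset` falls between a high and a low surrogate in `text`.
function splitsSurrogatePair(text, offset) {
  if (offset <= 0 || offset >= text.length) return false;
  const before = text.charCodeAt(offset - 1);
  const after = text.charCodeAt(offset);
  return (
    before >= 0xd800 && before <= 0xdbff && // high surrogate
    after >= 0xdc00 && after <= 0xdfff      // low surrogate
  );
}

// Hypothetical helper: moves an offset off the middle of a surrogate pair.
// bias < 0 snaps to the start of the pair, otherwise to just past its end.
function snapToCodePoint(text, offset, bias = -1) {
  if (!splitsSurrogatePair(text, offset)) return offset;
  return bias < 0 ? offset - 1 : offset + 1;
}

const text = "abc😃Bryce 123456";
console.log(text.length);                 // 17 UTF-16 code units
console.log([...text].length);            // 16 code points
console.log(snapToCodePoint(text, 4));    // 3 (start of the emoji)
console.log(snapToCodePoint(text, 4, 1)); // 5 (just after the emoji)
```

The 17 code units plus the paragraph's opening and closing tokens give the reported size of 19. Adding 1 to a snapped offset gives the document position to pass on, under the single-paragraph assumption stated above.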

I see. Thanks @marijn. I thought I would try encoding those characters as HTML entities on input:

text.replaceAll(/[\uD800-\uDFFF]./g, m => he.encode(m))

But it seems ProseMirror is encoding them back to Unicode :). Not sure how to handle this. For now I will just remove them from the input.

ProseMirror doesn’t encode or decode anything. It just uses JavaScript’s native UTF-16 string encoding.
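To illustrate what the native encoding means here: in JavaScript, a character outside the Basic Multilingual Plane is always stored as two code units, however it was originally written. A numeric HTML entity is decoded by whatever parses the HTML, so the text that ends up in the document is presumably the same two-unit string (that is an assumption about the input path, not something confirmed in this thread). A minimal sketch using only standard string methods:

```javascript
// Build the character from its code point rather than from an HTML entity.
const smiley = String.fromCodePoint(0x1f603);

console.log(smiley.length);                      // 2 (UTF-16 code units)
console.log(smiley.codePointAt(0).toString(16)); // "1f603"
console.log(smiley === "\uD83D\uDE03");          // true: a surrogate pair

// Iterating by code point keeps the pair together; slicing by index does not.
console.log([...smiley].length);                 // 1
console.log(smiley.slice(0, 1) === "\uD83D");    // true: a lone high surrogate
```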

I see, I was too quick to judge; I have not investigated how it turns back into Unicode. Thanks for correcting me.