Special handling of UTF8 needed?

peterkruty · November 4, 2020, 11:29am

I have a short editor content:

"abc😃Bryce 123456"

It contains a 4-byte UTF-8 character. doc.content.size returns 19, but shouldn't it be 18?

I also want to add a mark from position 5 to 10. When I do that, the mark starts in the middle of that 4-byte character, like this:

abc��Bryce 123456

Am I doing something wrong in handling UTF-8? I cannot find anything in the documentation or the GitHub issues.

marijn · November 4, 2020, 1:39pm

All offsets and text lengths in ProseMirror are counted in UTF-16 code units, by design (mostly because it would be painfully expensive to do otherwise). So yes, not hitting the middle of surrogate pairs (or, in most circumstances, multi-codepoint glyphs) when doing something like inserting content is the responsibility of the caller. You’ll usually work with selection or node positions anyway, which should already be in valid places, but if you’re computing positions otherwise you have to be somewhat careful about this.
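
For illustration, here is a minimal sketch in plain JavaScript of what counting in UTF-16 code units means for the string from the question, and one way a caller might avoid landing inside a surrogate pair. The helper snapToCodePointBoundary is a hypothetical name, not part of ProseMirror, and it works on offsets within a text string rather than on document positions. Assuming the text sits in a single top-level paragraph, the paragraph's opening token takes one position, so document position 5 would correspond to text offset 4, which is between the two halves of the emoji. The helper only guards surrogate pairs; multi-codepoint glyphs would need grapheme segmentation on top.

```javascript
// Hypothetical helper (not a ProseMirror API): if `offset` falls between the
// two halves of a surrogate pair in `text`, move it back by one code unit.
function snapToCodePointBoundary(text, offset) {
  if (offset <= 0 || offset >= text.length) return offset;
  const prev = text.charCodeAt(offset - 1);
  const next = text.charCodeAt(offset);
  const prevIsHigh = prev >= 0xd800 && prev <= 0xdbff; // high (lead) surrogate
  const nextIsLow = next >= 0xdc00 && next <= 0xdfff;  // low (trail) surrogate
  return prevIsHigh && nextIsLow ? offset - 1 : offset;
}

const text = "abc\u{1F603}Bryce 123456"; // same content as in the question

console.log(text.length);      // 17 UTF-16 code units
console.log([...text].length); // 16 code points

console.log(snapToCodePointBoundary(text, 4)); // 3 (offset 4 is mid-emoji)
console.log(snapToCodePointBoundary(text, 5)); // 5 (already on a boundary)
```

Under the same single-paragraph assumption, the reported doc.content.size of 19 would be these 17 code units plus the paragraph's opening and closing tokens.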

peterkruty · November 5, 2020, 10:09am

I see. Thanks @marijn. I thought I would try to encode those characters as HTML entities on input:

text.replaceAll(/[\uD800-\uDFFF]./g, m => he.encode(m))

But it seems ProseMirror is converting them back to Unicode :). Gee, not sure how to handle this. For now I will just remove them from the input.

marijn · November 5, 2020, 10:22am

ProseMirror doesn’t encode or decode anything; it just uses JavaScript’s native UTF-16 string encoding.
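
As a small sketch of what that native encoding looks like, the snippet below uses only standard JavaScript (no ProseMirror calls) to inspect the emoji from the question. The variable name emoji is just for illustration.

```javascript
const emoji = "\u{1F603}"; // the emoji from the question, U+1F603

// The same string written as its two UTF-16 code units (a surrogate pair).
console.log(emoji === "\uD83D\uDE03"); // true

console.log(emoji.length);                      // 2 code units
console.log(emoji.codePointAt(0).toString(16)); // "1f603"

// The "4 bytes" from the question only appear once the string is
// serialized as UTF-8; in memory the string is two 16-bit units.
console.log(new TextEncoder().encode(emoji).length); // 4
```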

peterkruty · November 5, 2020, 10:33am

I see I was too quick in my judgement; I have not investigated how it turns back into Unicode. Thanks for correcting me.