Saving the content in NFC Unicode normalization form

intaek-h · July 10, 2023, 4:17am

I built an editor for a Korean website. It worked well until some of the creators complained that certain letters were getting broken (they look something like this: ÁÂ¿î).

I suspect the problem comes from not normalizing the saved content into a single form, either NFC or NFD.

Currently, the content saved to the DB goes through the following process.

import { DOMSerializer } from 'prosemirror-model'

const serializer = DOMSerializer.fromSchema(mySchema)

// Serialize the ProseMirror document to an HTML string.
const getHTMLStringFromContent = (content) => {
  if (content.doc == null) return null
  const wrap = document.createElement('div')
  const contentFragment = serializer.serializeFragment(content.doc)
  wrap.append(contentFragment)
  const html = wrap.innerHTML
  wrap.remove()
  return html
}

const formData = {}

formData.Content = getHTMLStringFromContent(this.contentJson) || ''
// ...add other meta data...

// and upload it.

My hypothesis is that if I normalize all content to NFC right before uploading, the Unicode breakage will go away.

I want to ask the community whether my hypothesis is correct, and if so, what the recommended way to normalize the content is.

Would the code below work?

const getHTMLStringFromContent = (content) => {
  if (content.doc == null) return null
  const wrap = document.createElement('div')
  const contentFragment = serializer.serializeFragment(content.doc)
  wrap.append(contentFragment)
  const html = wrap.innerHTML
  wrap.remove()
  return html.normalize('NFC') // normalizing it here
}
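
For reference, here is a minimal sketch of what NFC normalization actually changes for Hangul text. The helper name `toNFC` and the sample syllable are illustrative assumptions, not part of the editor code above. It only shows the difference between the composed and decomposed forms of the same text.

```javascript
// Hypothetical helper: normalize a serialized HTML string to NFC.
const toNFC = (html) => html.normalize('NFC')

// One precomposed Hangul syllable (U+D55C).
const composed = '한'

// NFD splits it into three conjoining jamo: U+1112, U+1161, U+11AB.
const decomposed = composed.normalize('NFD')

console.log(composed.length)                 // 1
console.log(decomposed.length)               // 3
console.log(composed === decomposed)         // false: different code points
console.log(toNFC(decomposed) === composed)  // true: NFC recomposes the syllable
```

In both forms the text is still valid Hangul, so normalization matters mostly for comparing, searching, or rendering text that arrives in a mix of forms.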

marijn · July 10, 2023, 5:52am

That looks like the content was decoded with a different encoding than it was encoded with. Normalization won’t help there. You’ll have to review the encodings you use when sending and receiving text to and from the client, and whether the HTTP requests have their encoding set correctly so that the browser decodes and encodes in the right way (the recommended approach is to use UTF-8 in all requests).
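
A minimal sketch of the kind of mismatch described above, assuming text is encoded as UTF-8 and then wrongly decoded as ISO-8859-1 (Latin-1). The helper names `decodeLatin1` and `encodeLatin1` are hypothetical, and which pair of encodings is actually involved in this thread is not known. The sketch shows that `normalize('NFC')` leaves the garbled string untouched, while reversing the wrong decoding step can recover the text as long as no bytes were lost.

```javascript
// Hypothetical helper: decode bytes as ISO-8859-1, where every byte maps to
// the code point with the same numeric value.
const decodeLatin1 = (bytes) =>
  Array.from(bytes, (b) => String.fromCharCode(b)).join('')

// Hypothetical helper: the inverse, valid only for code points below 256.
const encodeLatin1 = (text) =>
  Uint8Array.from(text, (ch) => ch.charCodeAt(0))

const original = '한글'

// Correct encoding (UTF-8), wrong decoding (Latin-1): mojibake.
const utf8Bytes = new TextEncoder().encode(original)
const garbled = decodeLatin1(utf8Bytes)

// Normalization works on code points, so it cannot undo a wrong decoding.
const normalized = garbled.normalize('NFC')
console.log(normalized === garbled)  // true
console.log(normalized === original) // false

// Undoing the wrong step restores the text.
const recovered = new TextDecoder('utf-8').decode(encodeLatin1(garbled))
console.log(recovered === original)  // true

// The sample from the first post is what these four bytes look like when read
// as Latin-1. They fall in the EUC-KR Hangul byte range, but that is only a
// guess about the cause here.
console.log(decodeLatin1([0xc1, 0xc2, 0xbf, 0xee])) // ÁÂ¿î
```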

intaek-h · July 10, 2023, 8:50am

Thank you for your idea. However, I found that the broken letters are not just rendered as broken; they are saved as broken letters. That means I can’t fix it by adding things like utf-8 to the charset in the request header.

스크린샷 2023-07-10 오후 5.46.36