Library to clean up Microsoft Word HTML?

orestis · September 18, 2020, 9:46am

I just spent the better part of a morning trying to figure out if there’s a library out there that works in the browser and can take whatever crap HTML Microsoft Word outputs and convert it into something that has some semantic semblance.

For example, Word outputs lists as paragraphs that carry a bunch of inline styles, with the bullet itself included (wrapped in some conditional comments). To detect ordered lists you then have to check the “bullet” text itself. The better way is to parse the styles of the document and see whether the list has something like mso-level-number-format:bullet;
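As a rough illustration of that style-based check, here is a minimal sketch in TypeScript. It assumes the document's style text contains rules shaped like `@list l2:level1 { ... }`, which is what Word typically emits but is not confirmed anywhere in this thread. The function name `listKindFromStyles` is hypothetical.

```typescript
type ListKind = "ul" | "ol";

// Hypothetical helper: look up one list level in the document's <style> text.
// Assumption: rules look like "@list l2:level1 { mso-level-number-format:bullet; ... }".
function listKindFromStyles(css: string, listId: string, level: number): ListKind | null {
  // Escape the id so it is safe to embed in a regular expression.
  const id = listId.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const rule = new RegExp(`@list\\s+${id}:level${level}\\s*\\{([^}]*)\\}`, "i");
  const match = rule.exec(css);
  if (!match) return null; // no rule found for this list level, so the kind is unknown
  // Assumption: any number format other than "bullet" is treated as ordered.
  return /mso-level-number-format\s*:\s*bullet/i.test(match[1] ?? "") ? "ul" : "ol";
}
```

Returning `null` when no rule is found keeps the caller free to fall back to inspecting the marker text instead.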

This madness probably extends to other areas. Various turn-key editors (e.g. CKEditor) have some functionality to handle all this for you, but they’re very editor-specific. It would be nice if there was a library that could do this using only a plain DOMParser.

I’d think that this is something that ProseMirror users would have to deal with all the time. Is there any established solution that I’m not seeing?

marijn · September 18, 2020, 10:31am

One issue is that some of the rules for such a conversion would be schema-specific. But I guess even just having a generic Word-garbage-to-clean-HTML converter (could be separate from ProseMirror, though a quick web search didn’t turn anything up) and installing that as a pasted HTML transformer, would help a lot.
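To make the "pasted HTML transformer" idea concrete, here is a minimal sketch in TypeScript. ProseMirror's `transformPastedHTML` editor prop receives the pasted HTML as a string and returns the string to parse. The cleaning function `cleanWordHtml` and the `wordPasteProps` object are hypothetical names, and the two regular expressions only cover the list-marker sections and `<o:p>` placeholders, not Word output in general.

```typescript
// Hypothetical string-level preprocessor for pasted Word HTML.
function cleanWordHtml(html: string): string {
  return html
    // Drop the rendered list marker that Word wraps in conditional sections.
    // Handles both "<![if ...]>" and the comment form "<!--[if ...]-->".
    .replace(/<!(?:--)?\[if !supportLists\](?:--)?>[\s\S]*?<!(?:--)?\[endif\](?:--)?>/gi, "")
    // Remove Office namespace placeholders such as <o:p></o:p>.
    .replace(/<\/?o:p[^>]*>/gi, "");
}

// This object could be passed as `props` in a ProseMirror plugin spec.
// It is kept free of imports here so the sketch stays self-contained.
const wordPasteProps = {
  transformPastedHTML: (html: string): string => cleanWordHtml(html),
};
```

A real converter would probably read the marker to decide the list type before discarding it. This version simply throws it away.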

Has anyone ever seen such a library? There’s the paste-from-office plugin for CKEditor, which might contain a lot of the relevant logic, but that’s not open source.

orestis · September 18, 2020, 10:47am

Yeah, it’s bound to be schema-specific, and actually ProseMirror already does a lot of the cleanup by throwing unknown stuff out. So even without a plugin, you get less-styled text instead of garbage, which is a good starting point.

I wonder if ProseMirror parsing rules are powerful enough to handle some of this. Here’s an example ordered list (there’s no wrapping ol, ul or anything similar):

<p class=MsoListParagraphCxSpFirst style='text-indent:-18.0pt;mso-list:l2 level1 lfo2'><![if !supportLists]><span
lang=EN-US style='mso-bidi-font-family:Calibri;mso-bidi-theme-font:minor-latin;
mso-ansi-language:EN-US'><span style='mso-list:Ignore'>1.<span
style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp; </span></span></span><![endif]><span
lang=EN-US style='mso-ansi-language:EN-US'>An ordered list<o:p></o:p></span></p>

<p class=MsoListParagraphCxSpMiddle style='text-indent:-18.0pt;mso-list:l2 level1 lfo2'><![if !supportLists]><span
lang=EN-US style='mso-bidi-font-family:Calibri;mso-bidi-theme-font:minor-latin;
mso-ansi-language:EN-US'><span style='mso-list:Ignore'>2.<span
style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp; </span></span></span><![endif]><span
lang=EN-US style='mso-ansi-language:EN-US'>With <u>some underlined</u> items<o:p></o:p></span></p>

<p class=MsoListParagraphCxSpLast style='text-indent:-18.0pt;mso-list:l2 level1 lfo2'><![if !supportLists]><b><i><span
lang=EN-US style='mso-bidi-font-family:Calibri;mso-bidi-theme-font:minor-latin;
mso-ansi-language:EN-US'><span style='mso-list:Ignore'>3.<span
style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp; </span></span></span></i></b><![endif]><b><i><u><span
lang=EN-US style='mso-ansi-language:EN-US'>Is also nice<o:p></o:p></span></u></i></b></p>

The MsoListParagraphCxSp{First,Middle,Last} classes seem to be reliable, so they can be used to wrap the entire list in a ul or ol.
The content between <![if !supportLists]> and <![endif]> has to be ignored, but can be used to detect the list type (ul or ol, albeit in a hacky way).
The mso-list:l2 level1 lfo2 style can be used to detect the indent level (level1, level2, level3, etc.).
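The three observations above can be sketched as a small preprocessor in TypeScript. This is only an illustration under stated assumptions: the paragraphs are assumed to have already been extracted into plain objects (for example with a DOMParser), all type and function names are hypothetical, the marker test is the hacky heuristic mentioned above, and nesting by level is not handled.

```typescript
// Hypothetical shape for one Word paragraph after extraction from the DOM.
interface WordParagraph {
  className: string; // e.g. "MsoListParagraphCxSpFirst"
  style: string;     // e.g. "text-indent:-18.0pt;mso-list:l2 level1 lfo2"
  marker: string;    // text inside the mso-list:Ignore span, e.g. "1."
  html: string;      // inline content left after removing the marker
}

// Extract the list id and indent level from the mso-list declaration.
function parseMsoList(style: string): { listId: string; level: number } | null {
  const m = /mso-list\s*:\s*(l\d+)\s+level(\d+)/i.exec(style);
  return m ? { listId: m[1] ?? "", level: Number(m[2]) } : null;
}

// Hacky heuristic: markers such as "1.", "a)" or "iv." suggest an ordered list.
function isOrderedMarker(marker: string): boolean {
  return /^\s*(?:\d+|[a-z]+)[.)]/i.test(marker);
}

// Group consecutive list paragraphs, using the First/Last class suffixes
// as list boundaries. Non-list paragraphs are skipped and end the current group.
function groupLists(paragraphs: WordParagraph[]): WordParagraph[][] {
  const lists: WordParagraph[][] = [];
  let current: WordParagraph[] | null = null;
  for (const p of paragraphs) {
    if (!/MsoListParagraph/.test(p.className)) {
      current = null;
      continue;
    }
    if (current === null || /CxSpFirst$/.test(p.className)) {
      current = [];
      lists.push(current);
    }
    current.push(p);
    if (/CxSpLast$/.test(p.className)) current = null;
  }
  return lists;
}

// Render one group as a flat list. The first marker decides between ol and ul.
function toListHtml(items: WordParagraph[]): string {
  const tag = isOrderedMarker(items[0]?.marker ?? "") ? "ol" : "ul";
  const body = items.map((p) => `<li>${p.html}</li>`).join("");
  return `<${tag}>${body}</${tag}>`;
}
```

Working on plain objects rather than DOM nodes keeps the grouping logic independent of any editor schema. The schema-specific part would then be limited to how the resulting ol/ul markup is parsed.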

marijn · September 18, 2020, 11:25am

No, for much of it you’d need some kind of preprocessor.

gethari · September 30, 2024, 11:07am

@orestis, did you by any chance find a workaround for this?

prosed · October 11, 2024, 12:26am

I’m interested in implementing this, though not with any sort of urgency, but the biggest problem is my lack of a Microsoft Word license.

If people could just share some pastebins or gists with problem content and an associated screenshot of what that content looks like, that would dramatically lower the bar for someone to pick this up and start running with it right away. Rather than first needing a license, then needing to create some document, then getting the HTML, and finally cleaning it, they could jump straight to the cleaning part.

gethari · October 22, 2024, 10:05am

Office for Web is not FREE to use