Parsing non-standard nested (un)ordered list HTML

bZichett · March 21, 2016, 6:28pm

Hi All,

I’m running into an issue parsing non-standard nested lists from HTML into the PM model (see https://www.w3.org/wiki/HTML_lists#Nesting_lists ). Parsing works fine with the standard form, where the LI tag wraps the nested list, but not with the form shown below.

pm.setContent('<ul><li><p>one</p></li><ul><li><p>two</p></li></ul><li><p>three</p></li></ul>', 'html')

Renders like this (same as the snapshot):

one
- two
three

This is kind of interesting, because both forms render identically in the browser (normally like the Markdown conversion below), though I’d understand why it’s inconvenient for ProseMirror to deal with this, given the tree structure it uses.

one

two

three

And the following is a spec I added to ProseMirror’s tests:

t("unordered_list_nest_nonstandard",
  doc(ul(li(p("one")), ul(li(p("two"))), li(p("three")))),
  "<ul>" +
    "<li><p>one</p></li>" +   // <= nonstandard form has </li> here.
      "<ul>" +
        "<li><p>two</p></li>" +
      "</ul>" +
   "" +                       // <= Where the </li> *should* be according to the standard.
    "<li><p>three</p></li>" +
  "</ul>")

Fails with the following:

dom_unordered_list_nest_nonstandard: Failure: types differ at doc.0.1
in doc(bullet_list(list_item(paragraph("one")), bullet_list(list_item(paragraph("two"))), list_item(paragraph("three"))))
vs doc(bullet_list(list_item(paragraph("one")), list_item(bullet_list(list_item(paragraph("two")))),   list_item(paragraph("three"))))

The reason I’m posting this is that I have a ton of old HTML from another text editor that sadly used the non-standard form, so I’d have to figure out how to parse and fix all of those cases. (Others will also run into this issue if they are scraping HTML from the web.)

So I’m wondering how difficult it would be for ProseMirror to handle this type of nested structure appropriately?

marijn · March 21, 2016, 8:50pm

ProseMirror uses the browser’s HTML parser to parse HTML into a DOM tree before converting it, so, since HTML5 specifies that that inner <ul> will get an implicit <li>, you’re out of luck here, and will have to use some external tool to clean up your HTML.

bZichett · March 22, 2016, 12:53am

Fair enough. Thanks for letting me know. I’ll start on that now…

bZichett · April 7, 2016, 6:17pm

While testing the transformation of list item content pasted from Google Docs directly into ProseMirror, I’m running into similar issues.

Google Docs Snapshot

http://imgur.com/uSy6GQb

ProseMirror

http://imgur.com/iBFq6oq

I could hook into the transformPasted event and try to undo the incorrect node setup, but I’m wondering whether that is the wrong approach and there is instead something fundamentally broken about the parsing of bulleted lists, as I don’t believe Google Docs uses the incorrect ul/li/ol structure. Other editors handle the pasted content correctly (e.g. CKEditor, textAngular, Redactor).

marijn · April 8, 2016, 1:43pm

On closer look, it appears that I got this wrong. The standard, as far as I can see, disallows lists as direct children of other lists, but MDN states the content of a list can be “zero or more <li> elements, eventually mixed with <ol> and <ul> elements.”, where I suppose “eventually” means “if your browser feels like it”. Browsers do seem to allow such structure. I’ve pushed this patch, which adds a somewhat ugly kludge to ‘normalize’ this kind of input.
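For readers cleaning up such HTML themselves, the kind of normalization described above can be sketched as follows. This is a minimal illustration, not the patch itself: it runs on a hypothetical plain-object tree (`{tag, children}`) rather than on real DOM nodes, and the names `normalizeList`, `toHTML`, and `el` are made up for the example. The idea is that a list appearing as a direct child of another list is moved into the list item that precedes it.

```javascript
// Hypothetical minimal tree: { tag: string, children: Array<node | string> }.
// This is an assumption for illustration, not ProseMirror's or the DOM's API.
const LIST_TAGS = new Set(["ul", "ol"]);

// Returns a normalized copy of the tree: any list that is a direct child of
// another list is appended to the preceding <li>, if there is one.
function normalizeList(node) {
  if (typeof node === "string") return node;
  // Normalize bottom-up, building fresh objects so the input is not mutated.
  const children = node.children.map(normalizeList);
  if (!LIST_TAGS.has(node.tag)) return { tag: node.tag, children };

  const result = [];
  let prevItem = null;
  for (const child of children) {
    const tag = typeof child === "string" ? null : child.tag;
    if (tag && LIST_TAGS.has(tag) && prevItem) {
      // Stray nested list: make it a child of the previous list item.
      prevItem.children.push(child);
    } else {
      if (tag === "li") prevItem = child;
      else if (tag) prevItem = null; // some other element breaks the chain
      result.push(child);
    }
  }
  return { tag: node.tag, children: result };
}

// Serialize the tree back to an HTML string (no attributes, no escaping).
function toHTML(node) {
  if (typeof node === "string") return node;
  return `<${node.tag}>${node.children.map(toHTML).join("")}</${node.tag}>`;
}

// Small helper for building trees.
const el = (tag, ...children) => ({ tag, children });

// The non-standard input from the first post.
const input = el("ul",
  el("li", el("p", "one")),
  el("ul", el("li", el("p", "two"))),
  el("li", el("p", "three")));

const output = toHTML(normalizeList(input));
// output: <ul><li><p>one</p><ul><li><p>two</p></li></ul></li><li><p>three</p></li></ul>
```

A nested list with no preceding list item is left where it is, since there is nothing to attach it to. The same loop could be written against real DOM nodes before handing the content to the editor.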

katepol · July 8, 2020, 12:47pm

Hi!

I ran into this issue when pasting a nested list from Google Docs into the Markdown editor (e.g. here https://prosemirror.net/examples/markdown/).

The text in Google Docs is

And in markdown it turns to

I checked HTML provided by Google Docs, here it is:

<meta http-equiv="content-type" content="text/html; charset=utf-8"><meta charset="utf-8"><b style="font-weight:normal;" id="docs-internal-guid-53910d9d-7fff-69fd-5645-b3172db1606b"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Unordered nested list</span></p><ul style="margin-top:0;margin-bottom:0;"><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Depth 1</span></p></li><ul style="margin-top:0;margin-bottom:0;"><li dir="ltr" style="list-style-type:circle;font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;"><p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Depth 2</span></p></li></ul></ul></b>

This HTML has the same issue with ul/li order.

Is this behavior of the Markdown editor expected?

marijn · July 8, 2020, 2:06pm

Yes. Since the nested list isn’t a child of the first list item in the HTML, ProseMirror’s parser won’t turn it into a child of that item in the ProseMirror document. If your schema allows list items to start with lists, you’ll get a somewhat different structure for this (<ul><li>Depth 1</li><li><ul><li>Depth 2</li></ul></li></ul>), but that still doesn’t really reflect what you’re looking for here.

marijn · July 8, 2020, 2:12pm

Oh, nonsense, there was in fact a regression in prosemirror-model 1.10.0 that broke this. I’ve published 1.10.1 which should do better.

katepol · July 8, 2020, 2:48pm

thank you, marijn!!!