Having trouble parsing an anchor element as a node, not a mark

lowercasename · August 25, 2020, 2:44pm

Hi there! I’m trying to leverage a ProseMirror schema to convert HTML into JSON, but I’ve got stuck on getting ProseMirror to recognise an <a> tag as a top-level Node element.

Here’s my input HTML:

<a class="link-preview-container" href="https://example.com" rel="noopener noreferrer nofollow" target="_blank">
  <img class="link-preview-image" src="https://example.com/image.jpg">
  <div class="link-preview-text-container">
    <div class="link-preview-title">Link Title</div>
    <div class="link-preview-description">A link description</div>
    <div class="link-preview-domain">example.com</div>
  </div>
</a>

Here’s my schema for this node (note that I haven’t written the getAttrs function yet, so I know the attributes won’t be filled in in the resulting JSON, but the node itself should at least be generated):

sweet_link_preview: {
      group: "block",
      attrs: {
        "url": { default: null },
        "title": { default: null },
        "description": { default: null },
        "image": { default: null },
        "domain": { default: null }
      },
      parseDOM: [{ tag: "a.link-preview-container" }],
      toDOM: node =>
        ["a", { class: "link-preview-container", href: node.attrs.url, rel: "noopener noreferrer nofollow", target: "_blank" },
          ["img", { class: "link-preview-image", src: node.attrs.image }],
          ["div", { class: "link-preview-text-container" },
            ["div", { class: "link-preview-title" }, node.attrs.title],
            ["div", { class: "link-preview-description" }, node.attrs.description],
            ["div", { class: "link-preview-domain" }, node.attrs.domain]
          ]
        ]
    },

Here’s the actual output:

{
  "type": "doc",
  "content": [
    {
      "type": "paragraph",
      "content": [
        {
          "type": "text",
          "marks": [
            {
              "type": "link",
              "attrs": {
                "href": "https://example.com"
              }
            }
          ],
          "text": "Link Title"
        }
      ]
    },
    {
      "type": "paragraph",
      "content": [
        {
          "type": "text",
          "marks": [
            {
              "type": "link",
              "attrs": {
                "href": "https://example.com"
              }
            }
          ],
          "text": "A link description"
        }
      ]
    },
    {
      "type": "paragraph",
      "content": [
        {
          "type": "text",
          "marks": [
            {
              "type": "link",
              "attrs": {
                "href": "https://example.com"
              }
            }
          ],
          "text": "example.com"
        }
      ]
    }
  ]
}

And here’s the expected output I would like:

{
  "type": "doc",
  "content": [
    {
      "type": "sweet_link_preview",
      "attrs": {
        "url": "https://example.com",
        "title": "Link Title",
        "description": "A link description",
        "image": "https://example.com/image.jpg",
        "domain": "example.com"
      }
    }
  ]
}

Any help or pointers would be greatly appreciated! And let me know if you need to see more of my schema.

lowercasename · August 25, 2020, 2:55pm

With a minimal test case I’ve determined that it’s definitely the link rule that’s tripping the parser up here: with just the link mark disabled in the schema, the conversion works. There are some other complicating factors with other custom nodes I’ve made, but once I’ve got this working I can move on to those. For reference, here’s the link schema:

link: {
      attrs: {
        href: {},
      },
      parseDOM: [{
        tag: "a[href]",
        getAttrs(dom) {
          return { href: dom.getAttribute("href") }
        }
      }],
      inclusive: false,
      toDOM(node) {
        // Only href is declared in attrs; rel and target are fixed values.
        return ["a", { href: node.attrs.href, rel: "noopener noreferrer nofollow", target: "_blank" }, 0]
      }
    },

marijn · August 25, 2020, 3:01pm

You can’t have both a link mark and a link node, both targeting <a> tags, if you want unambiguous parsing of such tags. Sounds like you want to get rid of the mark?

lowercasename · August 25, 2020, 4:19pm

Ah, I see! I was hoping that it would end up being unambiguous because the node was targeting a more specific selector. Unfortunately, I need both, because the input text I’m dealing with has regular links within paragraphs, blockquotes, and lists, as well as these ‘link previews’, which are root-level blocks. This is a one-time operation (I’m converting a database to use ProseMirror), so perhaps I need to first parse the HTML and replace the <a.link-preview-container> tags with <sweet-link-container>, or whatever.

Do let me know if I’m obviously in the middle of an XY problem - the actual PM editor I’m using of course does use unambiguous elements in its schema, but I first need to get the old database into a JSON format PM recognises.

marijn · August 25, 2020, 7:08pm

Ah, I didn’t notice the extra class name before. So these are just different elements? I think if you add a priority: 60 property to the link preview’s parseDOM rule (the thing that also has the tag property), it’ll get tried before the mark’s rule.
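
For later readers, here is a minimal sketch of why the priority makes a difference. It is a simplified model, not ProseMirror's actual source. It assumes that rules are tried in descending priority, that a rule without a priority counts as 50 (as the ProseMirror reference describes), and that at equal priority mark rules are collected before node rules, which is consistent with the output in the question. The `orderRules` helper and the `name` field on the rule objects are hypothetical.

```javascript
// Simplified model of parse-rule ordering; NOT ProseMirror's implementation.
// Assumptions: higher priority is tried first, a missing priority counts
// as 50, and ties keep insertion order (mark rules inserted before node rules).
function orderRules(markRules, nodeRules) {
  const ordered = [];
  for (const rule of [...markRules, ...nodeRules]) {
    const priority = rule.priority ?? 50;
    // Skip past every already-placed rule whose priority is >= this one,
    // so the new rule lands before the first strictly lower priority.
    let i = 0;
    while (i < ordered.length && (ordered[i].priority ?? 50) >= priority) i++;
    ordered.splice(i, 0, rule);
  }
  return ordered.map(rule => rule.name);
}

const linkMark = { name: "link", tag: "a[href]" };
const previewNode = { name: "sweet_link_preview", tag: "a.link-preview-container" };

// Without a priority, the mark's rule is tried first and claims the <a>.
const defaultOrder = orderRules([linkMark], [previewNode]);

// With priority 60 on the node's rule, it is tried before the mark's rule.
const boostedOrder = orderRules([linkMark], [{ ...previewNode, priority: 60 }]);

console.log(defaultOrder, boostedOrder);
```

The point of the model is that selector specificity plays no part in the ordering: both rules match the preview's `<a>` element, so whichever rule is tried first wins.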

lowercasename · August 25, 2020, 8:34pm

Oh, this is what happens when you don’t read the docs diligently enough. I was carefully putting priority in the top-level node object, not inside parseDOM.

Thank you very much for the help!
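
For anyone landing here later, a sketch of where the priority ends up: on the rule object inside parseDOM, next to tag, rather than on the node spec itself. The getAttrs function is an assumption added for illustration (the question had not written one yet); it reads the class names from the HTML in the question. The `textOf` helper and the `sweetLinkPreviewSpec` variable name are hypothetical, and toDOM is left out because it is unchanged from the question.

```javascript
// Hypothetical helper: trimmed text of the first descendant matching
// the selector, or null when there is no such element.
const textOf = (dom, selector) => {
  const el = dom.querySelector(selector);
  return el ? el.textContent.trim() : null;
};

// Would be registered under the key `sweet_link_preview` in the schema's nodes.
const sweetLinkPreviewSpec = {
  group: "block",
  attrs: {
    url: { default: null },
    title: { default: null },
    description: { default: null },
    image: { default: null },
    domain: { default: null }
  },
  parseDOM: [{
    tag: "a.link-preview-container",
    // The priority belongs here, on the parse rule, not on the node spec.
    priority: 60,
    // Assumed implementation: pull the attributes out of the preview markup.
    getAttrs(dom) {
      const img = dom.querySelector("img.link-preview-image");
      return {
        url: dom.getAttribute("href"),
        title: textOf(dom, ".link-preview-title"),
        description: textOf(dom, ".link-preview-description"),
        image: img ? img.getAttribute("src") : null,
        domain: textOf(dom, ".link-preview-domain")
      };
    }
  }]
  // toDOM omitted: same as in the question.
};
```

The link mark's own rule stays as it is, so ordinary links inside paragraphs, blockquotes, and lists should still parse as marks.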