Serialize doc to HTML for only certain node types

sudars · September 17, 2022, 9:55pm

I’m trying to convert a ProseMirror doc to HTML, but for a destination that permits only a certain subset of HTML node types.

The resulting HTML can have <ul>, <ol>, and <strong>, for example, but not <h1> or <img>.

Everything else I would like to be text.

I can do this by hand, but it is clunky and seems error-prone. Is there a way to do this using built-in functionality? It seems like I could perhaps pass a whitelist of node types, but I don’t see such an option in the API.

Is there a way to do it?

bZichett · September 17, 2022, 10:07pm

Try using a customized DOMSerializer whose input params do not contain headings or images (or whatever your use case calls for).

Alternatively, you could create the DOMSerializer with the static method fromSchema if you want to start from your original schema and simply remove the heading (h#) and img entries from its nodes (or whatever else). See marijnh/orderedmap (persistent ordered mapping) on GitHub for how to adjust the schema definition this way. Just make sure you treat it as immutable, or start with a new instance of the Schema.

Considering what you are going for, you might be better off with the first option: just use new DOMSerializer(nodes, marks), where nodes and marks are restricted subsets of those in your original Schema. Both options result in the same final serializer, so choose whichever is easier.

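After instantiating the DOMSerializer, call serializer.serializeFragment(prosemirrorDoc.content). (Calling serializeNode on the doc itself would fail unless the top-level node type defines toDOM, which it usually does not.)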

I’m not entirely sure how the output would look, but I presume that Headings will be cast as paragraphs and images would be skipped.

marijn · September 18, 2022, 10:20am

Unless your schema is large and changing all the time, doing this by hand sounds like it would be quite straightforward and unproblematic.

sudars · September 23, 2022, 6:13pm

I tried the approach using a DOMSerializer with a schema that only supports a subset of nodes, but it seems that it gets into trouble with some custom nodes we have in the doc that aren’t present in this limited schema.

It throws at this.nodes[node.type.name](node), complaining that this.nodes[node.type.name] is not a function:

  serializeNodeInner(node, options = {}) {
    let {dom, contentDOM} =
        DOMSerializer.renderSpec(doc(options), this.nodes[node.type.name](node))
    if (contentDOM) {
      if (node.isLeaf)
        throw new RangeError("Content hole not allowed in a leaf node spec")
      if (options.onContent)
        options.onContent(node, contentDOM, options)
      else
        this.serializeFragment(node.content, options, contentDOM)
    }
    return dom
  }

I guess that makes sense, given that the node is missing. I thought it would do some sort of “I don’t know about this node, skip it.”
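One way to see up front which node types would trigger this error is to walk the document's JSON form and compare each type name against the serializer's node map. The following is only a sketch, assuming the JSON shape produced by `doc.toJSON()` (objects with a `type` and an optional `content` array); `findUnserializableTypes` is a hypothetical helper, not part of prosemirror-model.

```javascript
// Hypothetical helper (not a prosemirror-model API): lists the node type
// names in a ProseMirror JSON document that have no serializer function in a
// name -> toDOM map.
function findUnserializableTypes(docJson, nodeSerializers) {
  const missing = new Set();
  const visit = (node) => {
    if (typeof nodeSerializers[node.type] !== "function") {
      missing.add(node.type);
    }
    (node.content || []).forEach(visit);
  };
  // Start from the children: the top-level doc node normally has no toDOM.
  (docJson.content || []).forEach(visit);
  return [...missing];
}

// Made-up sample data for illustration.
const sampleDoc = {
  type: "doc",
  content: [
    { type: "heading", content: [{ type: "text", text: "Title" }] },
    { type: "paragraph", content: [{ type: "image" }] },
  ],
};
const limitedSerializers = {
  paragraph: () => ["p", 0],
  text: (node) => node.text,
};
console.log(findUnserializableTypes(sampleDoc, limitedSerializers));
// prints [ 'heading', 'image' ]
```

With a real serializer, the second argument could be its public `nodes` property.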

Do you happen to know what allows “Headings will be cast as paragraphs and images would be skipped” to happen? Are those globally set somehow to be cast to different nodes / omitted?

I could still go by hand. The thing I’m worried about there is handling things like nested bullet lists: recursing while keeping track of the state seems tricky.
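For what it's worth, nested lists need not involve much explicit state if the conversion is written as a plain recursive function, because the call stack already mirrors the document tree. Below is a minimal sketch under some assumptions: it works on the JSON form from `doc.toJSON()`, it uses prosemirror-schema-list style node names, and `TAGS_BY_TYPE`, `escapeHtml` and `jsonToHtmlSubset` are hypothetical names. Marks are ignored to keep it short.

```javascript
// Sketch only: serialize a ProseMirror JSON document to a restricted HTML
// subset. All names here are hypothetical, not library APIs.
const TAGS_BY_TYPE = {
  bullet_list: "ul",
  ordered_list: "ol",
  list_item: "li",
  paragraph: "p",
};

const escapeHtml = (text) =>
  text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");

function jsonToHtmlSubset(node) {
  if (node.type === "text") return escapeHtml(node.text);
  // Serialize the children first; each nested list is just another call.
  const inner = (node.content || []).map(jsonToHtmlSubset).join("");
  const tag = TAGS_BY_TYPE[node.type];
  // Node types outside the subset contribute only their children's output.
  return tag ? `<${tag}>${inner}</${tag}>` : inner;
}

// Made-up sample: a bullet list with one nested bullet list.
const nestedDoc = {
  type: "doc",
  content: [{
    type: "bullet_list",
    content: [{
      type: "list_item",
      content: [
        { type: "paragraph", content: [{ type: "text", text: "one" }] },
        { type: "bullet_list", content: [{
          type: "list_item",
          content: [{ type: "paragraph", content: [{ type: "text", text: "two" }] }],
        }] },
      ],
    }],
  }],
};
console.log(jsonToHtmlSubset(nestedDoc));
// prints <ul><li><p>one</p><ul><li><p>two</p></li></ul></li></ul>
```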

sudars · September 23, 2022, 7:41pm

Well that was a silly question. Headings do not seem to just be cast like I expected. When I add them I get the same error, that the node type can’t be found.

bZichett · September 25, 2022, 5:22pm

My first suggestion was not accurate since the library code is checking for the existence of the types in the schema. Sorry.

That being said, DOMSerializer may still work, but you’ll need a parallel schema definition (with no node types missing, since that breaks it) whose toDOM methods override the defaults.

Ex 1: Headings are rendered as regular paragraphs.
Ex 2: Images: you could output a paragraph for each image found and fill it with a text placeholder such as “Image X not converted” or just [Image].
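The idea above could be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe: `restrictNodeSerializers` is a hypothetical helper that takes a map of node name to toDOM function (the shape that `DOMSerializer.nodesFromSchema(schema)` returns), keeps the allowed entries, and swaps every other entry for a fallback so that no node type is left without a serializer.

```javascript
// Hypothetical helper (not part of prosemirror-model).
function restrictNodeSerializers(nodeSerializers, allowedNames) {
  const allowed = new Set(allowedNames);
  const restricted = {};
  for (const [name, toDOM] of Object.entries(nodeSerializers)) {
    if (allowed.has(name) || name === "text") {
      // Allowed nodes (and plain text) keep their original serializer.
      restricted[name] = toDOM;
    } else {
      restricted[name] = (node) => {
        // Nodes with content (e.g. headings) become paragraphs.
        if (!node.isLeaf) return ["p", 0];
        // Leaf nodes (e.g. images) become a text placeholder.
        const placeholder = `[${name}]`;
        return node.isInline ? placeholder : ["p", placeholder];
      };
    }
  }
  return restricted;
}

// Possible wiring with prosemirror-model (assumes `schema` and `doc` exist):
//   const { DOMSerializer } = require("prosemirror-model");
//   const nodes = restrictNodeSerializers(
//     DOMSerializer.nodesFromSchema(schema),
//     ["paragraph", "bullet_list", "ordered_list", "list_item", "hard_break"]
//   );
//   const serializer = new DOMSerializer(nodes, DOMSerializer.marksFromSchema(schema));
//   const fragment = serializer.serializeFragment(doc.content);
```

One known limitation of this fallback: a disallowed container that holds paragraphs (a blockquote, say) would come out as a `<p>` wrapping other `<p>` elements, so some post-processing may still be needed.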

@marijn, is this what you meant by “by hand”? It seems like DOMSerializer should work fine here, rather than processing the doc JSON manually; he just needs to adapt a schema from his base one, overriding toDOM, if I understand correctly.

> Well that was a silly question. Headings do not seem to just be cast like I expected. When I add them I get the same error, that the node type can’t be found.

Sounds like something else has gone wrong with the schema you are creating or with the DOMSerializer invocation, but I’m not sure.

> It throws at this.nodes[node.type.name](node), complaining that this.nodes[node.type.name] is not a function:
> [library code]

I’d suggest linking to library code instead of pasting it here, but it doesn’t matter much to me. In the future, maybe try pasting your own source code so we can better audit it.

sudars · September 26, 2022, 1:23am

Yep absolutely, sloppy linking on my part, sorry.

I think that I have a hackier solution than what you suggest, but it seems to be working for the paces I’ve put it through. I’m editing this slightly to remove some custom code, so this might not run exactly, but I have a version much like this running.

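// NOTE: The two definitions below were not part of the original post. They are
// hypothetical stand-ins for the custom code that was removed, added so that
// the snippet is self-contained.
const MAX_RECURSIVE_DOCUMENT_DEPTH = 100; // arbitrary assumed limit

const encodeHtmlEntities = (str) =>
  String(str)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');

/**
 * Appends the HTML-subset tokens for a node to `tokens`. This is intended to
 * be used as a callback to ProseMirror's `descendants()` fn, so the return
 * value says whether `descendants()` should recurse into the node's children.
 */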
const convertNodeToHtmlSubset = ({
  node,
  // The recursive depth of this call. The root is 1
  recursiveDepth,
  // An array of string tokens. This should be joined with the empty string to
  // get the final HTML string. This is used as a string builder rather than
  // simple `str += 'foo'` because this creates less garbage collection
  tokens,
}) => {
  if (recursiveDepth > MAX_RECURSIVE_DOCUMENT_DEPTH) {
    return false;
  }

  let recurseChildren = true;

  const getTags = (node) => {
    if (node.type?.name === 'bullet_list' || node.type?.name === 'bulletList') {
      return { open: '<ul>', close: '</ul>' };
    }

    if (node.type?.name === 'ordered_list' || node.type?.name === 'orderedList') {
      return { open: '<ol>', close: '</ol>' };
    }

    if (node.type?.name === 'list_item' || node.type?.name === 'listItem') {
      return { open: '<li>', close: '</li>' };
    }

    if (node.type?.name === 'paragraph') {
      return { open: '<p>', close: '</p>' };
    }

    if (node.isBlock && node.textContent.length > 0) {
      // In this case, there's text that we will display, but it's wrapped in a
      // tag that isn't in our subset. Instead wrap it in <p> for spacing and
      // display the text.
      return { open: '<p>', close: '</p>' };
    }

    return null;
  };

  const nodeShouldBeWrappedInTags = (node) => {
    return !!getTags(node);
  };

  const nodesToNotRecurse = new Set([
    'caption',
    'footnote',
  ]);

  if (nodesToNotRecurse.has(node.type?.name)) {
    recurseChildren = false;
  } else if (node.isText) {
    if (node.text.length > 0) {
      const link = node.marks.find((m) => m.type.name === 'link' && m.attrs.href);
      const em = node.marks.some((m) => m.type.name === 'em');
      const strong = node.marks.some((m) => m.type.name === 'strong');

      // Escape the text so characters like `<` and `&` can't break the markup.
      let html = encodeHtmlEntities(node.text);

      // Wrap in every applicable mark rather than only the first match, so
      // text that is e.g. both bold and linked keeps both.
      if (em) html = `<em>${html}</em>`;
      if (strong) html = `<strong>${html}</strong>`;
      if (link) {
        html = `<a target="_blank" href="${encodeHtmlEntities(link.attrs.href)}">${html}</a>`;
      }

      tokens.push(html);
    }
  } else if (['hard_break', 'hardBreak'].includes(node.type?.name)) {
    tokens.push('<br/>');
  } else if (nodeShouldBeWrappedInTags(node)) {
    const tags = getTags(node);

    tokens.push(tags.open);

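    // Recurse with this same function (the original post called a differently
    // named helper here, left over from the code that was edited out).
    node.descendants((child) => {
      return convertNodeToHtmlSubset({
        node: child,
        recursiveDepth: recursiveDepth + 1,
        tokens,
      });
    });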

    tokens.push(tags.close);

    // We got the children in `descendants()`. Don't recurse again.
    recurseChildren = false;
  } else {
    // We're ignoring this node.
  }

  return recurseChildren;
};

exports.convertProsemirrorDocToHtmlSubset = (doc) => {
  // Our helper will build this into a list of HTML tokens that we join to make
  // the final HTML.
  const tokens = [];

  doc.descendants((node) => {
    return convertNodeToHtmlSubset({
      node,
      recursiveDepth: 1,
      tokens,
    });
  });

  return (
    tokens
      .join('')
      .trim()
  );
};

bZichett · September 27, 2022, 10:00am

Interesting solution. I agree it is hackier, and parts of it are already implemented by DOMSerializer: there are clear parallels between what you are doing and what that approach would have done, possibly with far fewer lines of code. Still, with the DOMSerializer approach I am unsure about “no-op” toDOM methods (rendering nothing, or just skipping certain marks). You might still have needed a final custom HTML “clean-up” step, for instance to remove empty nodes.
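A clean-up step of that kind might look like the sketch below. It is deliberately naive and makes assumptions: `removeEmptyElements` is a hypothetical helper, it only understands attribute-less tags from the allowed subset, and it relies on a regular expression rather than a real HTML parser, which is only reasonable for output your own serializer produced.

```javascript
// Hypothetical clean-up helper: repeatedly strips elements that contain
// nothing but whitespace, so parents emptied by one pass go in the next.
function removeEmptyElements(html) {
  const emptyElement = /<(p|li|ul|ol|em|strong)>\s*<\/\1>/g;
  let previous;
  do {
    previous = html;
    html = html.replace(emptyElement, "");
  } while (html !== previous);
  return html;
}

console.log(removeEmptyElements("<ul><li><p></p></li></ul><p>kept</p>"));
// prints <p>kept</p>
```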

> I think that I have [a solution]

If it works for you, it works; this doesn’t seem that difficult to extend or maintain.