Scan the document to find links in paragraphs and mark them


I’ve been trying to figure out how to scan a document for unmarked URLs in nodes. I have thought of two ways of tackling the problem:

  1. Going through the document tree myself and scanning the textContent value of Text nodes to find URLs and then splitting them up. This would be my last resort, as I believe there’s probably a better way of achieving this through ProseMirror transactions or the transform package.
  2. Listening for spaces and checking whether the last word is a URL. If so, mark it as a link. It’s a little different from the actual request, but it achieves a similar result, and if that’s easier to do I would go this route. Similar behaviour can be found on Bitbucket, which uses ProseMirror.

I’ve been going through the documentation for some time now, and it definitely seems achievable; I just can’t see how. What do you believe is the best solution?

P.S. I’m also taking a look at the atlassian/prosemirror-utils package. It has some valuable functions but I’m not too sure if I need it.

To do this for an entire document, you could call doc.descendants, and for every text node that’s not already a link, scan for URLs and build up a transaction with addMark changes whenever you find any.
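A minimal sketch of that idea, not a definitive implementation. The function name `linkifyDoc` and the pattern `URL_RE` are made up for illustration, and the code assumes the schema has a `link` mark with an `href` attribute. It takes the document, a transaction and the mark type as arguments so it needs no imports; the ProseMirror calls it relies on are `doc.descendants`, `tr.addMark`, `markType.create` and `markType.isInSet`.

```javascript
// Hypothetical URL pattern; real-world URL detection usually needs something stricter.
const URL_RE = /https?:\/\/[^\s<>"]+/g;

// doc: the ProseMirror document node
// tr: a transaction, e.g. state.tr
// linkType: the link mark type (assumed to be schema.marks.link)
function linkifyDoc(doc, tr, linkType) {
  doc.descendants((node, pos) => {
    // Only look at text nodes that are not already links.
    if (!node.isText || linkType.isInSet(node.marks)) return;
    for (const match of node.text.matchAll(URL_RE)) {
      // pos is the whole-document position at which this text node starts.
      const from = pos + match.index;
      const to = from + match[0].length;
      tr.addMark(from, to, linkType.create({ href: match[0] }));
    }
  });
  return tr;
}
```

Adding marks does not shift positions, so the positions read from the original document stay valid for the whole pass. One limitation of scanning per text node: a URL whose text is split across nodes with different marks (half of it bold, say) would not be matched as a whole.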

You could also express the update-on-space behavior as an input rule, but that’s a bit trickier, since it’s not a plain text replacement, and writing such rules that create their own transaction requires some non-obvious code. And this won’t catch pasted or dropped URLs, or those that don’t have a space typed after them (because they are inserted before an existing space, at the end of a paragraph, etc), so that’s probably not the best way to do this.


Thank you for your reply @marijn

I’m going down the doc.descendants path right now, but I’m struggling with marking specific text. I’ve been working on a snippet that would take a text node and mark a substring of its text. Here’s the code that I have so far:

const mark = Mark.fromJSON(this.editor_state.schema, { "type": "link", "attrs": { "href": "", "title": "a" } })

const node = doc.content.cut(2,5);
const cut_node = node.content[0];

This kind of works, but instead of marking a substring, it marks the whole node. This example is just for a bare-bones paragraph. How would I go about ‘cutting’ the node properly?

Firstly, you never want to mutate (see push) node data structures. Secondly, it is much easier to update documents through a Transform or Transaction.


Right, I didn’t see that Transaction extends Transform; that makes things a lot easier. I’ve got my snippet working with AddMarkStep, and the rest seems straightforward, i.e. finding the URLs and getting their positions.

One last thing I would love clarification on is positions for AddMarkStep when working with nested nodes. Will I have to create a new Transform object for each node to make my changes, or do I need to get the positions differently?

Something like doc.descendants will pass you the whole-document position of each node, so you should easily be able to create a single document-wide transform, apply all your changes, and then use the result.
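To illustrate the position point with a small sketch: the helper `findTextRanges` below is hypothetical, and it searches for a plain substring rather than a URL to keep the arithmetic visible. The positions handed to the `doc.descendants` callback are already whole-document positions, so a range inside a nested text node is just the node's position plus the offset within its text.

```javascript
// Records the whole-document range of every occurrence of `needle`
// in any text node, however deeply nested.
function findTextRanges(doc, needle) {
  const ranges = [];
  if (!needle) return ranges; // an empty needle would never advance
  doc.descendants((node, pos) => {
    if (!node.isText) return;
    let index = node.text.indexOf(needle);
    while (index !== -1) {
      // pos is the whole-document position where this text node starts.
      ranges.push({ from: pos + index, to: pos + index + needle.length });
      index = node.text.indexOf(needle, index + needle.length);
    }
  });
  return ranges;
}
```

For a document shaped like blockquote > paragraph > "hi", the callback sees the blockquote at 0, the paragraph at 1 and the text at 2, so `findTextRanges(doc, "hi")` yields a single range from 2 to 4. Every range collected this way can be applied to the same transaction.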


Managed to make it work, thank you for your help. The approach I ended up taking:

  1. Get the current transaction object from EditorState ie
  2. Use doc.descendants to go over all nodes of the document
  3. The doc.descendants callback looks for nodes where node.isTextblock is true and does not descend into them.
  4. Find link positions in node.textContent using a regex
  5. Make sure that there are no interfering links in the way using node.nodesBetween
  6. Create a Mark object of type link
  7. Create an AddMarkStep object and pass it to the transaction using tr.step(step)
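
The steps above could be sketched roughly as follows. This is a reconstruction under assumptions, not the code actually used: the function name `buildLinkTransaction` and the pattern `LINK_RE` are invented, the schema is assumed to define a `link` mark with an `href` attribute, and the existing-link check uses `doc.rangeHasMark` in place of `node.nodesBetween`. The `AddMarkStep` class (from prosemirror-transform) is passed in as an argument only so that the sketch has no imports.

```javascript
// Hypothetical URL pattern, not taken from the thread.
const LINK_RE = /https?:\/\/[^\s<>"]+/g;

// state: an EditorState
// AddMarkStep: the step class from prosemirror-transform, passed in by the caller
function buildLinkTransaction(state, AddMarkStep) {
  const tr = state.tr;                                  // step 1
  const linkType = state.schema.marks.link;             // assumed mark name
  state.doc.descendants((node, pos) => {                // step 2
    if (!node.isTextblock) return true;                 // keep descending
    for (const match of node.textContent.matchAll(LINK_RE)) { // step 4
      // +1 skips the textblock's opening token. This assumes the block
      // contains only text; inline leaf nodes would shift the offsets.
      const from = pos + 1 + match.index;
      const to = from + match[0].length;
      // step 5: skip ranges that already contain a link mark
      if (state.doc.rangeHasMark(from, to, linkType)) continue;
      const mark = linkType.create({ href: match[0] }); // step 6
      tr.step(new AddMarkStep(from, to, mark));         // step 7
    }
    return false;                                       // step 3: don't descend
  });
  return tr;
}
```

The returned transaction would then be dispatched as usual, for example with `view.dispatch(buildLinkTransaction(view.state, AddMarkStep))`.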