Scan the document to find links in paragraphs and mark them

vitalij · May 14, 2020, 3:13pm

Hi,

I’ve been trying to figure out how I can scan a document to find unmarked URLs in nodes. I have thought of two ways of tackling the problem:

1. Go through the document tree myself, scan the textContent value of Text nodes to find URLs, and then split the nodes up. This would be my last resort, as I believe there’s probably a better way of achieving this through ProseMirror transactions or the transform package.
2. Listen for spaces and check whether the last word is a URL. If so, mark it as a link. It’s a little different from the actual request, but it achieves a similar result, and if that’s easier to do I would go this route. Similar behaviour can be found on Bitbucket, which uses ProseMirror.

I’ve been going through the documentation for some time now, and it definitely seems achievable; I just can’t see how. What do you believe is the best solution?

P.S. I’m also taking a look at the atlassian/prosemirror-utils package. It has some valuable functions but I’m not too sure if I need it.

marijn · May 14, 2020, 3:38pm

To do this for an entire document, you could call doc.descendants, and for every text node that’s not already a link, scan for URLs and build up a transaction with addMark changes whenever you find any.
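A minimal sketch of that idea, under some assumptions: the schema has a link mark with an href attribute, and the regex is a hypothetical, simplified URL pattern. The function name markUnlinkedUrls is made up for illustration. Note that a URL split across several text nodes (for example, partly bold) would not be found this way.

```javascript
// Hypothetical URL pattern for illustration; real-world URL matching needs more care.
const URL_PATTERN = /https?:\/\/[^\s<>"']+/g;

// Adds a link mark to every URL found in text nodes that are not already linked.
// `tr` is a ProseMirror Transaction (e.g. state.tr); `linkType` is the schema's
// link MarkType (assumed here to accept an `href` attribute).
function markUnlinkedUrls(tr, linkType) {
  // Walk the document as it was before any marks were added. Adding marks
  // does not shift positions, so the positions found here stay valid.
  const doc = tr.doc;
  doc.descendants((node, pos) => {
    if (!node.isText) return true;                   // keep descending
    if (linkType.isInSet(node.marks)) return false;  // already a link
    for (const match of node.text.matchAll(URL_PATTERN)) {
      const from = pos + match.index;                // pos is the start of this text node
      const to = from + match[0].length;
      tr.addMark(from, to, linkType.create({ href: match[0] }));
    }
    return false;
  });
  return tr;
}
```

The returned transaction can then be dispatched or applied like any other.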

You could also express the update-on-space behavior as an input rule, but that’s a bit trickier, since it’s not a plain text replacement, and writing such rules that create their own transaction requires some non-obvious code. And this won’t catch pasted or dropped URLs, or those that don’t have a space typed after them (because they are inserted before an existing space, at the end of a paragraph, etc), so that’s probably not the best way to do this.

vitalij · May 15, 2020, 9:24am

Thank you for your reply @marijn

I’m going down the doc.descendants path right now, but I’m struggling with marking specific text. I’ve been working on a snippet that takes a text node and marks a substring of its text. Here’s the code that I have so far:

const mark = Mark.fromJSON(this.editor_state.schema, { "type": "link", "attrs": { "href": "https://google.com", "title": "a" } })

const node = doc.content.cut(2,5);
const cut_node = node.content[0];
cut_node.marks.push(mark);

This kind of works, but instead of marking a substring, it marks the whole node. This example is just for a bare-bones paragraph. How would I go about ‘cutting’ the node properly?

marijn · May 15, 2020, 9:37am

Firstly, you never want to mutate (see push) node data structures. Secondly, it is much easier to update documents through a Transform or Transaction.
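As a rough sketch of the transaction-based route, assuming the schema defines a link mark with href and title attributes (as in the snippet above): tr.addMark takes document positions and a mark, and produces a new document in which only that range carries the mark, with no existing node mutated. The helper name linkRange and the view variable are hypothetical.

```javascript
// Returns a transaction that adds a link mark to the range from..to
// without mutating any existing node. `state` is an EditorState.
function linkRange(state, from, to, attrs) {
  // Assumes the schema has a mark type named `link`.
  const mark = state.schema.marks.link.create(attrs);
  return state.tr.addMark(from, to, mark);
}

// Usage (the `view` variable is hypothetical):
// view.dispatch(linkRange(view.state, 2, 5, { href: "https://google.com", title: "a" }));
```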

vitalij · May 15, 2020, 4:04pm

Right, I didn’t see that Transaction extends Transform; that makes things a lot easier. I’ve got my snippet working with AddMarkStep, and the rest seems straightforward, i.e. finding the URLs and getting their positions.

One last thing that I would love to get clarification on is positions for AddMarkStep when working with nested nodes. Will I have to create a new Transform object for each node to make my changes, or do I need to get the positions differently?

marijn · May 15, 2020, 6:36pm

Something like doc.descendants will pass you the whole-document position of each node, so you should easily be able to create a single document-wide transform, apply all your changes, and then use the result.

vitalij · May 19, 2020, 3:21pm

Managed to make it work, thank you for your help. The approach I ended up taking:

1. Get a transaction object from the EditorState, i.e. state.tr
2. Use doc.descendants to go over all nodes of the document
3. In the doc.descendants callback, look for nodes where node.isTextblock is true, and do not descend into them
4. Find the positions of links in node.textContent using a regex
5. Make sure that there are no interfering links in the way, using node.nodesBetween
6. Create a Mark object of type link
7. Create an AddMarkStep object and pass it to the transaction using tr.step(step)
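The steps above could be sketched roughly as follows. This is not the exact code from the project, and it rests on a few assumptions: the schema has a link mark with an href attribute, the regex is a simplified placeholder, and textblocks contain only text (a non-text inline node such as an image takes up a position but by default adds nothing to textContent, which would throw the offsets off). For brevity it calls tr.addMark, which adds the AddMarkStep to the transaction, instead of constructing the step by hand. The name linkifyTextblocks is made up.

```javascript
// Hypothetical URL pattern for illustration only.
const URL_PATTERN = /https?:\/\/[^\s<>"']+/g;

// `state` is an EditorState whose schema is assumed to define a `link` mark
// with an `href` attribute. Returns a transaction carrying the new link marks.
function linkifyTextblocks(state) {
  const linkType = state.schema.marks.link;
  const tr = state.tr;
  state.doc.descendants((node, pos) => {
    if (!node.isTextblock) return true; // keep looking inside container nodes
    for (const match of node.textContent.matchAll(URL_PATTERN)) {
      const start = match.index;        // offsets relative to the textblock's content
      const end = start + match[0].length;
      // Skip the match if any part of it is already inside a link.
      let hasLink = false;
      node.nodesBetween(start, end, (child) => {
        if (linkType.isInSet(child.marks)) hasLink = true;
      });
      if (hasLink) continue;
      // The textblock's content starts one position after the node itself.
      tr.addMark(pos + 1 + start, pos + 1 + end, linkType.create({ href: match[0] }));
    }
    return false; // do not descend into the textblock
  });
  return tr;
}
```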

orestis · October 10, 2022, 12:38pm

Since I’m also looking for a way to do this, I was wondering whether this approach works well when applied as you type. I see two downsides:

1. Performance: you’re scanning the whole document again and again.
2. If a user actually doesn’t want some URL to be converted to a link, and removes the link entirely, then you are effectively overriding their decision.

Is there a way to apply this to just the nodes that were “touched” by the last transaction?