How to parse raw HTML text with ProseMirror?

I want to open and edit an HTML file with ProseMirror, but I don’t want to load the HTML into the DOM for security reasons. Preferably, ProseMirror would parse the HTML text into its JSON representation and thus only render secure content.

Is this possible?

I’m seeing that DOMParser only parses HTML nodes.

It looks like prosemirror-markdown uses markdown-it to parse directly into the JSON format… So I guess I should mention that I’m just using the prosemirror-example-setup. Perhaps this doesn’t exist yet?

Thanks!

Browsers can parse HTML securely. Look into createHTMLDocument for creating a detached DOM document.


Hmm. I don’t think this prevents a <script> from being evaluated within that DOM though…

It would have taken you about 10 seconds to experimentally find out that, yes, it does prevent that.


Touché.

I didn’t see any mention of security in that documentation though.

const doc = document.implementation.createHTMLDocument()
const div = doc.createElement("div")
div.innerHTML = `<div>
<p>Hello world</p>
<script>
console.log("hello")
</script>
</div>`


The script doesn’t appear to be evaluated, but the script tag is there.

It looks like DOMParser is also an option.

new DOMParser().parseFromString(`<div>
<p>Hello world</p>
<script>
alert("hello")
</script>
</div>`, "text/html")
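
To connect this back to the original question, here is a minimal sketch of the two-step pipeline: the HTML text is first parsed into an inert DOM tree, and that tree is then handed to a ProseMirror-style parser. The names parseHtmlToDoc, HtmlStringParser and DocParser are hypothetical and not part of any library; the two parsers are passed in as arguments so the sketch has no imports and can run in Node.js as well as in a browser.

```typescript
// Hypothetical structural types: just enough shape for the two parsers,
// so the sketch compiles without DOM or ProseMirror typings.
interface HtmlStringParser<Dom> {
  parseFromString(html: string, mimeType: "text/html"): { body: Dom }
}

interface DocParser<Dom, Doc> {
  parse(dom: Dom): Doc
}

// Hypothetical helper: HTML text -> inert DOM -> editor document.
function parseHtmlToDoc<Dom, Doc>(
  html: string,
  htmlParser: HtmlStringParser<Dom>,
  docParser: DocParser<Dom, Doc>
): Doc {
  // Step 1: parse the text into a DOM tree that is not attached to the page.
  const inert = htmlParser.parseFromString(html, "text/html")
  // Step 2: convert that tree into a document using the second parser.
  return docParser.parse(inert.body)
}
```

In a browser, the first argument could be the built-in new DOMParser() and the second the result of prosemirror-model's DOMParser.fromSchema(schema). The two classes share a name, so one of them would need an import alias. In Node.js tests, the DOMParser from a jsdom window should fit the first slot.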

Thanks for the help

I’m running into this issue again when writing unit tests in Node.js (no browser document).

The prosemirror-model tests use jsdom (see here) for this.


Causing some pain again: HTMLElement is not defined… It appears that node is typed as string | Node, so I can’t just check typeof node !== "string". Since HTMLElement inherits from Node, I need an actual instanceof check for TypeScript to be happy…

		checkbox: {
			noBlockSelect: "first-child",
			content: "paragraph block*",
			defining: true,
			attrs: {
				checked: { default: false },
			},
			parseDOM: [
				{
					tag: `li[data-type="checkbox"]`,
					contentElement: ":scope > div",
					priority: 51,
					getAttrs(node) {
						if (node instanceof HTMLElement) {
							const child = node.querySelector(
								":scope > input"
							) as HTMLInputElement | null

							return { checked: child ? child.checked : false }
						}
					},
				},
			],

Any ideas? I’m thinking I might just patch HTMLElement into the Node.js global.

Maybe check node.nodeType == 1 instead?
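
A minimal sketch of that suggestion, assuming a structural type guard is acceptable in place of instanceof. The names ElementLike, isElementNode and getCheckedAttr are hypothetical; the point is that nothing here refers to the HTMLElement global, so it also runs in Node.js.

```typescript
// Hypothetical structural stand-in for a DOM element: only the members used below.
type ElementLike = {
  nodeType: number
  querySelector(selector: string): unknown
}

// Same numeric value as Node.ELEMENT_NODE in the DOM.
const ELEMENT_NODE = 1

// Type guard that narrows without touching the HTMLElement global.
function isElementNode(node: unknown): node is ElementLike {
  return (
    typeof node === "object" &&
    node !== null &&
    (node as { nodeType?: unknown }).nodeType === ELEMENT_NODE
  )
}

// Hypothetical replacement for the getAttrs body in the checkbox rule.
function getCheckedAttr(node: unknown): { checked: boolean } | null {
  // Strings and non-element nodes yield no attrs, so the schema defaults apply.
  if (!isElementNode(node)) return null
  const input = node.querySelector(":scope > input") as { checked?: boolean } | null
  // Coerce to a real boolean so a missing input gives false rather than null.
  return { checked: Boolean(input && input.checked) }
}
```

Inside the parse rule, getAttrs(node) could then simply return getCheckedAttr(node). Returning null keeps the rule matching with default attrs, much like the implicit undefined in the snippet above; returning false instead would make the rule not match.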