How to parse HTML raw text with ProseMirror?

ccorcos · March 23, 2021, 5:14am

I want to open and edit an HTML file with ProseMirror, but I don’t want to load the HTML into the DOM for security reasons. Preferably ProseMirror would parse the HTML text into its JSON representation and thus only render secure content.

Is this possible?

I’m seeing that ProseMirror’s DOMParser only parses DOM nodes, not raw HTML strings.

It looks like prosemirror-markdown uses markdown-it to parse directly into the JSON format… So I guess I should mention that I’m just using the prosemirror-example-setup. Perhaps this doesn’t exist yet?

Thanks!

marijn · March 23, 2021, 9:53am

Browsers can parse HTML securely. Look into createHTMLDocument for creating a detached DOM document.

ccorcos · April 1, 2021, 9:14pm

Hmm. I don’t think this prevents someone from writing a <script> that gets evaluated within that DOM though…

marijn · April 2, 2021, 7:04am

It would have taken you about 10 seconds to experimentally find out that, yes, it does prevent that.

ccorcos · April 2, 2021, 7:08pm

Touché.

I didn’t see any mention of security in that documentation though.

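// Detached document: the markup below is parsed into DOM nodes without being attached to the page.
const doc = document.implementation.createHTMLDocument()
const div = doc.createElement("div")
div.innerHTML = `<div>
<p>Hello world</p>
<script>
console.log("hello")
</script>
</div>`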

The script doesn’t appear to be evaluated, but the script tag is there.

It looks like the browser’s built-in DOMParser is also an option.

new DOMParser().parseFromString(`<div>
<p>Hello world</p>
<script>
alert("hello")
</script>
</div>`, "text/html")

Thanks for the help

ccorcos · June 30, 2021, 11:21pm

I’m running into this issue again while writing unit tests in Node.js (no browser document).

marijn · July 1, 2021, 6:22am

The prosemirror-model tests use jsdom (see here) for this.

ccorcos · October 25, 2021, 2:45am

This is causing some pain again: “HTMLElement is not defined”… The getAttrs parameter appears to be typed as node: string | Node, so I can’t just check for typeof node !== "string". Since HTMLElement inherits from Node, I need an actual instanceof check for TypeScript to be happy…

		checkbox: {
			noBlockSelect: "first-child",
			content: "paragraph block*",
			defining: true,
			attrs: {
				checked: { default: false },
			},
			parseDOM: [
				{
					tag: `li[data-type="checkbox"]`,
					contentElement: ":scope > div",
					priority: 51,
					getAttrs(node) {
						if (node instanceof HTMLElement) {
							const child = node.querySelector(
								":scope > input"
							) as HTMLInputElement | null

							return { checked: child && child.checked }
						}
					},
				},
			],
		},

Any ideas? I’m thinking I might just patch HTMLElement into the Node.js global.

marijn · October 25, 2021, 6:29am

Maybe check node.nodeType == 1 instead?
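
A minimal sketch of what that nodeType check could look like as a TypeScript type guard, so that nothing references the HTMLElement global at runtime. The names NodeLike, ElementLike, InputLike, isElement, and getCheckboxAttrs are hypothetical and not part of ProseMirror; the small interfaces are stand-ins for the DOM types so the snippet runs in plain Node.js without jsdom.

```typescript
// Minimal structural stand-ins for the DOM types (assumption: only the
// members used below matter), so this compiles and runs without a browser.
interface NodeLike {
  nodeType: number
}
interface InputLike extends NodeLike {
  checked: boolean
}
interface ElementLike extends NodeLike {
  querySelector(selector: string): NodeLike | null
}

// Same numeric value as Node.ELEMENT_NODE in the DOM.
const ELEMENT_NODE = 1

// Hypothetical helper: narrows `string | NodeLike` to an element by
// inspecting nodeType instead of using `instanceof HTMLElement`.
function isElement(node: string | NodeLike): node is ElementLike {
  return typeof node !== "string" && node.nodeType === ELEMENT_NODE
}

// Hypothetical getAttrs body for the checkbox rule above.
// Returns null for non-elements, which ProseMirror treats as "use default attrs".
function getCheckboxAttrs(node: string | NodeLike): { checked: boolean } | null {
  if (!isElement(node)) return null
  const input = node.querySelector(":scope > input") as InputLike | null
  return { checked: input !== null && input.checked }
}
```

In the schema this would be called as getAttrs(node) { return getCheckboxAttrs(node) }. With the real DOM typings the guard could presumably be declared as node is HTMLElement instead: the type name only exists at compile time, so it should not trigger the “HTMLElement is not defined” error the way instanceof does.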