Parsing a custom file format into a ProseMirror schema

bruce · April 5, 2020, 9:43pm

I have a file “foo.lit” in a custom document format, which I would like to parse into a ProseMirror document that conforms to my schema. I’m looking for advice on how best to do this.

The thing is, the grammar for “.lit” files is nicely captured by the ProseMirror schema. For instance, the node specification looks something like:

doc:                  { content: "head? title? (block | chapter)+" }
head:                 { content: "setup_instruction+" }
setup_instruction:    { content: "text*" }
chapter:              { content: "title? (block | section)+" }
...

This suggests that a parser module that “speaks the language of ProseMirror schemas” would be the best fit for the job. It would be nice if ProseMirror had a FileParser class where I could say

FileParser.parse({my_schema}, {file_name}, {rules for matching content in the file})

Well, at any rate, what are your suggestions for building my parser? Should I try to use lezer? Or should I build it from scratch?

For instance, how would one best go about building a parser that turns a Markdown file into a ProseMirror document?

marijn · April 6, 2020, 4:46am

The actual parsing of the document format is outside of ProseMirror’s scope. I don’t know what kind of format you’re parsing here, so I can’t tell you how to write a parser, but it should be easy to construct ProseMirror document nodes as parser output, regardless of which technique you’re using—either directly, or as a transformation from the syntax tree.
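
As a rough illustration of the "directly" route, here is a minimal sketch that emits the plain JSON shape that Node.fromJSON(schema, json) in prosemirror-model accepts. The line-based syntax it reads is made up for the example and is not the real ".lit" grammar: a line starting with "# " becomes a title node, and runs of other lines separated by blank lines become block nodes. It also assumes that title and block nodes hold plain text, which the schema excerpt does not show. The names PMNodeJSON and parseToDocJSON are hypothetical.

```typescript
// JSON shape accepted by Node.fromJSON(schema, json) in prosemirror-model.
interface PMNodeJSON {
  type: string;
  content?: PMNodeJSON[];
  text?: string;
}

// Hypothetical syntax (an assumption, NOT the real ".lit" grammar):
//   "# Some title"                      -> title node
//   other lines, split by blank lines   -> block nodes
function parseToDocJSON(source: string): PMNodeJSON {
  const children: PMNodeJSON[] = [];
  let pending: string[] = [];

  // Turn the lines collected so far into one block node.
  const flush = (): void => {
    if (pending.length > 0) {
      children.push({
        type: "block",
        content: [{ type: "text", text: pending.join(" ") }],
      });
      pending = [];
    }
  };

  for (const rawLine of source.split(/\r?\n/)) {
    const line = rawLine.trim();
    const titleMatch = /^#\s+(.+)$/.exec(line);
    if (line === "") {
      flush(); // a blank line ends the current block
    } else if (titleMatch) {
      flush();
      children.push({
        type: "title",
        content: [{ type: "text", text: titleMatch[1] }],
      });
    } else {
      pending.push(line);
    }
  }
  flush();
  return { type: "doc", content: children };
}

const docJSON = parseToDocJSON(
  "# Hello\n\nfirst line\nsecond line\n\nnext block"
);
console.log(JSON.stringify(docJSON, null, 2));
```

With a real schema in hand, the result could be passed to Node.fromJSON(schema, docJSON), and calling check() on the returned node should then report content that violates the schema. This sketch does not enforce the content expressions itself.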

For the Markdown case, see prosemirror-markdown.

bruce · April 6, 2020, 10:11am

Ok thanks, both your answers are very helpful.