Pandoc export

Hey, over the past few years I have programmed a few different export filters for Fidus Writer: DOCX, ODT, JATS, EPUB, HTML and LaTeX. The latest addition is an exporter to the JSON format used by pandoc (thanks to sponsorship by the European Union/NLnet).

However, all of these exporters are written specifically for the schema used by Fidus Writer. In each case I am essentially iterating over the contents of a prosemirror document in some form or other.

I am wondering if it would make sense to try to rewrite the pandoc export filter so it is more generic and can be used by other prosemirror-based editors as well and then to share the maintenance burden (I would change the license to match that of prosemirror).

Previously I thought that it wouldn’t make much sense, as the schema used is specific to each editor. But others have created prosemirror-docx and prosemirror-markdown, and for the export to the pandoc JSON format I had to go through a lot of trial and error, looking at what the JSON looks like when importing other formats into pandoc. So I wonder whether it would make sense to maintain a package that covers at least the basic node types editors have in common, and then make it extensible somehow.
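
One possible shape for such a package, as a rough sketch only: a map of handlers for the common node types, which an editor extends with handlers for its own types. The node names follow prosemirror-schema-basic; `createExporter`, `baseHandlers` and `textToInlines` are hypothetical names invented for this illustration, and marks are ignored to keep it short.

```typescript
// Minimal shapes for ProseMirror JSON and pandoc JSON (only what this sketch needs).
interface PMNode {
  type: string;
  text?: string;
  attrs?: Record<string, unknown>;
  content?: PMNode[];
}
interface PandocElement {
  t: string;
  c?: unknown;
}
type Handler = (
  node: PMNode,
  convert: (nodes: PMNode[]) => PandocElement[]
) => PandocElement[];

// Split a text run into pandoc Str and Space elements.
function textToInlines(text: string): PandocElement[] {
  const out: PandocElement[] = [];
  text.split(" ").forEach((word, i) => {
    if (i > 0) out.push({ t: "Space" });
    if (word.length > 0) out.push({ t: "Str", c: word });
  });
  return out;
}

// Handlers for node types that most schemas share.
const baseHandlers: Record<string, Handler> = {
  paragraph: (node, convert) => [{ t: "Para", c: convert(node.content ?? []) }],
  heading: (node, convert) => [
    // Header carries [level, [id, classes, key-value pairs], inlines].
    { t: "Header", c: [Number(node.attrs?.level ?? 1), ["", [], []], convert(node.content ?? [])] },
  ],
  text: (node) => textToInlines(node.text ?? ""),
};

// Editor-specific handlers are merged over the shared ones.
function createExporter(extra: Record<string, Handler> = {}) {
  const handlers: Record<string, Handler> = { ...baseHandlers, ...extra };
  const convert = (nodes: PMNode[]): PandocElement[] => {
    const out: PandocElement[] = [];
    for (const node of nodes) {
      const handler = handlers[node.type];
      if (handler === undefined) throw new Error(`No handler for node type "${node.type}"`);
      out.push(...handler(node, convert));
    }
    return out;
  };
  return convert;
}

// Usage: an editor adds a handler for a node type the base set does not cover.
const exportBlocks = createExporter({
  horizontal_rule: () => [{ t: "HorizontalRule" }],
});
const exportedBlocks = exportBlocks([
  { type: "paragraph", content: [{ type: "text", text: "Hello world" }] },
  { type: "horizontal_rule" },
]);
console.log(JSON.stringify(exportedBlocks));
```

Throwing on unknown node types is a deliberate choice in this sketch: a schema-specific node that silently disappears from the export is harder to notice than an error.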

Would that make sense or did everyone else here already come to the conclusion that it makes more sense to create exports that are schema-specific?

Hello @johanneswilm, I’m writing a Prosemirror-based editor for Pandoc’s internal format. Here’s the link, but for now you won’t find anything, because I haven’t published it yet, sorry. :blush:

It reads and writes Pandoc’s JSON format, so I had to write the code to make the conversion from Pandoc’s Blocks and Inlines to Prosemirror’s Nodes and Marks, and vice versa.

Pandoc vs Prosemirror

The conversion is pretty straightforward for blocks, but it’s rather complex at inline level, because you have to match the tree-like nature of Pandoc Inlines with the flat model of Prosemirror Marks.

It is perfectly fine for Pandoc to nest an Emph inside another Emph, but it’s difficult to model it with a Mark in Prosemirror, unless you differentiate the two Emphs with some attribute, and set the excludes property to an empty string in its MarkSpec.

That’s because in Prosemirror a Mark is either set or not set on a span of text; you can’t set it twice.
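
To make the mismatch concrete, here is a minimal sketch of flattening a pandoc inline tree into ProseMirror-style text nodes. The mark names `em` and `strong` and the helper names `flatten` and `mergeAdjacent` are assumptions made for this illustration, and only a handful of inline types are handled.

```typescript
interface PandocInline {
  t: string;
  c?: unknown;
}
interface PMMark {
  type: string;
}
interface PMText {
  type: "text";
  text: string;
  marks: PMMark[];
}

// Assumed mapping from pandoc inline constructors to mark names.
const markNames: Record<string, string> = { Emph: "em", Strong: "strong" };

// Walk the inline tree, carrying the set of marks that are active so far.
function flatten(inlines: PandocInline[], active: PMMark[] = []): PMText[] {
  const out: PMText[] = [];
  for (const inline of inlines) {
    const markName = markNames[inline.t];
    if (inline.t === "Str") {
      out.push({ type: "text", text: String(inline.c), marks: active });
    } else if (inline.t === "Space") {
      out.push({ type: "text", text: " ", marks: active });
    } else if (markName !== undefined) {
      // A mark is either set or not set: a nested Emph adds nothing new.
      const already = active.some((m) => m.type === markName);
      const next = already ? active : [...active, { type: markName }];
      out.push(...flatten(inline.c as PandocInline[], next));
    } else {
      throw new Error(`Inline type "${inline.t}" is not handled in this sketch`);
    }
  }
  return out;
}

// Join neighbouring text nodes that carry the same marks.
function mergeAdjacent(nodes: PMText[]): PMText[] {
  const out: PMText[] = [];
  for (const node of nodes) {
    const last = out[out.length - 1];
    if (last !== undefined && JSON.stringify(last.marks) === JSON.stringify(node.marks)) {
      last.text += node.text;
    } else {
      out.push({ ...node });
    }
  }
  return out;
}

// Emph [ Str "a", Space, Emph [ Str "b" ] ]
const nestedEmph: PandocInline[] = [
  { t: "Emph", c: [{ t: "Str", c: "a" }, { t: "Space" }, { t: "Emph", c: [{ t: "Str", c: "b" }] }] },
];
const flatNodes = mergeAdjacent(flatten(nestedEmph));
console.log(JSON.stringify(flatNodes));
// → [{"type":"text","text":"a b","marks":[{"type":"em"}]}]
```

The inner Emph leaves no trace in the result, which is exactly the information that would be lost on a round trip unless the two Emphs are told apart by an attribute.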

Even for a given document model – I’m focusing on the Pandoc AST now – you can imagine a bunch of slightly different Prosemirror schemas.

For example, how do you model a Pandoc RawInline? Since it’s an Inline, I first thought of a Mark in Prosemirror. I eventually decided on an atomic inline Node instead, providing a textual sub-editor in the GUI.
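
A sketch of what that choice could look like, not the actual code of the editor described here: the spec is written as a plain object shaped like a prosemirror-model NodeSpec so that the example has no dependencies, and the node name `raw_inline` and the attribute names `format` and `text` are assumptions.

```typescript
// Shaped like a prosemirror-model NodeSpec: an inline leaf node that is
// treated as a single unit (atom) and keeps its payload in attrs.
const rawInlineSpec = {
  inline: true,
  group: "inline",
  atom: true,
  attrs: {
    format: { default: "html" },
    text: { default: "" },
  },
};

// pandoc serialises RawInline as { t: "RawInline", c: [format, text] }.
interface PandocRawInline {
  t: "RawInline";
  c: [string, string];
}
interface PMRawInline {
  type: "raw_inline";
  attrs: { format: string; text: string };
}

function rawInlineToPM(el: PandocRawInline): PMRawInline {
  const [format, text] = el.c;
  return { type: "raw_inline", attrs: { format, text } };
}

function rawInlineToPandoc(node: PMRawInline): PandocRawInline {
  return { t: "RawInline", c: [node.attrs.format, node.attrs.text] };
}

const rawOriginal: PandocRawInline = { t: "RawInline", c: ["html", "<br>"] };
const rawRoundTrip = rawInlineToPandoc(rawInlineToPM(rawOriginal));
console.log(JSON.stringify(rawRoundTrip));
// → {"t":"RawInline","c":["html","<br>"]}
```

Because the raw text lives in attrs rather than in the node’s content, the editor never tries to apply Marks to it, and the conversion in both directions stays a one-to-one mapping.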

Back to your question

I think it’s nevertheless possible to abstract some functions to help in the conversion between Prosemirror and Pandoc, or any format that is tree-like at inline level. The trickiest part of that job is solving the flat vs tree-like translation.

Here I’m describing the path I followed, because I think it relates to your question:

  • I started thinking of different prosemirror-based editors for different models;

  • for each one I wanted to provide an export function to Pandoc JSON, this way providing an export to any format supported by Pandoc;

  • it meant maintaining a bunch of editors sharing parts of code and the ability to export into Pandoc JSON;

  • eventually I opted for a single editor based on Pandoc JSON, that can be configured to adapt to different models and workflows, that way becoming “multiple editors”;

  • choosing Pandoc’s internal model is clearly a strong requirement, but in return you can use all its input and output formats, and you can even support further ones through custom readers and writers;

  • the challenge I face is making a single editor become “multiple editors” without changing the editor’s code, only through configuration files or custom readers, writers and filters.

Hey @massi, that all sounds very interesting. As for the translation problem: I take it that this is mainly a problem if you want to have non-lossy two-way conversion between the Pandoc JSON and the ProseMirror JSON. It should be less of a problem when you consider the file types Pandoc outputs, as a lot of them will have a distinction between inline and block level content similar to ProseMirror, right? Even if a format allows the tree approach on inline content, it will not necessarily make sense. For example, a double <strong> or two levels of <i> in HTML will not be something that an author really needs to write, will it?

I’ll look forward to the release of your project. It sounds like it may make most sense for me to contribute to your project if I feel anything is missing.

Yes, you are right about nested <strong> or <i>. Apart from that, the translation is non-lossy.

Situations like this:

<strong>A strong text with <i>italic span</i> inside</strong>.

are not too difficult to manage.

There are ill cases that are possible in Prosemirror but not in correct HTML:

<strong>Strong text <i>intertwined</strong> with italic text</i>.

In Prosemirror you’d have three text nodes: the first and the third with only one Mark, the second one in the middle would have both.

There is also the case of overlapping Marks, which is not ill at all:

<p>Text that is both <strong><i>italic and strong</i></strong>.</p>
<p>Text that is both <i><strong>italic and strong</strong></i>.</p>

Both the paragraphs above would have the same representation in Prosemirror JSON.

The hardest part has been coding how to manage those cases. A hierarchy between Marks decides which has to be cut in ill, intertwined cases, and which Mark should be external or internal in overlapping.
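
A minimal sketch of such a hierarchy, under simplifying assumptions: marks are reduced to plain names without attrs, the priority list is ordered from outermost to innermost, and `nest` and `render` are hypothetical helpers. It is not the code of either project, just one way to implement the idea.

```typescript
interface FlatText {
  text: string;
  marks: string[];
}
type Tree = string | { tag: string; children: Tree[] };

// Build a tree from flat text nodes. `order` lists mark names from
// outermost to innermost; a mark later in the list gets cut when it
// intertwines with one earlier in the list.
function nest(nodes: FlatText[], order: string[]): Tree[] {
  const outer = order[0];
  if (outer === undefined) return nodes.map((n) => n.text);
  const rest = order.slice(1);

  // Group consecutive nodes by whether they carry the outer mark.
  const groups: { has: boolean; nodes: FlatText[] }[] = [];
  for (const node of nodes) {
    const has = node.marks.indexOf(outer) !== -1;
    const last = groups[groups.length - 1];
    if (last !== undefined && last.has === has) last.nodes.push(node);
    else groups.push({ has, nodes: [node] });
  }

  // Wrap each marked group once, then handle the remaining marks inside.
  const out: Tree[] = [];
  for (const group of groups) {
    const inner = nest(group.nodes, rest);
    if (group.has) out.push({ tag: outer, children: inner });
    else out.push(...inner);
  }
  return out;
}

function render(tree: Tree[]): string {
  return tree
    .map((t) => (typeof t === "string" ? t : `<${t.tag}>${render(t.children)}</${t.tag}>`))
    .join("");
}

// The intertwined case: three text nodes, the middle one with both marks.
const intertwined: FlatText[] = [
  { text: "Strong text ", marks: ["strong"] },
  { text: "intertwined", marks: ["strong", "em"] },
  { text: " with italic text", marks: ["em"] },
];
console.log(render(nest(intertwined, ["strong", "em"])));
// → <strong>Strong text <em>intertwined</em></strong><em> with italic text</em>

// The overlapping case: the hierarchy, not the stored order, picks the outer tag.
const overlapping: FlatText[] = [{ text: "italic and strong", marks: ["em", "strong"] }];
console.log(render(nest(overlapping, ["strong", "em"])));
// → <strong><em>italic and strong</em></strong>
```

With `["em", "strong"]` as the order, the same input would instead keep the italic span whole and split the strong one, so the priority list is where an editor encodes which Marks must never be cut.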

Hey, yes, I’m familiar with the prosemirror format. I can see how the conversion can be lossy in some cases if you cannot distinguish between those two cases you mention. But as far as I can tell, there is no difference in meaning. So even if one isn’t able to recreate the original HTML (or other format), one does not lose any information that could be important to the author, correct? It doesn’t matter to the reader whether the strong or the italic tag is on the outside.

After I was done writing the pandoc exporter bit, I noticed that pandoc is quite lossy when exporting to some formats. So I’d guess that if there is an order between those two tags, you are not guaranteed that output formats like ODT or DOCX will also have an order or follow that particular order - right?

Btw, here is the code: fiduswriter/fiduswriter/document/static/js/modules/exporter/pandoc at c9c55036b9a475f5ad395c910a651f2b3e0d7365 · fiduswriter/fiduswriter · GitHub

With strong and i the order is rather irrelevant.

When you have some data in the attrs of a Mark, you’d prefer your Mark not to be split into many tags with the same attributes.

Suppose you put a mark on a span of text to create a reference for an index. Something like this:

<p>Hello <span class="index-ref" idref="..."><em>wonderful</em> world</span>!</p>

If you model an index reference as a Mark, you’d have two text Nodes, “wonderful” and “ world”, both with the same Mark.

If you export like that, you’ll get two index references instead of one.

You may have the same problems with links, citations, or any Mark that carries important data in the attrs that should not be split.

It’s manageable, but it requires some attention.
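
One way to give it that attention, sketched under assumptions: before exporting, group consecutive text nodes that carry the same data-bearing Mark (same type and same attrs) and emit one wrapper per group. The mark name `index_ref`, the value `"i1"` for `idref`, and the helpers `findMark`, `sameMark` and `groupRuns` are all made up for this illustration.

```typescript
interface PMMark {
  type: string;
  attrs?: Record<string, unknown>;
}
interface PMText {
  text: string;
  marks: PMMark[];
}
interface MarkRun {
  mark: PMMark | null; // null means the run does not carry the mark
  nodes: PMText[];
}

function findMark(node: PMText, type: string): PMMark | null {
  for (const mark of node.marks) {
    if (mark.type === type) return mark;
  }
  return null;
}

// Simplification: attrs are compared through JSON, so key order matters.
function sameMark(a: PMMark | null, b: PMMark | null): boolean {
  if (a === null || b === null) return a === b;
  return a.type === b.type && JSON.stringify(a.attrs ?? {}) === JSON.stringify(b.attrs ?? {});
}

// Split a paragraph's text nodes into runs that share the given mark.
function groupRuns(nodes: PMText[], type: string): MarkRun[] {
  const runs: MarkRun[] = [];
  for (const node of nodes) {
    const mark = findMark(node, type);
    const last = runs[runs.length - 1];
    if (last !== undefined && sameMark(last.mark, mark)) last.nodes.push(node);
    else runs.push({ mark, nodes: [node] });
  }
  return runs;
}

// "Hello <index-ref><em>wonderful</em> world</index-ref>!" as flat text nodes.
const indexRef: PMMark = { type: "index_ref", attrs: { idref: "i1" } };
const paragraphNodes: PMText[] = [
  { text: "Hello ", marks: [] },
  { text: "wonderful", marks: [{ type: "em" }, indexRef] },
  { text: " world", marks: [indexRef] },
  { text: "!", marks: [] },
];
const indexRuns = groupRuns(paragraphNodes, "index_ref").filter((run) => run.mark !== null);
console.log(indexRuns.length); // → 1
```

A caveat of this approach: two genuinely separate references that sit next to each other with identical attrs would be merged into one, so attrs that must stay distinct need a unique id.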