Document representation in JSON?

Hello peeps,

All of us who carry burns from experience with contentEditable will surely choose the most portable document representation available. JSON is a perfect basis for storing documents, especially considering native support in databases. Then the question of document structure comes into play: what should it look like so that it does not get in the way of the editor’s progress? I have been keeping an eye on content-kit-editor by Bustlelabs, and it seems they have come up with an interesting solution called Mobiledoc: https://github.com/bustlelabs/mobiledoc-kit

So I wonder: how hard would it be to convert a document from ProseMirror’s internal representation to Mobiledoc?

I’m working on this aspect of my project right now, and I am a little confused by your question. ProseMirror has a function to serialize its document model (register.serializeTo) into JSON format (or text or Markdown) and to parse it back into a PM document.

The following is a simple one block example, but feel free to try out more complex examples to see how it deals with inline styles, bulleted lists, etc.

{"content":
   [
    {"content":
      [
        {
           "text":"This area is dedicated to my personal book notes that Im electing to publicly share via Project AMPLE.",
           "type":"text"
        }
      ],
      "type":"paragraph"}
   ],
   "type":"doc"
}

If you aren’t using JavaScript on the server, you could plug in a package such as pyexecjs to run the appropriate ProseMirror converters server side and generate HTML there to render in templates (so as to avoid having to send the appropriate ProseMirror components to the client).

Or if the client will have the package, you can just use register.parseFrom there.

From there you can, for example, pull out a truncated string (the first text block), all hyperlinks, or all headers. Or you can just store both the HTML and JSON formats if you don’t care about duplication.
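A minimal sketch of that kind of extraction in Python, assuming the JSON has already been loaded into a dict. The `heading` node name and the `marks` / `attrs` layout for links follow the shape that current ProseMirror releases emit for the basic schema, which is an assumption here; older versions and custom schemas may differ. All function names are hypothetical.

```python
def walk(node):
    """Yield every node in a ProseMirror-style JSON tree, depth first."""
    yield node
    for child in node.get("content", []):
        yield from walk(child)


def first_text(doc, limit=80):
    """Return the first text node's text, truncated to `limit` characters."""
    for node in walk(doc):
        if node.get("type") == "text":
            return node.get("text", "")[:limit]
    return None


def headings(doc):
    """Return the plain text of each heading (assumed node type: 'heading')."""
    return [
        "".join(n.get("text", "") for n in walk(h) if n.get("type") == "text")
        for h in walk(doc)
        if h.get("type") == "heading"
    ]


def links(doc):
    """Return every href in link marks (assumed layout: marks[].attrs.href)."""
    return [
        mark["attrs"]["href"]
        for node in walk(doc)
        for mark in node.get("marks", [])
        if mark.get("type") == "link" and "href" in mark.get("attrs", {})
    ]


# Hypothetical sample document for illustration only.
doc = {
    "type": "doc",
    "content": [
        {"type": "heading", "attrs": {"level": 1},
         "content": [{"type": "text", "text": "Book notes"}]},
        {"type": "paragraph", "content": [
            {"type": "text", "text": "See "},
            {"type": "text", "text": "the site",
             "marks": [{"type": "link", "attrs": {"href": "https://example.com"}}]},
        ]},
    ],
}

print(first_text(doc))
print(headings(doc))
print(links(doc))
```

Because the stored document is just nested dicts and lists, one small recursive generator is enough to drive all three lookups.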

I checked out Mobiledoc and I’m not entirely sure what benefit it brings over this method. What about it has piqued your interest? I was just using the demo and it seems it hasn’t implemented bulleted/indented lists (whereas ProseMirror has).

Finally, if you are worried about this JSON schema changing over time, that’s something that we’d need Marijn to chime in on. If it does change, it’d make complete sense to assume fallback converters to keep it up to date, as there’d be too many issues any other way.

(Please clarify, and let me know if I misinterpreted your question!)


P.S. If this does answer your question and you use Python on the backend, I will gladly help you out with executing JavaScript modules from Python.

Hello @bZichett ,

thank you for the prompt reply! Indeed, I just ran through the documentation quickly and didn’t see a schema versioning mechanism. Sorry for the turbulence; I should have just asked what the mechanisms are for future-proofing the document’s internal representation.

I’m also very interested in a schema versioning/migration mechanism and opened a new thread specifically for this topic.

That sounds like an interesting setup. We are using Python serverside but have worked around that problem without using JavaScript. Have you looked at how costly it is to call JavaScript on the server? That would be interesting to see.

We are serializing everything to HTML before storing it. It seems to be doable without loss of information, and the big advantage is that the HTML format itself will be more stable than any JSON format any of us will come up with. Migrations between PM versions haven’t been a problem, at least so far. For really complex items where the HTML structure may change because we use some library, we just make sure to have all the information in a parent node.

For example, we switched from MathJax to KaTeX for math formulas and the two use different HTML structures to represent their formulas, but we had it stored like this:

<p>Some sentence<span class="math" data-formula="2X+16e=\sqrt{5R}">[MATHJAX MARKUP]</span>.</p>

When loading, we would only read the data-formula attribute of the span.math and regenerate the internal part. Switching from one library to another was therefore no big deal.
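A rough sketch of that loading step, written in Python with only the standard library. Where the loading happens in the setup described above is not stated, so the language choice and the names `FormulaCollector` and `extract_formulas` are assumptions for illustration. The point is that only the `data-formula` attribute is read and the rendered markup inside the span is ignored.

```python
from html.parser import HTMLParser


class FormulaCollector(HTMLParser):
    """Collect data-formula values from <span class="math"> elements."""

    def __init__(self):
        super().__init__()
        self.formulas = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        formula = attr_map.get("data-formula")
        # Only the attribute matters; the span's inner markup is never read.
        if tag == "span" and "math" in classes and formula is not None:
            self.formulas.append(formula)


def extract_formulas(html):
    """Return the formula source strings found in stored HTML."""
    parser = FormulaCollector()
    parser.feed(html)
    parser.close()
    return parser.formulas


stored = ('<p>Some sentence<span class="math" '
          'data-formula="2X+16e=\\sqrt{5R}">[MATHJAX MARKUP]</span>.</p>')
print(extract_formulas(stored))
```

Each recovered string can then be handed to whichever renderer is in use, so the stored HTML does not depend on the markup a particular math library produces.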

Just one caveat: HTML can almost always be used to fully represent the document DOM. There is one exception we have come across: sibling text nodes. Before using ProseMirror we found it was important to have sibling text nodes, so we serialize the HTML to JSON for storing in the database. We still do that now, for compatibility reasons, but given that we don’t need sibling text nodes any more, this will likely be gone in the next version of our file format.

Johanne,

I’ve actually implemented a similar type of setup. I’m serializing to three formats on the client (JSON, HTML, and plain text for search indexing). The JSON and HTML are a bit duplicative, but both have a purpose: the HTML is used for quick template rendering server side (the “blog view page”), and the JSON is used for pulling out previews (the first X chunks) or is sent in its entirety to the SPA, which has ProseMirror loaded (and therefore parse/serialize capability).

Accessing specific locations in the dictionary, or recursing/iterating through it, seems more ‘friendly’ than any Python DOM parsing library. I don’t have much experience with this, but I believe it might be quite a bit faster as well (I haven’t gotten to timing the difference yet).

As I said before, my most common use case (right now) is to pull out a preview directly from the JSON field instead of worrying about updating a separate preview field. This consists of a simple:

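try:
    # First top-level block of the stored document, used as the preview
    return self.text_json.get('content')[0]
except (AttributeError, TypeError, IndexError):
    # text_json is None, has no 'content', or 'content' is empty
    return None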

I’d predict this method is much more reliable and quicker than setting up a traversable/accessible DOM from HTML, especially if the HTML becomes a large chunk of data, but probably faster even if the HTML consisted of only a single sentence.
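For the "first X chunks" variant mentioned earlier, a slightly fuller sketch might look like the following. It is not the code from the project described here; `preview`, `node_text`, and the default limits are hypothetical, and it assumes the stored value is the dict form of the document shown earlier in the thread.

```python
def node_text(node):
    """Concatenate the text of a node and all of its descendants."""
    if node.get("type") == "text":
        return node.get("text", "")
    return "".join(node_text(child) for child in node.get("content", []))


def preview(doc, max_blocks=2, max_chars=120):
    """Join the text of the first `max_blocks` top-level blocks, then truncate."""
    if not isinstance(doc, dict):
        return None
    blocks = doc.get("content") or []
    text = " ".join(filter(None, (node_text(b) for b in blocks[:max_blocks])))
    if not text:
        return None
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rstrip() + "..."


# Hypothetical sample document for illustration only.
doc = {"type": "doc", "content": [
    {"type": "paragraph", "content": [{"type": "text", "text": "First block."}]},
    {"type": "paragraph", "content": [{"type": "text", "text": "Second block."}]},
    {"type": "paragraph", "content": [{"type": "text", "text": "Third block."}]},
]}
print(preview(doc))
```

Returning `None` for a missing or empty document keeps the same contract as the short snippet above, so a template can fall back to a default in either case.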

So yeah, I’ve only used pyexecjs to batch convert models that only had an HTML format pre-ProseMirror, or when I’m importing HTML or text programmatically from some API (say, importing emails from Gmail or notes from Evernote). I’d agree something like pyexecjs is cost prohibitive to run for anything more than batch conversions. And those batch conversions can be done anywhere (on my local development server, and then fed into the production server).

Some of this was kind of just to get my thoughts out and I actually copied part of this into my technical documentation notes since it’s a good short summary of a tiny bit of my architecture and thought process. I’d appreciate it if you let me know if anything doesn’t make sense to you. As for me, the only potential issues I see are mostly network transfer and data storage, with the understanding that everything else is better optimized.


As for the JSON schema, it’s pretty simple, and I think Marijn won’t have a real reason to change it for quite a while, if ever. If he does, I’d expect it to be a pretty quick conversion script. I’m not worrying about embedded math right now, although in the future I want to get there. What does all the information about the math formulas look like in what ProseMirror serializes to JSON format? Would that be easier for you to manage?

I only find references to serializeTo in old documentation; is there a more recent method of serializing the contents of the editor to JSON?

I think you want the toJSON method on the document object.

Thanks for the prompt reply!

I’m building a proof-of-concept project to hook ProseMirror up to an existing JSON document structure, but I’m unclear how to use toJSON / fromJSON. To get the data from my JSON blob into ProseMirror should I be using something like EditorState.fromJSON({schema}, jsonBlob)? To serialize the editor contents you point to the document object’s toJSON; is there a way to control the shape of the JSON it outputs?

I think that might work if you wrap the blob in {doc: jsonBlob}. But generally, to create a document from the blob you want schema.nodeFromJSON. The result of that can be put in the doc option to EditorState.create.

No, the library only defines a single format (though you can of course translate that to other structures with your own code).
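As an illustration of that kind of translation, here is a small Python sketch that goes from a made-up flat block list to the document shape shown earlier in the thread. The input format (`kind` / `text` keys) and the function name `to_prosemirror` are purely hypothetical, and the `kind` values are assumed to match node names in the target schema. The same walk could be written in JavaScript on the client before calling schema.nodeFromJSON.

```python
def to_prosemirror(blocks):
    """Build a ProseMirror-shaped doc dict from a hypothetical flat block list."""
    content = []
    for block in blocks:
        # Assumes block["kind"] is a node name the target schema defines.
        node = {"type": block["kind"]}
        # ProseMirror does not allow empty text nodes, so an empty block
        # is emitted without any content instead.
        if block.get("text"):
            node["content"] = [{"type": "text", "text": block["text"]}]
        content.append(node)
    return {"type": "doc", "content": content}


# Hypothetical existing structure for illustration only.
existing = [
    {"kind": "paragraph", "text": "Hello"},
    {"kind": "paragraph", "text": ""},
]
print(to_prosemirror(existing))
```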
