Python 'module format' equivalent

matthieubellon · June 14, 2016, 3:34pm

Hi,

We will store our content in the PM JSON document format. We are looking forward to developing an equivalent Python format module for our server-side operations.

Questions:

Is anyone already working on such a Python (or Ruby, Java, …) module that we could contribute to? (ping @johanneswilm?)
Do official (even draft) specs of the PM document format exist?

Thanks for your feedback,
Matthieu

johanneswilm · June 14, 2016, 5:57pm

Not working on that yet, but would potentially be interested in collaborating on something like this. It only makes sense if we are fairly certain that the maintenance of this won’t be a major burden.

marijn · June 14, 2016, 6:49pm

Are you looking to just port the representation part, or also things like transformations? The first is probably not too hard, and has been relatively stable across releases already (minus the linear position change). I’m (finally) ramping up for a 1.0, which is where the backwards compatibility guarantees start, and the point where we could start thinking about writing a spec.

matthieubellon · June 14, 2016, 8:20pm

Thanks for your replies

Yes we are only interested in the representation part. Our goal is, from a well specified content structure (PM Document), export to numerous formats, in batch or not.
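As a rough illustration of what "representation only" export could look like on a Python backend, here is a minimal sketch that walks a document in the nested `type` / `content` / `marks` JSON shape and emits Markdown as one possible target. The sample document, the node and mark names (`heading`, `paragraph`, `strong`, `em`, `code`) and all function names are assumptions made for the example; the real names depend on the schema in use, and there was no written spec of the format at the time of this thread.

```python
# Hypothetical sample document; node and mark names depend on the schema in use.
DOC = {
    "type": "doc",
    "content": [
        {"type": "heading", "attrs": {"level": 2},
         "content": [{"type": "text", "text": "Title"}]},
        {"type": "paragraph", "content": [
            {"type": "text", "text": "Hello "},
            {"type": "text", "text": "world", "marks": [{"type": "strong"}]},
        ]},
    ],
}

# Assumed mapping from mark names to Markdown delimiters.
MARK_WRAPPERS = {"strong": "**", "em": "*", "code": "`"}


def inline_to_markdown(node):
    """Render a text node, wrapping it once per known mark."""
    if node["type"] != "text":
        return ""  # other inline nodes are ignored in this sketch
    text = node["text"]
    for mark in node.get("marks", []):
        wrapper = MARK_WRAPPERS.get(mark["type"], "")
        text = f"{wrapper}{text}{wrapper}"
    return text


def block_to_markdown(node):
    """Render one top-level block; only headings get special treatment."""
    inline = "".join(inline_to_markdown(c) for c in node.get("content", []))
    if node["type"] == "heading":
        level = node.get("attrs", {}).get("level", 1)
        return "#" * level + " " + inline
    return inline


def doc_to_markdown(doc):
    """Join the rendered top-level blocks with blank lines."""
    return "\n\n".join(block_to_markdown(b) for b in doc.get("content", []))


print(doc_to_markdown(DOC))
```

Other export targets would follow the same pattern: one small renderer per node type, driven entirely by the stored JSON.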

We will wait for 1.0 and the spec, but we will start working on this as soon as possible, since I understand this part will not change much.

Thanks again

johanneswilm · June 20, 2016, 10:00pm

This question made me think more about how best to store the document using a Python backend today, if one is going to make a release now where users expect to be able to upgrade to future versions.

We are currently storing full HTML of the document every 2 minutes + all the steps that have been sent since the last full update. We use PM 0.7.0 in the frontend and the backend doesn’t understand the data.
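To make that storage model concrete, here is a small sketch of a server-side record that holds the last full HTML snapshot plus the steps received since, without interpreting the steps. The class, its fields and the version check are hypothetical and only illustrate the idea; they are not taken from any existing codebase.

```python
from dataclasses import dataclass, field


@dataclass
class StoredDocument:
    html: str = ""      # last full HTML snapshot sent by a client
    version: int = 0    # number of steps already contained in the snapshot
    pending_steps: list = field(default_factory=list)  # opaque step JSON

    def add_steps(self, client_version, steps):
        """Accept steps only from a client that has seen every earlier step."""
        if client_version != self.version + len(self.pending_steps):
            return False  # client is behind and has to catch up first
        self.pending_steps.extend(steps)
        return True

    def save_snapshot(self, html, snapshot_version):
        """Store a full snapshot and drop the steps it already contains."""
        applied = snapshot_version - self.version
        if applied < 0 or applied > len(self.pending_steps):
            raise ValueError("snapshot version out of range")
        self.html = html
        self.pending_steps = self.pending_steps[applied:]
        self.version = snapshot_version
```

Whatever is left in `pending_steps` at upgrade time is exactly the data that can only be read by the ProseMirror version that produced it.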

This seems problematic, because if the structure of the steps changes and the administrator upgrades their version, that will destroy the documents that contain unapplied steps that no longer work with the new transformation model.

I am thinking through a few solutions:

A) When the last collab user leaves a document, apply all unapplied steps serverside. Unfortunately this won’t quite work, because even though there is PyV8, it doesn’t have a DOM, so it won’t be able to load the document in HTML format.

B) Create a management page that an administrator can call from his/her browser which loads all the documents with unapplied changes and applies those changes and saves the document. The administrator is asked to do this before updating their server with any newer version.

C) Store both HTML and the PM document format on the server. Use the HTML for now, but for future versions plan on switching to the PM document format. Use PyV8 to apply unapplied steps to the PM document format version only whenever a document is closed.

I’m actually not sure which one makes more sense. It would be preferable though if one could find a solution so that we don’t need to ship older versions of PM with future versions of our software just to be able to read the steps.

@matthieubellon: Is this a problem you guys are facing? If so, what are you doing about it?

marijn · June 21, 2016, 7:39am

ProseMirror’s test suite is running the DOM serializer/parser on top of jsdom. I’m not sure how hard it would be to make that run outside of node, but I don’t see any good reason why it couldn’t.

matthieubellon · June 21, 2016, 7:49am

@johanneswilm I am not sure I can answer your question properly, because you in fact raised two problems (storage and collaborative issues).

Storage is a problem we are facing, right now. And we have yet to find the best solution.

At the moment we store our content in plain text (Markdown) which, by far, was a mistake.

We are now moving our text editor from CodeMirror to ProseMirror, and, in the process, we try to define THE correct way to store content.

Our users want to decide when they “commit” changes, so we disabled the automatic save every x seconds. I think we won’t have “live” collaborative editing, which (again, in our context) brought more UX and technical issues than user benefits. So I suppose our problems diverge here.

For storage we are studying:

The XML/HTML way: Good because HTML has well-defined specs. Kind of bad for expressing complex markers such as annotations.
The JSON way: Good for expressing complex structures like annotations that span multiple paragraphs or overlap each other, or custom metadata. Bad because it is not specified at the moment (until PM has document specs written down).
The plain text way: We have done that for 2 years with Markdown. What a terrible mistake I made (I only saw the advantages and stupidly swept the disadvantages of an unspecified format under the carpet).

At the moment I am thinking of a custom JSON superset of the PM document format, once it is specified. But this idea still has to be battle tested.

johanneswilm · June 21, 2016, 12:18pm

Thanks, that may make sense. Unfortunately, I found a post from 2012 where a jsdom developer apparently claimed that jsdom would not run in anything other than nodejs, due to requirejs etc. There is a link to Mozilla’s dom.js there, but that hasn’t been updated for 4 years. There is an updated version called domino, but it is also mainly made for nodejs. Unlike jsdom, it claims to also run on older versions of nodejs, so it should also work with Ubuntu 14.04 servers.

So… I will spend some time figuring out how feasible this is in the short term. Right now I am thinking PyExecJS hardlinked to nodejs + domino may be the way to go.

Yes, indeed. And this also explains why you won’t be looking into translating the transformation code to Python.

Right. However, given that it has to be displayable in a web browser, and that you need to be able to handle paste (?), won’t you have to have some way of serializing it into HTML in an unambiguous way anyway?

Agreed.

Taking your comments together with our previous experience of changing filetypes, I am wondering if we should simply save everything in two ways: one that we use now and one that we potentially use in the future. That way we minimize the risk. On the other hand, it would be more ideal to be able to define a migration step on the server once we do change. If we can get access to some kind of DOM on the server without introducing lots of large dependencies, that may be the sanest way of going about this.

matthieubellon · June 22, 2016, 7:59am

Definitely. This is what makes the data storage question a bit hard to sort out.

This is what we are thinking within the team right now. But it seems weird to have that strategy going on.

We’ll wait for PM v1 and Document specs to feed our thoughts on this.

johanneswilm · August 4, 2017, 10:15am

Hey again, are people still interested in this? I assume that in order to be able to apply transformations on a python backend now, one would have to port prosemirror-transform and the code to resolve positions ( https://github.com/ProseMirror/prosemirror-model/blob/master/src/resolvedpos.js#L203-L218 ). Does the backend need to know the document schema as well, or is that not needed, @marijn?

marijn · August 4, 2017, 10:34am

To apply transformations, definitely. I still really don’t recommend doing this – that’s a lot of rather complicated code to port.
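One way to see why the schema matters even for the position-resolving part: in ProseMirror's position model a text node counts one position per character, a leaf node counts 1, and any other node counts its content size plus 2 for its opening and closing tokens. Whether a childless node is a leaf or an empty container cannot be read off the JSON alone. The sketch below is a simplified Python illustration of that counting, loosely following the loop in the linked `resolvedpos.js`; `LEAF_TYPES`, the sample document and all function names are assumptions for the example, not part of any library.

```python
# Assumption: these node types are leaves in the (hypothetical) schema.
LEAF_TYPES = {"horizontal_rule", "hard_break", "image"}


def node_size(node):
    """Size of a node in the position model described above."""
    if node["type"] == "text":
        # Caveat: JavaScript counts UTF-16 code units, Python counts code points.
        return len(node["text"])
    if node["type"] in LEAF_TYPES:
        return 1
    return 2 + content_size(node)  # opening token + content + closing token


def content_size(node):
    return sum(node_size(child) for child in node.get("content", []))


def find_index(node, offset):
    """Return (child index, start offset of that child) for an offset in node."""
    if offset == 0:
        return 0, 0
    cur = 0
    for i, child in enumerate(node.get("content", [])):
        end = cur + node_size(child)
        if end == offset:
            return i + 1, end   # exactly on the boundary after child i
        if end > offset:
            return i, cur       # strictly inside child i
        cur = end
    raise ValueError("offset outside of node content")


def resolve(doc, pos):
    """Return ([(node type, child index), ...], offset into the innermost parent)."""
    if not 0 <= pos <= content_size(doc):
        raise ValueError(f"position {pos} out of range")
    path, node, parent_offset = [], doc, pos
    while True:
        index, offset = find_index(node, parent_offset)
        rem = parent_offset - offset
        path.append((node["type"], index))
        if rem == 0:
            break                # position sits between children of this node
        node = node["content"][index]
        if node["type"] == "text":
            break                # position is inside a text node
        parent_offset = rem - 1  # step past the child's opening token
    return path, parent_offset


doc = {"type": "doc", "content": [
    {"type": "paragraph", "content": [{"type": "text", "text": "Hi"}]},
    {"type": "horizontal_rule"},
    {"type": "paragraph"},
]}
print(resolve(doc, 6))  # position inside the empty last paragraph
```

If `horizontal_rule` were missing from `LEAF_TYPES`, it would be counted as an empty container of size 2 and every later position would shift by one, which is the kind of mismatch a schema-unaware backend cannot detect.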

johanneswilm · August 4, 2017, 10:49am

Ok, but unless one can run nodejs on the backend, the only choices are porting the transformation code (++) to the server’s language or have a client do the transformations, right? We have been doing the second for the past two years, and it’s working ok, but we’ve had to deal with a lot of edge cases and the code is now so complex, it’s close to impossible to get new programmers to understand all that is going on.

johanneswilm · August 4, 2017, 11:03am

One more option I could think of, C:

Option A) Right now we send transformations around that the server doesn’t understand, but it can still act as a central authority for distribution, etc. … Every two minutes the clients send in a copy of the full document with transformations applied.

Advantages:

Code already exists

Disadvantages:

A lot of complexity around having to deal with a server that doesn’t really know the document.

Option B) Port the transformation code to Python.

Advantages:

Less traffic (no full documents sent).
Server always up to date, and a lot of the code becomes relatively simpler.

Disadvantages:

Have to convert and maintain a lot of complex Python code.

Option C) Along with each transformation, send a chunk of the document, taken after the transformation has been applied, that represents the changed part of the document. Only send full nodes along with information about where to insert them.

Advantages:

Server knows about full current document at any time.
Less code to port and maintain than option B.

Disadvantages:

Somewhat more network traffic. Some operations like adding a letter will likely not add much extra space, whereas others (make entire document bold) will be as big as the entire document.

@marijn Would option C sound reasonable to you?

Edit: For option C one could use a JSON diff mechanism available in several languages, such as json-delta [1], to send this type of diff along with the prosemirror steps. This would be a bit of overhead because the same information is transmitted twice, but it would have the advantage of always having a server with a current version of the document.

[1] http://json-delta.readthedocs.io/en/latest/

marijn · August 4, 2017, 1:03pm

I’d embed a V8 in the Python process and use that to run the ProseMirror code – there are several Python-V8 bindings, I’m not sure what the best one is at the moment.

johanneswilm · August 4, 2017, 1:57pm

Thanks! That was also an option I was looking at a while ago. I believe the main reason we chose not to do that was that it would cause quite a bit of overhead to run it in addition to the Python server according to our test results. Additionally, the installation process became quite a bit more difficult. But we don’t have current data.

Of course, patching a JSON structure in Python will also not be without cost. I think we may start by trying option C using a JSON patch format such as RFC6902 and one of the patch libraries for JavaScript/Python, and then see how that compares with running a separate V8 process as you suggest.
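For readers unfamiliar with RFC 6902, the following sketch shows what applying such a patch to a stored document amounts to on the Python side. It is a deliberately minimal applier written for illustration, supporting only the `add`, `remove` and `replace` operations on non-root paths; a real deployment would more likely use an existing patch library. The function names and the sample document are hypothetical.

```python
import copy


def _unescape(token):
    # RFC 6901 escaping: "~1" stands for "/", "~0" for "~" (decode "~1" first).
    return token.replace("~1", "/").replace("~0", "~")


def _walk(doc, pointer):
    """Return (parent container, final key) for a non-root JSON Pointer."""
    if not pointer.startswith("/"):
        raise ValueError("this sketch only supports non-root pointers")
    tokens = [_unescape(t) for t in pointer[1:].split("/")]
    parent = doc
    for token in tokens[:-1]:
        parent = parent[int(token)] if isinstance(parent, list) else parent[token]
    return parent, tokens[-1]


def apply_patch(doc, patch):
    """Apply add/remove/replace operations to a copy of doc and return it."""
    doc = copy.deepcopy(doc)
    for op in patch:
        kind = op["op"]
        if kind not in ("add", "remove", "replace"):
            raise ValueError(f"unsupported operation: {kind}")
        parent, key = _walk(doc, op["path"])
        if isinstance(parent, list):
            index = len(parent) if key == "-" else int(key)
            if kind == "add":
                if not 0 <= index <= len(parent):
                    raise IndexError("add index out of range")
                parent.insert(index, op["value"])
            elif kind == "remove":
                del parent[index]
            else:
                parent[index] = op["value"]
        elif kind == "remove":
            del parent[key]
        else:
            if kind == "replace" and key not in parent:
                raise KeyError(key)
            parent[key] = op["value"]
    return doc


# Hypothetical stored document and a patch a client might send with its steps.
stored = {"type": "doc", "content": [
    {"type": "paragraph", "content": [{"type": "text", "text": "Helo"}]},
]}
patch = [
    {"op": "replace", "path": "/content/0/content/0/text", "value": "Hello"},
    {"op": "add", "path": "/content/1", "value": {"type": "paragraph"}},
]
updated = apply_patch(stored, patch)
```

The server applies the patch blindly here; nothing in it checks that the patch matches the ProseMirror steps it arrived with.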

johanneswilm · September 16, 2017, 10:47am

We have now tried it out for some time, and it is indeed possible to use RFC6902-type tools to send patches back to the server and have Python apply them directly. Some negative points about it:

One ends up sending more data, because changes need to be sent both as steps and as RFC6902-compliant patches.
In some situations, ProseMirror takes care of adjusting the document structure automatically without there being a step (notably: a new document is created without contents and PM makes sure it starts with the minimal permitted structure of documents). In such cases one needs to make sure to send patches to the server, which will otherwise be unaware of this.
Serverside enforcement of partial editing rights seems difficult: It’s easy enough to allow or prohibit a user entirely from making any changes. But if a specific user is, for example, only allowed to add comments, and those comments are part of the document structure, it’s not really feasible. If the user sends a patch that does not correspond with the steps, there is really no way for the server to notice.

Despite these shortcomings, we will release Fidus Writer 3.3 with this patch mechanism, as it removes a lot of complications we didn’t really know how to deal with.

As for running a V8 process on the server: As far as I can tell, that is problematic because the PyV8 bindings have not been updated for some 5 years and other solutions seem to do a lot of translations that result in slow execution times. This is all not too good when working over websockets and having to do everything in a single thread to ensure that the order of steps stays as it is.

If others here have experimented with another solution that works better – I would be very interested in hearing about it.

elgow · July 10, 2019, 4:04am

I’ve been trying exactly that. I tried PyMiniRacer, which died of a stack overflow seg fault during creation of a small schema, and js2py which died while translating the ProseMirror code bundle.

I’d like to inquire about whether anyone has had success in using the ProseMirror javascript code in a server-side interpreter. This would, IMHO, be the cleanest way to manipulate editor documents in the server.

johanneswilm · July 29, 2019, 12:13pm

@elgow Did you see my more recent post on https://github.com/fiduswriter/prosemirror-python ? It’s working - it’s just not that fast. I could see how it could make sense under some circumstances though - for example you might have clients send the full document to the server every now and then and only if all clients disconnected abruptly and you end up needing the full document on the server you use this to apply changes that were not in the last full document update.

sciyoshi · August 9, 2019, 3:48am

You may be interested in taking a look at some excellent work done by my teammate Shen Li. He wrote a direct translation of the schema + document parts of ProseMirror into Python, and in the process even uncovered a few bugs in the original implementation.

We have been working on rewriting our editing experience using ProseMirror and will be rolling this library out to production in the next few weeks.

matthieubellon · August 9, 2019, 7:17am

Hello,

Thanks for the links, testing it right now