Rust vs Python prosemirror

Hey,

prosemirror-rs has been around for a number of years. But it didn’t include the changes made to prosemirror over the past few years, and it couldn’t work with a schema supplied at runtime. So, as part of a project by NLnet/EU on extending Fidus Writer, I tried to add what was missing so that I could use it as an alternative backend instead of the Python version.

I did some benchmarking:

────────────────────────────────────────────────────────────
Backend   : rust
Documents : 5000
Steps/doc : 500
Wall time : 15.220 s
CPU time  : 15.210 s
Memory Δ  : +32.6 MB
────────────────────────────────────────────────────────────
Backend   : python
Documents : 5000
Steps/doc : 500
Wall time : 41.648 s
CPU time  : 41.650 s
Memory Δ  : +4934.2 MB
============================================================
COMPARISON (Rust vs Python)
Wall time : 0.37x (faster)
CPU time  : 0.37x (faster)
Memory Δ  : 0.01x (smaller)

(I am wondering if I made a mistake in the memory calculation, as the difference is so extreme.)
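One thing worth ruling out is the measurement method itself. The following is a minimal sketch, not the benchmark script that produced the numbers above: the `measure` helper and its result keys are hypothetical. It shows that `tracemalloc` only counts allocations made through Python's own allocators, so memory allocated inside a native extension would not appear in that figure, whereas a process-level number such as RSS would include it. Reporting both per backend is one way to check whether the gap is real.

```python
import time
import tracemalloc


def measure(workload):
    """Run workload() once; report wall time, CPU time and Python-heap peak.

    Hypothetical helper for illustration only. Note that tracemalloc slows
    the workload down, so timings taken with tracing enabled are pessimistic.
    """
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()

    workload()

    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    # Peak size of blocks allocated via Python's allocators while tracing.
    # Allocations made by native code with its own allocator are NOT counted.
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"wall_s": wall, "cpu_s": cpu, "py_heap_peak_mb": peak / 1e6}
```

If the Python backend is measured this way and the Rust backend is not visible to the tracer, the two memory figures are not comparable; a process-level measurement taken in a fresh process per backend would be a fairer basis.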

I’ll be happy to hand this back to the original prosemirror-rs maintainer if he wants it. In the meantime you can find it on GitHub: fiduswriter/prosemirror-rs (an implementation of the https://prosemirror.net model / transform API in Rust).

There are Python bindings on PyPI: prosemirror-rs

And Node bindings here: https://www.npmjs.com/package/prosemirror-rs

As a proof of concept, I have now even compiled it to wasm, so one can use the Rust implementation as a replacement for prosemirror-model and prosemirror-transform.

Test it here: prosemirror-rs Demos. Warning: EXPERIMENTAL CODE! Some functionality may not work.

What is the point?

My idea was initially that this could be a good way of running the same code in the browser and on the backend. Or, at the very least, a way to discover problems with the code by trying it out in the browser before running into issues on the server. Maybe it could even make sense to run it in the browser for speed if one has a lot of plugins.

Unfortunately, I don’t think a lot of that makes sense any more. It turns out that in Rust, the binding code for the different targets (node, wasm, python, etc.) makes up quite a large percentage of the code. So it’s not just a thin layer, and bugs can hide in this binding layer just as easily as in the core Rust code itself. As for speed: it takes longer to download the wasm binaries, and after that, in most cases the limiting factor will be the human doing the typing. Even if one runs a lot of plugins, my guess is that they will spend most of their time in JavaScript, and the conversion back and forth between wasm and JavaScript doesn’t save time either. Maybe, if someone were to write some plugins that do a lot of calculations directly in Rust, that could actually speed things up.

So really, the one thing I will be using this for is the server layer, through the Python bindings. And for that I believe it did make sense to add the entire API, so that all the prosemirror-model and prosemirror-transform tests can be run against it, which helped get rid of a lot of minor bugs.

If someone else can use it for something, let me know. I think the most important thing about these backend implementations in other languages is that they are maintained over time. And that requires more projects using them.

Looks like impressive gains over the straight Python port (in particular the memory footprint).

What sort of document manipulation are you doing server-side that requires you to have a port in the first place?

We are doing as little as possible server side with it. If it’s not an E2EE document, the backend receives individual steps, applies them to the document, distributes them to collaborators, and then saves every X steps. It also rejects steps if they are too old, does some basic checks on whether steps correspond to what the user’s access rights allow them to do, and sends out “missing steps” if some were lost earlier.
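To make that flow concrete, here is a rough sketch of such a step-handling loop. It is not the Fidus Writer code: `DocumentSession`, its fields and the reply dictionaries are all made up for illustration, the document is a plain string, and `insert_text` is a toy stand-in for applying a real prosemirror step through the bindings. The exact rule for when steps count as "too old" is also an assumption (here: older than the retained step history).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class DocumentSession:
    """Hypothetical per-document state held by the server."""
    doc: Any
    apply_step: Callable[[Any, dict], Any]  # stand-in for real step application
    save: Callable[[Any, int], None]        # stand-in for persisting the document
    save_every: int = 3                     # the "X" in "saves every X steps"
    max_history: int = 100                  # confirmed steps kept for resending
    version: int = 0
    history: List[dict] = field(default_factory=list)

    def receive(self, client_version: int, steps: List[dict]) -> dict:
        if client_version > self.version:
            return {"status": "rejected", "reason": "unknown version"}
        if client_version < self.version:
            behind = self.version - client_version
            if behind > len(self.history):
                # Assumption: too old means we can no longer resend the gap.
                return {"status": "rejected", "reason": "too old"}
            # Client missed some steps: send them so it can catch up and retry.
            return {"status": "rejected", "reason": "outdated",
                    "missing": self.history[-behind:], "version": self.version}
        # Apply to a working copy so a failing step leaves the state untouched.
        doc = self.doc
        try:
            for step in steps:
                doc = self.apply_step(doc, step)
        except ValueError:
            return {"status": "rejected", "reason": "invalid step"}
        previous = self.version
        self.doc = doc
        self.version += len(steps)
        self.history = (self.history + list(steps))[-self.max_history:]
        # Save whenever a multiple of save_every has been crossed.
        if self.version // self.save_every > previous // self.save_every:
            self.save(self.doc, self.version)
        return {"status": "accepted", "version": self.version,
                "broadcast": list(steps)}


def insert_text(doc: str, step: dict) -> str:
    """Toy step: insert step["text"] at step["pos"] in a string document."""
    pos = step["pos"]
    if not 0 <= pos <= len(doc):
        raise ValueError("position out of range")
    return doc[:pos] + step["text"] + doc[pos:]
```

A session would be created once per open document, for example `DocumentSession(doc="", apply_step=insert_text, save=store_snapshot)` with some `store_snapshot` callback of your own, and `receive` called for every incoming message. The access-rights check would slot in just before the steps are applied.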

But the server is written in Python, and it’s an extendable system (Django + some additions so it’s easier to create plugins for both frontend and backend) - users create their own modules to add extra functionality or integrations with their overall setup. So I can’t easily switch away from Python for the backend without leaving users behind (mainly educational institutions, individual professors, etc.). At one time it was suggested to just run a separate Node.js process on the server. The issue with that is that it’s just too complex for many of those operating instances to have another process running on the side, and it’s a nightmare for us to try to debug - especially if we don’t have direct access to their server.

For example, access rights can be “is allowed to only add comments (which leaves comment markers in the document)” or “can write in the document, but all added text has to be marked as an unapproved tracked change”. One could just have the backend receive snapshots of the document at regular intervals and not apply the steps to the document itself, but then the backend wouldn’t be able to check whether users stuck to the access rights they were assigned.
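As an illustration of what such a check could look like, here is a simplified sketch that inspects steps in their JSON form. The rights names (`"write"`, `"comment"`, `"write-tracked"`) and the mark names `comment` and `insertion` are assumptions for the example, not necessarily what Fidus Writer uses, and the tracked-change rule is deliberately strict: it only accepts replace steps that remove nothing and whose inserted text all carries the insertion mark. A real check has to handle deletions, node attributes and the other step types as well.

```python
COMMENT_MARK = "comment"      # assumed mark name
INSERTION_MARK = "insertion"  # assumed mark name for tracked insertions


def text_nodes(content):
    """Yield every text node in a list of node JSON objects, recursively."""
    for node in content or []:
        if node.get("type") == "text":
            yield node
        yield from text_nodes(node.get("content"))


def has_mark(node: dict, mark_name: str) -> bool:
    return any(m.get("type") == mark_name for m in node.get("marks", []))


def step_allowed(step: dict, rights: str) -> bool:
    """Return True if a step (as JSON) is compatible with the given rights."""
    kind = step.get("stepType")
    if rights == "write":
        return True
    if rights == "comment":
        # Only adding or removing comment markers is permitted.
        return (kind in ("addMark", "removeMark")
                and step.get("mark", {}).get("type") == COMMENT_MARK)
    if rights == "write-tracked":
        if kind != "replace":
            return False
        # Simplification: nothing may be removed from the document directly...
        if step.get("from") != step.get("to"):
            return False
        # ...and every piece of inserted text must be marked as an insertion.
        inserted = step.get("slice", {}).get("content", [])
        return all(has_mark(n, INSERTION_MARK) for n in text_nodes(inserted))
    return False  # read-only or unknown rights
```

Because the check needs to look at what each step actually does to the document, it only works when the server sees the steps themselves, which is the reason snapshots alone would not be enough here.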

For E2EE, the clients do follow the advice given and just send in snapshots of the documents every now and then and the server just stores diffs without being able to decrypt the contents. For that reason there are only read-and-write or read-only access rights for E2EE documents and none of the more complex options available to documents that are not encrypted.
