Store Document in memory

voluntadpear · March 15, 2018, 1:25pm

Hello, I want to make sure my ProseMirror editor is production-ready and can handle potentially malicious users as well as bugs. We are using ProseMirror as a collaborative editor and would like to validate all steps coming from the client before storing them on the server. We’ve been looking into keeping the full current document state in memory on the server and trying to apply new steps (which may come from different users collaboratively editing the same document) to that state. Only if the steps apply without problems would we store them in our database.

We are wondering whether this is a plausible approach, whether there’s anything we should take into account before doing this, and whether there’s any other approach you would recommend for validating steps sent by different clients to the same document. We worry about the additional complexity of keeping all those documents in server memory and in sync, especially if we consider horizontal scaling of servers. Is there an easier way to validate steps than keeping the state instance for each document around in memory?

Thank you.

marijn · March 15, 2018, 2:03pm

Hi. Yes, that sounds like a plausible approach, though note that just because a step can be applied doesn’t mean that it is valid—the content of inserted nodes, for example, isn’t checked when they are inserted as a whole. So you might also want to call .check() on the resulting document to make sure it conforms to the schema. Are you already validating the JSON data against a schema? The fromJSON methods are not terribly defensive, and might create bogus nodes when given bad input.
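
The flow discussed here (parse each step, apply it, then check the resulting document) could be sketched roughly as follows. This is an illustration, not code from the thread: the function name applyClientSteps is hypothetical, and Step (the class exported by prosemirror-transform) is passed in as a parameter only so that the snippet needs no imports.

```javascript
// Sketch: validate client steps against the server's in-memory document.
// Assumptions: `schema` is your ProseMirror Schema, `doc` is the current
// document Node, `stepsJSON` is an array of JSON-encoded steps, and `Step`
// is the Step class from prosemirror-transform (injected by the caller).
function applyClientSteps(schema, doc, stepsJSON, Step) {
  let current = doc;
  for (const json of stepsJSON) {
    // Deserialize the step; this is expected to throw on malformed JSON.
    const step = Step.fromJSON(schema, json);
    // step.apply returns a result object with `doc` and `failed` fields.
    const result = step.apply(current);
    if (result.failed) {
      throw new Error("Step could not be applied: " + result.failed);
    }
    current = result.doc;
  }
  // A step applying cleanly does not prove the result is schema-valid,
  // so check the resulting document explicitly (throws if invalid).
  current.check();
  return current;
}
```

The caller would store the steps and replace its in-memory document only if this function returns without throwing.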

marijn · March 15, 2018, 4:21pm

I’ve just released new versions of prosemirror-model, prosemirror-transform, and prosemirror-state that have more defensive fromJSON methods—they will check the types of their input, so if someone submits, for example, step json with a from field of "lalala", you’ll get an immediate error, rather than an invalid Step object.
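
As a rough illustration of how a server might rely on this behaviour, the sketch below turns a fromJSON error into a rejection of the whole batch. The helper name parseSteps and its return shape are hypothetical, and Step (from prosemirror-transform) is passed in as a parameter so the snippet stays free of imports.

```javascript
// Sketch: deserialize untrusted step JSON, rejecting the batch on the
// first step that fromJSON refuses. `Step` is the class exported by
// prosemirror-transform, injected by the caller (an assumption made to
// keep this snippet self-contained).
function parseSteps(schema, stepsJSON, Step) {
  if (!Array.isArray(stepsJSON)) {
    return { ok: false, error: "expected an array of steps" };
  }
  const steps = [];
  for (const json of stepsJSON) {
    try {
      steps.push(Step.fromJSON(schema, json));
    } catch (err) {
      // Malformed input (for example a non-numeric `from`) ends up here.
      return { ok: false, error: String(err && err.message) };
    }
  }
  return { ok: true, steps };
}
```

Note that a successful parse only means the JSON had the right shape; it says nothing yet about whether the steps can be applied to the current document.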

mitar · March 15, 2018, 6:09pm

Oh, thank you.

So if I understand correctly, we can now use fromJSON, and if it succeeds we know that the step also conforms to the schema? Or should we also do an additional schema check on the step itself?

And then we should also apply the step to the in-memory document state, and if this succeeds, we should still call .check() on the document to make sure the result still conforms to the schema?

Is there anything else we should care about? Are there ways to send steps which would do some denial of service? Explode memory usage? Things like that?

And if I understand correctly, you are saying that keeping the document state in memory is the best approach to validate that a step is reasonable and can be applied on top of the existing history? Do we have to keep all steps in the history of this document state (I am assuming it keeps them), or can we somehow keep only the latest state?

marijn · March 16, 2018, 9:50am

fromJSON doesn’t check that. It only verifies that the data has the right types, not that it satisfies schema constraints. But if you run check on the result of applying such steps, that should be enough to cover those.

I think JSON-encoded steps can only increase the document size to a degree proportional to their own size. People can of course fill your system up with useless data, but they’d at least need to use bandwidth and time to do it.
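
Since the growth a batch of steps can cause is bounded by its own size, one simple guard is to cap the size of the incoming payload before parsing it. The sketch below is only an illustration under assumptions: the helper name withinSizeLimit and the 64 KiB limit are made up for this example, and Buffer is the Node.js global.

```javascript
// Hypothetical limit; pick a value that suits your documents.
const MAX_STEP_PAYLOAD_BYTES = 64 * 1024;

// Sketch: return true if the JSON encoding of the submitted steps is
// no larger than `maxBytes` when encoded as UTF-8.
function withinSizeLimit(stepsJSON, maxBytes = MAX_STEP_PAYLOAD_BYTES) {
  const encoded = JSON.stringify(stepsJSON);
  return Buffer.byteLength(encoded, "utf8") <= maxBytes;
}
```

In practice the raw request body could be measured instead, before it is parsed at all.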

Yes, to verify that applying steps is valid you’ll need to actually apply them (or do equivalent work, but I don’t see a reason not to use the existing code for that).

You only need to keep the steps if you want to do something with them (like show the document history). If you’re just interested in the current document, that’s all you need to store.
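
A minimal sketch of such a server-side holder, keeping just the current document and a version counter in memory, might look like this. The class name DocumentAuthority, the version counter, and the receiveSteps method are assumptions for illustration and are not prescribed anywhere in this thread; the steps are expected to be already-parsed Step objects.

```javascript
// Sketch: per-document authority that holds only the latest document,
// not the step history.
class DocumentAuthority {
  constructor(doc) {
    this.doc = doc;     // current ProseMirror document node
    this.version = 0;   // number of steps accepted so far
  }

  // `clientVersion` is the version the client based its steps on.
  // Returns false if the client is out of date, true if the steps were
  // accepted, and throws if a step fails or the result is invalid.
  receiveSteps(clientVersion, steps) {
    if (clientVersion !== this.version) return false;
    let next = this.doc;
    for (const step of steps) {
      const result = step.apply(next);
      if (result.failed) throw new Error(result.failed);
      next = result.doc;
    }
    next.check(); // throws if the new document violates the schema
    // Commit only after every step applied and the check passed.
    this.doc = next;
    this.version += steps.length;
    return true;
  }
}
```

Clients that have fallen behind would still need the steps they missed, which in the setup described in this thread would come from the database rather than from this in-memory object.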

mitar · March 16, 2018, 4:22pm

Thanks for the answers.

Just to be clear: applying a step to a document instance does not make the document instance store the step as well. It just updates its state to a new state, yes?

marijn · March 17, 2018, 8:39pm

Yes. These are all pretty straightforward data structures with little hidden stuff in them. A document is just a tree of nodes, nothing else.

mitar · April 9, 2018, 9:06pm

One more question. Is it also possible to use check to check the steps themselves? On the server, where steps is an array of all steps to apply, we were also thinking of doing the following:

steps.forEach((step) => {
  if (step.slice) {
    step.slice.content.descendants((node) => {
      node.check(); // will throw an error if node is not valid
    });
  }
});

But this seems to fail when adding a blockquote.

marijn · April 10, 2018, 6:36am

No, that’s not currently implemented. Slices may be partially open, so calling node.check() on their content like that may throw errors even for perfectly valid slices (say, a list item that’s open at the start and doesn’t start with a paragraph).

mitar · April 10, 2018, 6:53am

Thanks for the explanation. This is also what we were observing.