How to handle edge case with collab module

I have the following rare edge case when using ProseMirror with the collab module:

  1. Client sends steps to server with version 0
  2. Server applies steps and increments to version 1
  3. A network connection problem leads to a timeout on the client, meaning that the client does not call confirmSteps and stays at version 0 without knowing that the server applied the steps
  4. Client reconnects and tries to apply the same steps as before with version 0
  5. Server replies with an “outdated version” error
  6. Client requests the latest steps starting from version 0
  7. Server responds with the steps that were sent in step 1
  8. Client applies the steps incrementing the version to 1 (now in sync with the server) and rebases its locally pending steps (which are the same steps the server already received and applied)
  9. Client sends the rebased steps to server
  10. Server applies the steps
  11. Client and server are now in sync with version 2, but the same steps have been applied twice
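To make the ambiguity concrete, here is a minimal sketch in plain TypeScript of the kind of send-with-timeout logic that produces step 3 (none of these names come from the collab module, they are just for illustration):

    type Step = unknown;
    interface SendResult { ok: boolean; reason?: "outdated" }

    // A timeout like this cannot distinguish "the request never reached the
    // server" from "the server applied the steps but its reply got lost" --
    // which is exactly the ambiguity in step 3 above.
    function sendWithTimeout(
      send: (version: number, steps: Step[]) => Promise<SendResult>,
      version: number, steps: Step[], ms = 5000
    ): Promise<SendResult> {
      return Promise.race([
        send(version, steps),
        new Promise<SendResult>((_, reject) =>
          setTimeout(() => reject(new Error("timeout")), ms)),
      ]);
    }

After the timeout fires, the client still holds the same steps as unconfirmed, which is why steps 4–11 end up applying them twice.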

Are there any recommendations or ideas on how to best handle this case? The fact that other collaborators might have changed the document during the whole time makes it additionally hard to handle this edge case.

One idea would be to let the client confirm every server confirmation before the server commits its changes, but this would lead to quite some overhead for the communication between server and client.
Another idea would be to add a unique identifier to each step, which the client could compare against after recovering from an “apply steps timeout”. That way the client could check if the latest steps from the server are the same as the local steps and prevent duplicates from being rebased and applied.
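A rough sketch of that second idea, in plain TypeScript with made-up names (the id scheme and helpers are hypothetical, not part of any existing API):

    type StepJSON = Record<string, unknown>;
    interface TaggedStep { id: string; step: StepJSON }

    // Hypothetical: give every locally created step a unique id before sending.
    const makeId = () =>
      Math.random().toString(36).slice(2) + Date.now().toString(36);

    function tagSteps(steps: StepJSON[]): TaggedStep[] {
      return steps.map(step => ({ id: makeId(), step }));
    }

    // After recovering from an "apply steps timeout": drop any local pending
    // steps whose ids already appear in the steps pulled from the server, so
    // they are neither rebased nor sent again.
    function dropAlreadyApplied(pending: TaggedStep[], fromServer: TaggedStep[]): TaggedStep[] {
      const seen = new Set(fromServer.map(s => s.id));
      return pending.filter(s => !seen.has(s.id));
    }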

Or maybe there are better ways to handle this?


What transport are you using for the server communication? Would it help to have the server only commit the changes if the client has acknowledged its response? (Which is already happening in TCP-based transports anyway.)

I’m using sockets via socket.io.
I think that acknowledge messages cannot solve this problem after all. I thought through different edge cases and there always seems to be a way for the client and server to end up in different states in the case of a (network) failure.

Here’s an example of what can still go wrong even when using an acknowledge message:

  1. Client sends steps to server (v1 uncommitted)
  2. Server applies steps and waits for an acknowledge message from the client (v1 uncommitted)
  3. Client receives success message from server and confirms the steps (v1 committed)
  4. Client sends an acknowledge message
  5. A network connection problem prevents the server from receiving the acknowledge message, which results in a timeout that leads the server to throw away the uncommitted changes (back to v0)
  6. Client is now at version 1 while the server is at version 0. The client thinks that the server is at version 1 too and does not have any uncommitted steps to send to the server.

These problems do not only occur in case of a network failure, but can also happen if the server or client crashes at a bad time. I’m not sure how we should best handle these cases in ProseMirror. Maybe using unique identifiers for steps would help solve the initial problem better (filtering out already applied steps before sending them again after a timeout).

I think you have a good point. I’m wondering how other systems act in a similar situation. For example, a database client might want to know whether its transaction went through. Does it just assume it might have gone either way when its connection was interrupted at the wrong point? I guess it has to. The term, as someone on Twitter told me, is the Two Generals’ Problem, and it has indeed been proven to have no solution.

I’ve filed #307 to track this.

Adding ids to steps might also simplify some tricky issues in history tracking, but it would mean (a little) more complexity. I wouldn’t filter the steps the way you described, but rather move the confirming of steps into the regular step-receiving part of the protocol. Instead of directly confirming the steps when successfully sent to the server, the client could notice, when it receives the steps from the server, that some of these are in its own unconfirmed step set, and confirm them.
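To illustrate the idea (this is just a sketch with placeholder names, not the actual collab API): incoming steps whose ids are in the local unconfirmed set count as confirmations, everything else is applied as a remote change.

    interface TaggedStep { id: string; step: unknown }

    interface CollabState {
      version: number;
      unconfirmed: TaggedStep[];   // sent but not yet seen back from the server
    }

    function receiveSteps(state: CollabState, incoming: TaggedStep[],
                          applyRemote: (s: TaggedStep) => void): CollabState {
      const ours = new Set(state.unconfirmed.map(s => s.id));
      for (const s of incoming) {
        if (!ours.has(s.id)) applyRemote(s);   // foreign step: apply (and rebase locals)
      }
      return {
        version: state.version + incoming.length,
        unconfirmed: state.unconfirmed.filter(s =>
          !incoming.some(i => i.id === s.id)),   // our own steps are now confirmed
      };
    }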

I’ll try to work on this next week.


This is not really solving the entire problem either, but at least it gives a little more stability:

In addition to the version number, we calculate checksums of the document contents. We then send the checksum of the doc as it was before the steps along with the steps themselves. Any client receiving the steps + checksum first checks whether the checksum corresponds to the confirmed version of the doc it holds itself. If yes, it applies the transformation steps. If no, it cancels and asks the server for the last full document plus subsequent diffs and tries to figure it out from that information.
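Roughly, the receiving side does something like this (a sketch with hypothetical names; any stable hash over the serialized confirmed document would do, the one below is just a stand-in):

    // Cheap stable hash over the serialized confirmed document (stand-in only).
    function checksum(docJSON: string): string {
      let h = 0;
      for (let i = 0; i < docJSON.length; i++) {
        h = (h * 31 + docJSON.charCodeAt(i)) | 0;
      }
      return h.toString(16);
    }

    interface StepMessage { baseChecksum: string; steps: unknown[] }

    function handleSteps(msg: StepMessage, confirmedDocJSON: string,
                         apply: (steps: unknown[]) => void,
                         requestFullDoc: () => void): void {
      if (msg.baseChecksum === checksum(confirmedDocJSON)) {
        apply(msg.steps);        // checksums match: safe to apply the steps
      } else {
        requestFullDoc();        // out of sync: recover from full doc + diffs
      }
    }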

It is seldom that I see the checksum test fail, but it does happen.

Additionally, after sending in a diff and before receiving confirmation, we don’t accept any other diffs. If we receive another diff from the server, we save it until we have received confirmation. If we don’t receive any confirmation within a given period of time, we suspect that either the diff or the confirmation was lost on the way.
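In code terms the bookkeeping looks roughly like this (a sketch with hypothetical names, not our actual code):

    interface Outgoing { id: string; steps: unknown[] }

    class DiffGate {
      private awaiting: Outgoing | null = null;
      private held: unknown[][] = [];            // server diffs held back meanwhile
      private timer: ReturnType<typeof setTimeout> | null = null;

      send(out: Outgoing, transmit: (o: Outgoing) => void,
           onTimeout: () => void, ms = 8000): void {
        this.awaiting = out;
        transmit(out);
        // If no confirmation arrives in time, suspect that the diff or the
        // confirmation was lost and let the caller recover.
        this.timer = setTimeout(onTimeout, ms);
      }

      onServerDiff(diff: unknown[], apply: (d: unknown[]) => void): void {
        if (this.awaiting) this.held.push(diff); // wait until our diff is confirmed
        else apply(diff);
      }

      onConfirmed(apply: (d: unknown[]) => void): void {
        if (this.timer) clearTimeout(this.timer);
        this.awaiting = null;
        for (const d of this.held.splice(0)) apply(d);
      }
    }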

Thanks for pointing that out. I had a feeling this was unsolvable and it’s nice to know the right term for this now.

I think that adding a little more complexity would definitely be worth it in this case.

I like this approach, sounds good!

@johanneswilm thanks for sharing, are there any other collaboration edge cases you encountered and are handling with these mechanisms? It would be great to document them so that we can handle them directly inside ProseMirror if possible.

Double application of the same step was one of them. I assumed that it was because of our general setup that we had this problem and that it didn’t happen for others. We haven’t figured out why the checksums don’t match in 100% of cases (this was when writing fast in PM <= 0.5.1), but we used to have a lot of cases of broken documents because invalid steps were applied. This seemed like a sane way of avoiding that.

A better way to avoid applying a step that originated on the same machine might be to add a signature (no encryption, just a random text string that stays the same throughout a session) to the step. So if a PM instance receives a step that carries its own signature, it just skips that one.
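Something along these lines (just a sketch, names are made up):

    // A random per-session signature attached to every outgoing step bundle.
    const sessionSignature = Math.random().toString(36).slice(2);

    interface SignedBundle { signature: string; steps: unknown[] }

    function signBundle(steps: unknown[]): SignedBundle {
      return { signature: sessionSignature, steps };
    }

    function onIncomingBundle(bundle: SignedBundle,
                              apply: (steps: unknown[]) => void): void {
      if (bundle.signature === sessionSignature) return; // our own steps: skip
      apply(bundle.steps);
    }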

I’ve implemented a modified protocol. See #307 for details.

Ok, will try this out shortly. From the description of it, the change means that we go from:

Client A sends a bundle of steps to the server. Server responds to client A with accept message if bundle is accepted. Upon receipt of accept message, client A marks steps as being confirmed. Server forwards the accepted bundle to all clients connected to that doc EXCEPT client A.

To:

Client A sends in a bundle of steps to the server. If accepted, the server forwards the bundle to all clients connected to that doc INCLUDING client A. Client A notices that the bundle originated from itself, so instead of applying the steps, client A marks them as having been confirmed.
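On the server side I picture that roughly like this (just a sketch to check my understanding, with made-up names; our actual server isn’t JavaScript):

    interface Bundle { clientID: string; version: number; steps: unknown[] }

    interface Doc { version: number; clients: Set<(b: Bundle) => void> }

    function receiveBundle(doc: Doc, bundle: Bundle): boolean {
      if (bundle.version !== doc.version) return false; // outdated: client must pull first
      doc.version += bundle.steps.length;
      for (const sendTo of doc.clients) sendTo(bundle); // INCLUDING the originator
      return true;
    }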

Correct?

What if client A receives steps that have been applied already? Will they automatically be discarded? (Currently we check and deal with this in our own code)

Yes, that description sounds correct.

Client A can trivially see, through the monotonically increasing version number, which steps it has already applied. In pull-based incarnations of this protocol, it will query the server to give it only the changes after version X.

Trying to do the switch is harder than expected. This is mainly because we are bundling the steps together with other things and expect the entire bundle to be confirmed (or not confirmed) as a unit. For example: someone makes a change within the footnote editor (a separate pm instance). This causes a change also in the main editor. We send the steps of both to the server and expect both to either be accepted or rejected. If we send them separately, and say only the footnote editor changes are received and accepted while the main editor changes are lost in transit, then that causes a new series of issues I haven’t entirely thought through yet. So I think I may have to build something on top of the pm code that essentially does what “confirmSteps” used to do. This new functionality is still useful for each pm instance by itself.

@kiejo: When your client disconnects, do you actually get a disconnect message on the server? To me it seems that some bundles just get randomly lost, even though no disconnect/reconnect is happening. But from googling around, websocket connections shouldn’t really “lose data” like that.

You could add the newly applied steps and clientIDs to your confirmation message on the server and directly call receive instead of confirmSteps on the client. That’s what I’m doing now and it should result in the same behavior as the confirmSteps method we used before. That way you also don’t have to send a “new version” message to the client who is responsible for the new changes. Basically it’s pretty much the same logic as before. If the confirmation message gets lost due to some connection issue, that is not a problem, as the new code will make sure not to apply the steps a second time during the next retrieval of remote steps.
Maintaining consistency across multiple PM instances is a different challenge of course…
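In other words, roughly this (placeholder names; receive stands for whatever feeds remote steps and their clientIDs into the editor):

    interface Confirmation { steps: unknown[]; clientIDs: string[] }

    function onConfirmation(conf: Confirmation,
                            receive: (steps: unknown[], clientIDs: string[]) => void): void {
      // receive() recognizes our own clientID and marks those steps as
      // confirmed instead of applying them again. A lost confirmation is
      // harmless, because the same steps come back on the next pull of
      // remote steps.
      receive(conf.steps, conf.clientIDs);
    }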

I think this might depend on the socket library you’re using. With socket.io a ping is used to check the connection status, and it might very well be that you receive a disconnect message 30 seconds after the connection was actually lost (depending on the ping interval). I think this also depends on how the socket is closed. Cleanly closing the socket should fire the disconnect events immediately.
Regarding lost packets, I know that socket.io buffers messages if you call emit while disconnected: send buffer code. But I’m pretty sure that the current syncing protocol ensures correctness even if some messages get lost. What concrete problems are you experiencing with regard to lost bundles?

Yes, same here. That’s what I meant. Except we already gave each bundle an identifier and the server confirms that identifier. So we don’t really need to send the clientId back.

Interesting. With Tornado as the server, the disconnection happens pretty much instantaneously, even if the disconnect isn’t clean. At least in all our tests it has been working that way.

Clients that have never (officially) been disconnected receive a bundle and that’s when they notice their version number is too low. So they send a request to the server to resend the missing steps.

Thinking about it, it might be possible that the behavior I described only occurs on the client while on the server I get the disconnect immediately. For the client I know for sure that there is some delay in detecting a disconnection of the socket. I can see a delay for example when I deactivate my network adapter.

I’m not sure what the connection problem is, but it sounds like this shouldn’t affect the final document output?
If the client’s version number is too low, it requests the latest steps starting from its own version number and calls receive on these steps. If the steps originated from this client, the steps get confirmed, otherwise they get applied and unconfirmed local steps get rebased and sent to the server.
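The catch-up path, as a sketch with hypothetical names:

    interface ServerAPI {
      stepsSince(version: number): Promise<{ steps: unknown[]; clientIDs: string[] }>;
    }

    async function catchUp(myVersion: number, server: ServerAPI,
                           receive: (steps: unknown[], clientIDs: string[]) => void,
                           sendPending: () => void): Promise<void> {
      const { steps, clientIDs } = await server.stepsSince(myVersion);
      receive(steps, clientIDs); // own steps get confirmed, foreign ones applied,
                                 // remaining unconfirmed local steps get rebased
      sendPending();             // then resend whatever is still unconfirmed
    }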
Do the connection issues cause problems with the document syncing?

Right. Of course, taken together, our setup currently works for me. But that is because we have a lot of extra checks and timeouts and we don’t assume that what is being sent actually arrives. All this makes the app a lot more complex, with more potential room for bugs, etc.

So I was thinking: what if I simply keep all the messages I send back and forth and add just one version number, starting with zero from connect (no matter whether for pm, a chat, etc.), and if it is too low, request everything missing again? But then I realized that websocket itself should already take care of that, and I started wondering whether this “data loss” really is on the part of the connection or whether there could be another explanation for it.

About once a month the university project that builds on top of Fidus Writer meets up physically. There they try to write a document with some 10 participants. Last time they still managed to break it. Unfortunately, they don’t provide debug output, so I’m a bit in the dark as to possible causes. But this “data loss” looks like a possible candidate to me.

I see, I agree that it would definitely be nice to remove some of that extra complexity.

This goes back to the Two Generals’ Problem, which @marijn mentioned earlier. As long as we’re working with an “unreliable link”, we simply cannot assume that what’s being sent actually arrives.
Two quotes from the wiki article: “it shows that TCP can’t guarantee state consistency between endpoints” … “it applies to any type of two party communication where failures of communication are possible”.

Unfortunately sockets do not solve this underlying problem, which is why we have to handle this on the application level. Based on what I have read, I think that there’s no way around the additional complexity of handling these particular edge cases.
I’m not saying that there isn’t any other explanation for the “data loss”, but that some of the additional code and complexity is necessary if we want to reliably synchronize state on an unreliable connection.

This is interesting as we are struggling with similar issues, too.
In our case I do not think that the problems are caused by connection issues, but by steps that cannot be successfully applied in certain situations (this manifests in errors like “Position X out of range”). These errors occur rarely in the latest PM version (@marijn was very fast in responding to these issues).
I think what happens is that the steps are applied on the client and even though they throw an error, everything looks fine to the user (I opened a similar issue some time ago). On the server the error is thrown, we catch it, reply with an error to the client, and do not broadcast these steps. Refreshing the page and going back to the server version is the only workaround I currently know of in this case. I’m not sure what the best way to handle this kind of error is.
Could it be possible that something similar is happening in your case, too? Maybe we should open another thread in that case.

I just thought of another potential problem you might be experiencing. How are you applying and saving new steps on the server? If you are doing this asynchronously or using multiple threads, you need to perform some locking/unlocking logic on the documents. I ran into this issue early on as we are persisting changes to documents in an asynchronous fashion in Node.js on the server. Without locking a document when applying new steps it is possible for multiple clients to apply different steps to the same version of the document at the same time resulting in the last write winning over the others. This was hard to reproduce with few collaborators, but with more collaborators it happened more frequently.
I could provide more information on how we handle locking/unlocking documents, if you are interested.
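As a rough idea of the shape of it (a sketch with made-up names; our real code does more than this): the point is simply that applying and persisting steps for one document is serialized, so two clients can never both write against the same version.

    class DocLocks {
      private tails = new Map<string, Promise<unknown>>();

      // Run job only after every previously queued job for this document
      // has settled, so writes to one document never overlap.
      withLock<T>(docID: string, job: () => Promise<T>): Promise<T> {
        const tail = this.tails.get(docID) ?? Promise.resolve();
        const next = tail.then(job, job);
        this.tails.set(docID, next.catch(() => undefined));
        return next;
      }
    }

    // Usage sketch:
    // locks.withLock(docID, () => applyAndPersistSteps(docID, bundle));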

Yes, let’s open another thread titled something like “avoiding dataloss problems over websocket connections”

As for saving: our server isn’t running JavaScript. It only receives steps and checks if the version number is correct, and if it is, it increases the version number and forwards the steps to all other connected clients. At least one of the clients periodically sends in an HTML version of the entire document. For this purpose it uses the confirmed version of the document, without any unconfirmed steps applied.

On the server, the document is just opened once and this instance is shared through Tornado with all connected clients. When the server receives the full HTML version, it saves the full version + recently received steps and a few other things to the database.

As I said, I’m not sure what the issue is. I think it was likely multiple things, and fewer things now than a few months ago. I can spend half an hour collaborating with one or two other users, and it all works fine. There seems to be no easily defined user action that triggers it. I will know more on Monday, when they have their next test.

That’s an interesting setup you have! Then let’s wait for the next test to find out more.