How to handle edge cases with the collab module

I’ve implemented a modified protocol. See #307 for details.

Ok, will try this out shortly. From the description of it, the change means that we go from:

Client A sends a bundle of steps to the server. The server responds to client A with an accept message if the bundle is accepted. Upon receipt of the accept message, client A marks the steps as confirmed. The server forwards the accepted bundle to all clients connected to that doc EXCEPT client A.

To:

Client A sends a bundle of steps to the server. If accepted, the server forwards the bundle to all clients connected to that doc INCLUDING client A. Client A notices that the bundle originated from itself, so instead of applying the steps, client A marks them as confirmed.

Correct?
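
Or, in code, roughly how I picture the new handling on the client (made-up names and message shape for illustration, not the actual pm API):

```js
// bundle:       { clientId, steps } as broadcast by the server (assumed shape)
// ownClientId:  identifier of this client
// confirmSteps: callback that marks our own steps as confirmed locally
// applySteps:   callback that applies steps coming from other clients
function onBundleBroadcast(bundle, ownClientId, confirmSteps, applySteps) {
  if (bundle.clientId === ownClientId) {
    // Our own steps came back from the server: don't apply them a second
    // time, just mark them as confirmed.
    confirmSteps(bundle.steps)
  } else {
    // Steps from another collaborator: apply them to the local document.
    applySteps(bundle.steps)
  }
}
```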

What if client A receives steps that have been applied already? Will they automatically be discarded? (Currently we check and deal with this in our own code)

Yes, that description sounds correct.

Client A can trivially see, through the monotonically increasing version number, which steps it has already applied. In pull-based incarnations of this protocol, it will query the server for only the changes after version X.

Trying to do the switch is harder than expected. This is mainly because we are bundling the steps together with other things and expect the entire thing to be confirmed (or rejected) as a unit. For example: someone makes a change within the footnote editor (a separate pm instance). This also causes a change in the main editor. We send the steps of both to the server and expect both to either be accepted or rejected. If we send them separately, and say only the footnote editor changes are received and accepted while the main editor changes are lost in transit, then that causes a new set of issues I haven’t entirely thought through yet. So I think I may have to build something on top of the pm code that essentially does what “confirmSteps” used to do. This new functionality is still useful for each pm instance by itself.
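
Roughly what I have in mind, with made-up names (not actual code from our app):

```js
// Illustrative only: wrap the unconfirmed steps of both pm instances into one
// bundle that the server accepts or rejects as a whole.
let bundleCounter = 0

function makeBundle(mainSteps, footnoteSteps) {
  return {
    id: ++bundleCounter,       // the server echoes this id back on accept
    main: mainSteps,           // steps from the main editor
    footnote: footnoteSteps    // steps from the footnote editor
  }
}

// On an accept message for bundle.id, confirm the steps in BOTH editors;
// on reject (or no answer), neither editor confirms and the bundle is resent.
```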

@kiejo: When your client disconnects, do you actually get a disconnect message on the server? To me it seems that some bundles just get randomly lost, even though there is no disconnect/reconnect happening. But from googling around, websocket connections shouldn’t really “lose data” like that.

You could add the newly applied steps and clientIDs to your confirmation message on the server and directly call receive instead of confirmSteps on the client. That’s what I’m doing now and it should result in the same behavior as the confirmSteps method we used before. That way you also don’t have to send a “new version” message to the client who is responsible for the new changes. Basically it’s pretty much the same logic as before. If the confirmation message gets lost due to some connection issue, that is not a problem, as the new code will make sure not to apply the steps a second time during the next retrieval of remote steps.
Maintaining consistency across multiple PM instances is a different challenge of course…
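
In sketch form, assuming the collab module exposes a receive(steps, clientIDs) entry point as discussed above (the message shape is made up):

```js
// Hypothetical handler for the server's confirmation message, which carries
// the accepted steps plus the clientIDs they came from (assumed shape).
function onConfirmation(msg, collab) {
  // Because our own clientID is among msg.clientIDs for our own steps,
  // receive() confirms those instead of applying them a second time.
  collab.receive(msg.steps, msg.clientIDs)
}
```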

I think this might depend on the socket library you’re using. With socket.io a ping is used to check the connection status and it might very well be that you receive a disconnect message 30 seconds after the connection was actually lost (depending on the ping interval). I think this also depends on how the socket is closed. Cleanly closing the socket should fire the disconnected events immediately.
In regards to lost packets, I know that socket.io buffers messages if you call emit while being disconnected: send buffer code. But I’m pretty sure that the current syncing protocol ensures correctness even if some messages get lost. What concrete problems are you experiencing in regards to lost bundles?

Yes, same here. That’s what I meant. Except we already gave each bundle an identifier and the server confirms that identifier. So we don’t really need to send the clientId back.

Interesting. With Tornado as the server, the disconnection happens pretty much instantaneously, even if the disconnect isn’t clean. At least in all our tests it has worked that way.

Clients that have never (officially) been disconnected receive a bundle and that’s when they notice their version number is too low. So they send a request to the server to resend the missing steps.

Thinking about it, it might be possible that the behavior I described only occurs on the client while on the server I get the disconnect immediately. For the client I know for sure that there is some delay in detecting a disconnection of the socket. I can see a delay for example when I deactivate my network adapter.

I’m not sure what the connection problem is, but it sounds like this shouldn’t affect the final document output?
If the client’s version number is too low, it requests the latest steps starting from its own version number and calls receive on these steps. If the steps originated from this client, the steps get confirmed, otherwise they get applied and unconfirmed local steps get rebased and sent to the server.
Do the connection issues cause problems with the document syncing?
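
The catch-up logic in sketch form (collab.version, collab.receive(steps, clientIDs) and requestSteps are assumed/made-up names, not the actual API):

```js
// If our confirmed version is behind the server's, ask the server for
// everything after it and hand the result to the collab module.
// requestSteps(fromVersion) is a made-up transport function returning a promise.
function catchUp(collab, serverVersion, requestSteps) {
  if (collab.version >= serverVersion) return Promise.resolve()   // nothing to do
  return requestSteps(collab.version).then(({steps, clientIDs}) => {
    // Our own steps get confirmed, foreign steps get applied, and any
    // unconfirmed local steps are rebased on top and can then be re-sent.
    collab.receive(steps, clientIDs)
  })
}
```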

Right. Of course, taken together, our setup currently works for me. But that is because we have a lot of extra checks and timeouts and we don’t assume that what is being sent actually arrives. All this makes the app a lot more complex, with more potential room for bugs, etc.

So I was thinking: what if I simply keep all the messages I send back and forth and add just one version number, starting at zero on connect (no matter whether it’s for pm, a chat, etc.), and if the version is too low, request everything missing again? But then I realized that websocket itself should take care of that already, and I started wondering whether this “data loss” really is on the part of the connection or whether there could be another explanation for it.
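
Purely for illustration, the idea was something like this (made-up names, nothing I have actually built):

```js
// Number every outgoing message per connection and keep it until
// acknowledged, so missing messages can be detected and resent.
function makeOutbox(send) {
  let seq = 0
  const pending = new Map()
  return {
    send(payload) {
      const msg = {seq: seq++, payload}
      pending.set(msg.seq, msg)
      send(msg)
    },
    // The other side reports the highest seq it has received; everything
    // older is dropped from the buffer, everything newer is resent.
    ack(receivedUpTo) {
      for (const [s, msg] of pending) {
        if (s <= receivedUpTo) pending.delete(s)
        else send(msg)
      }
    }
  }
}
```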

About once a month the university project that builds on top of Fidus Writer meets up physically. There they try to write a document with some 10 participants. Last time they still managed to break it. Unfortunately, they don’t provide debug output, so I’m a bit in the dark as to possible causes. But this “data loss” looks like a possible candidate to me.

I see, I agree that it would definitely be nice to remove some of that extra complexity.

This goes back to the Two Generals’ Problem, which @marijn mentioned earlier. As long as we’re working with an “unreliable link”, we simply cannot assume that what’s being sent actually arrives.
Two quotes from the wiki article: “it shows that TCP can’t guarantee state consistency between endpoints” … “it applies to any type of two party communication where failures of communication are possible”.

Unfortunately sockets do not solve this underlying problem, which is why we have to handle this on the application level. Based on what I have read, I think that there’s no way around the additional complexity of handling these particular edge cases.
I’m not saying that there isn’t any other explanation for the “data loss”, but that some of the additional code and complexity is necessary if we want to reliably synchronize state on an unreliable connection.

This is interesting as we are struggling with similar issues, too.
In our case I do not think that the problems are caused by connection issues, but by steps that cannot be successfully applied in certain situations (this manifests in errors like “Position X out of range”). These errors occur rarely in the latest PM version (@marijn was very fast in responding to these issues).
I think what happens is that the steps are applied on the client and, even though they throw an error, everything looks fine to the user (I opened a similar issue some time ago). On the server the error is thrown, and we catch it, reply with an error to the client and do not broadcast these steps. In that case, refreshing the page and going back to the server version is the only workaround I currently know of. I’m not sure what the best way to handle this kind of error is.
Could it be possible that something similar is happening in your case, too? Maybe we should open another thread in that case.

I just thought of another potential problem you might be experiencing. How are you applying and saving new steps on the server? If you are doing this asynchronously or using multiple threads, you need to perform some locking/unlocking logic on the documents. I ran into this issue early on as we are persisting changes to documents in an asynchronous fashion in Node.js on the server. Without locking a document when applying new steps it is possible for multiple clients to apply different steps to the same version of the document at the same time resulting in the last write winning over the others. This was hard to reproduce with few collaborators, but with more collaborators it happened more frequently.
I could provide more information on how we handle locking/unlocking documents, if you are interested.
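
For reference, the basic shape of such a lock in Node.js can be as simple as one promise chain per document (a sketch with made-up names, not our actual code):

```js
// One promise chain per document: each new "apply steps" job runs only after
// the previous one for the same document has finished, so two clients can
// never apply different steps to the same document version concurrently.
const docQueues = new Map()

function withDocumentLock(docId, job) {
  const previous = docQueues.get(docId) || Promise.resolve()
  const next = previous.then(job, job)   // run even if the previous job failed
  docQueues.set(docId, next)
  return next
}

// Usage (applyAndPersistSteps is a placeholder for your own async handler):
// withDocumentLock(docId, () => applyAndPersistSteps(docId, version, steps))
```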

Yes, let’s open another thread titled something like “avoiding data loss problems over websocket connections”

As for saving: our server isn’t running javascript. So it only receives steps and checks whether the version number is correct; if it is, it increases the version number and forwards the steps to all other connected clients. At least one of the clients periodically sends in an HTML version of the entire document. For this purpose it uses the confirmed version of the document, without any unconfirmed steps applied.
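
Even though our server isn’t javascript, the logic is roughly this (illustration only, made-up names):

```js
// Accept a bundle only if it is based on the current version, then bump the
// version, keep the steps so lagging clients can request them, and broadcast.
function handleBundle(doc, bundle, broadcast) {
  if (bundle.version !== doc.version) {
    return {accepted: false, version: doc.version}   // client has to catch up
  }
  doc.version += 1                       // or by the number of steps, depending on the scheme
  doc.steps = doc.steps.concat(bundle.steps)
  broadcast(bundle)                      // forward to the connected clients
  return {accepted: true, version: doc.version}
}
```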

On the server, the document is just opened once and this instance is shared through Tornado with all connected clients. When the server receives the full HTML version, it saves the full version + recently received steps and a few other things to the database.

As I said, I’m not sure what the issue is. I think it was likely multiple things, and there are fewer of them now than a few months ago. I can spend half an hour collaborating with one or two others and it all works fine. There seems to be no easily defined user action that triggers it. I will know more on Monday, when they have their next test.

That’s an interesting setup you have! Then let’s wait for the next test to find out more.

The result of today’s test: while they discovered some oddities about Fidus Writer in general, they were not able to crash the editor. A second team was giving it a go, but they weren’t able to break it either. So beyond known issues, we seem to be relatively fine.

Nice, that’s great to hear!

The new protocol has a pitfall of its own – it’s not hard to work around, but it’s awkward. If you have multiple changes being made in quick succession, the client will send the first one to the server, but won’t assume it arrived there. It might then, before it gets its changes back from the server, try to send again. At that point, its confirmed version is still the old one, so it’ll try to send all steps, including the previously sent ones, leading to a conflict. Conflicts should be responded to by fetching new changes anyway, so if you put everything together correctly, the end result should still be fine, but you do get a bunch of unnecessary requests happening.

To avoid these, we could bring back a redundant confirmation response to submitting new changes, which the client can use, but which, through the adjustments made in response to this issue, it doesn’t need to rely on. But I’m not sure I want to add more complexity. @kiejo Did you run into this issue? If so, how did you solve it?

The way I solved it: have the server still create confirmation messages, but just use the receive function to confirm them with pm. Also, when a change has been sent to the server, don’t send another change until either A) the first change has been confirmed by the server or B) 1000 ms have passed.
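
In sketch form (made-up names, the 1000 ms from above):

```js
// Only one bundle may be in flight at a time. A new send is allowed once the
// server has confirmed the previous one, or after a 1000 ms timeout.
function makeSendGate(sendToServer, timeoutMs = 1000) {
  let inFlight = false
  let timer = null
  return {
    trySend(bundle) {
      if (inFlight) return false     // wait for confirmation or the timeout
      inFlight = true
      timer = setTimeout(() => { inFlight = false }, timeoutMs)
      sendToServer(bundle)
      return true
    },
    confirmed() {                    // call this when the server confirms
      clearTimeout(timer)
      inFlight = false
    }
  }
}
```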

I’m basically doing the same thing. The client always waits for a response from the server (success or failure) or times out before sending a new message. So there’s always at most one open request at a time.

My implementation also has only a single open request at a time, but when the send request finishes, you still have to take care not to immediately start another one (when you still have pending changes).

Yes, I always make sure to call receive on the collab module before “releasing the lock”.

I haven’t checked the demo collab code in a while and I’m using sockets in my implementation, so the following points might not apply, but these are additional edge cases that I’m handling on the client:

  • After confirming steps (regardless of whether the steps originated on the current client or on another client), make sure to send new locally pending steps if there are any. Otherwise it is possible for the last few steps to not get synced in case the user stopped typing while new steps were being received/transmitted (and no one else is typing anymore).
  • If we get a “new version” message while there is an already open getSteps request, schedule another getSteps request for when the current request finishes. Otherwise it is possible that we do not sync the latest few steps from other clients (if no one is typing anymore). Both cases are sketched below.
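
A sketch of both cases (made-up names; the collab interface is assumed as in the earlier posts):

```js
let stepsRequestOpen = false
let anotherRequestNeeded = false

// 1) After confirming steps, no matter whose they were, flush any steps that
//    are still pending locally, so the tail of the user's typing is not left
//    unsent when nobody types anymore.
function afterStepsConfirmed(collab, sendBundle) {
  const pending = collab.unconfirmedSteps()   // assumed accessor
  if (pending.length) sendBundle(pending)
}

// 2) If a "new version" message arrives while a getSteps request is already
//    open, remember to run another one when it finishes instead of dropping it.
function onNewVersion(getSteps) {
  if (stepsRequestOpen) { anotherRequestNeeded = true; return }
  runGetSteps(getSteps)
}

function runGetSteps(getSteps) {
  stepsRequestOpen = true
  getSteps().then(() => {
    stepsRequestOpen = false
    if (anotherRequestNeeded) {
      anotherRequestNeeded = false
      runGetSteps(getSteps)
    }
  })
}
```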

We haven’t run into syncing issues for quite some time now, so I’m confident that it is fairly robust by now.