Queryable AND collaborative ProseMirror documents

I have a large corpus of documents with every word marked with grammatical features. I want to full text search and render fragments of these documents. I also want to synchronize these documents across devices.

Storing the documents as HTML or JSON seems impractical. Secondary indices must be synchronized to each document. A single change will require updating multiple data structures and reserializing entire documents is expensive.

Storing the documents as YJS or Loro is good for collaboration and persistence, but the secondary indices problem remains. Also those glue layers sure are… sticky.

Since I’m interested in querying by word, a SQL schema like this seems natural:

create table document (
	id string primary key,
	-- other metadata
);
create table word (
	docId string,
	id float8 unique, -- globally unique, useful for collaboration

	block string, -- serialized node
	before string, -- punctuation
	text string,
	after string, -- punctuation

	stem string, -- stem of `text` for search
	chapter usmallint, -- for rendering/searching only a single chapter
	sentence int, -- for rendering/searching only a single sentence
	-- other useful word features for grammarians and translators
);
create table span (
	docId string,
	startWordId float8,
	endWordId float8,
	tag string,
	data string -- could make union types multiple different ways in SQL
);

My questions:

  1. Has anyone had success querying ProseMirror compatible documents in SQL databases?
    1. If so, how did your schema differ from above?
    2. If not, why not?
    3. Did you bother translating ProseMirror transactions into SQL statements or just periodically snapshot documents?
  2. How have others dealt with collaboration while maintaining querying and fragmenting? I’m looking at Doltlite as a potential solution but find solutions like articulated fascinating at preserving semantic intent.

After prototyping for a month I’ve answered most of my own questions.

  1. Only by 2-way mapping ordered IDs in a SQL database to positions in a ProseMirror document OR storing entire serialized documents.
    1. The schema’s been adapted to better fit the problem domain for convenient SQL queries.
    2. N/A
    3. SQL is the source of truth. ProseMirror transactions execute SQL and reactive queries update the UI. SQL holds true constraints and ProseMirror’s schema is an approximation.
  2. Doltlite is wonderful, but merge conflicts are tricky. Commits don’t map nicely to transactions nor steps.

For content addressing I was wrestling with ordered ids vs (version, pos) this week. I’ve come to the conclusion that if you want to serialize a document fragment without loading and reserializing the entire document then you want ordered ids. Otherwise an addition near the start of the document requires incrementing all successive positions which is O(n) vs updating a B-Tree which is O(log n). As soon as you try to hack around that with a bias that’s added when mutating existing positions, well, you’ve invented ordered ids.

To sanely use ordered ids (without a crazy two-way mapping and sacrificing performance) requires forking prosemirror-model and most prosemirror-* libraries. If anyone knows any existing document models that use ordered ids I’d love to try them out. I’ll update this thread with progress.