Data structure with ids

There no easy way to attach attributes only to top-level nodes, since they are associated with node types, and node types tend to be reused at different levels. Though it would actually be possible to create a schema where you use different node types for top-level versions of nodes and nested versions (i.e. top_paragraph can be a child of doc, and paragraph can only be a child of list_item, blockquote, etc).

You could try putting your data directly into attributes, but you’ll have many of the same issues: what happens when the node is split? what happens when it is copied? You no longer have to worry about uniqueness though, which might be helpful.

The tests you show are broken because they don’t match the schema defined. You can’t have text directly in a list item in the default schema, and my example idSchema only attaches ids to textblocks, which, in ProseMirror jargon, are nodes like paragraphs, that directly contain text.

I’ve thought about this some more, and am becoming increasingly convinced that such id tracking is just going to be a poor match for ProseMirror’s data model. I will probably end up rolling back the changes I made for this. Could you elaborate a bit on what you were hoping to achieve? Maybe we can figure out a better approach.

Our data model is an array of objects, like:

[
  { id: 'uuid-0001', type: 'h1', html: '<h1>hello world</h1>', metadata: {} },
  { id: 'uuid-0002', type: 'image', html: '<img ...>', 
    metadata: {
      starred: true,
      isBasedOnUrl: 'http...permalink',
      author: [{name: 'Photog Name', url: 'http...', avatar: '...jpg'}],
      publisher: {name: 'Flickr', url: 'http...'}
    },
    details: {
      src: 'http....jpg', 
      faces: [{x,y,w,h}, ...],
      colors: ['#AA0000', '#00AA00', ...]
    }
  },
  { id: 'uuid-0003', type: 'p', html: '<p>...</p>', metadata: {starred: true} },
]

There is too much data to meaningfully put into the DOM, so I’m trying to just use the id attribute to keep them synchronized. I got pretty far yesterday with the stuff you posted, and think that making a schema with top_paragraph etc. would help us in more ways than just these ids.

What is in the metadata, though? The JSON you showed doesn’t tell us much.

And what should happen when top level blocks are split, joined, deleted, or copied?

All of the the image analysis stuff in the image block details, for example, shouldn’t be in DOM. Prominent colors in image, position of faces detected. And the paragraph’s starred. metadata and details are used more for images and other media, but text blocks can have the same metadata.

Splitting a text node, the top keeps the metadata and the bottom gets a new block and id. Joining, the joined block has the top block’s id, and the bottom block’s meta is gone. Copied, maybe metadata copies but with a new id. Cut and paste, should keep original meta and id.

I reverted the changes I made in this context last week (they were too fragile). I think a viable approach to this would to just crudely attach ids to the nodes you are interested in, and periodically (either after each change or when saving), have dedicated code run over the document to detect nodes that need a new id and nodes whose id got duplicated, and fix them up. Such code would have global knowledge about the document, which the callbacks I was trying to use for this simply don’t have, and thus can perform the task in an easier to understand and more robust way.

Working on this in our homegrown library today, and landed on something like that. When calling our getDocument serializer, I check each top-level element and fix missing and duplicated ids.

Will try the same with my ProseMirror experiment.

Where would be a good place for that hook?

Argh, looks like I reverted a lot more than I wanted. I’ve restored the ability to have parsing and serialization methods on attributes in this commit. You can now do something like…

let nextID = 0
const idAttribute = new Attribute({compute: () => ++nextID})

idAttribute.parseDOM = (dom, options) => options.source != "paste" && dom.getAttribute("block-id") || ++nextID
idAttribute.serializeDOM = (dom, id) => dom.setAttribute("block-id", id)

And if create a schema like this…

let attrPred = (_, data) => data.type.prototype.isTextblock
const idSchema = new Schema(defaultSchema.spec.addAttribute(attrPred, "id", idAttribute))

… that will attach that attribute to all textblock nodes (paragraphs and such – you could also pass a different predicate). When such nodes get parsed or serialized, the attribute methods we defined before will be called to read from or add to the DOM node attribute block-id.

1 Like

This looks good and also comes handy when trying to uniquely identify footnotes. However, it seems that the attribute passing doesn’t take place when copy/pasting directly (when the pasted content is equal to lastCopied), so that ID numbers that are adding to elements will be there more than once if the user copies and pastes contents from one part of the document to another.

Yes, that is exactly what we discussed in this thread earlier. See this reply especially.

I see. I had not understood when that duplication would take place. So it’s directly after pasting on the pasted contents? Any other times?

For footnotes, I think it is OK to use two types of IDs: the visual ones, which can be done with css counters, which always have to be in order and start at 1 with no numbers missing. The browser can take care of that by itself. And then a second identifier which just needs to be unique and available to JavaScript so that footnotes can be attached correctly.

So it’s OK if the footnotes are numbered 10, 11, 12, 2, 3, 9 (in that order). There just cannot ever be two 9 in there.

But I now think this may make more sense for our use case:

As these points:

don’t really apply to us. The footnote marker will be non-editable and therefore cannot split, and when the footnote is copied, it’s fine if it the contents are just copied along with it.

Splitting (not relevant for non-divisible elements), ctrl-dragging, and of course any client code that manipulates the document could end up copying nodes.

One idea I want to explore is to leave footnotes as inline content in the text. That might not be fore everyone (it requires a different kind of interface, and you won’t be able to have multiple paragraphs in a single footnote), but it’d be easy to edit. In this model, footnotes would just be an inline ‘style’ (I have to rename ‘style’ to something else, will do so soon – sorry about the backwards-incompatibility). But there will be awkwardness there too – we’d have to somehow prevent such pieces of text from being split.

Here’s where I am so far with deduplication: with my class’ get content (), I’m calling this deduplication with both the live DOM and the current PM doc:

function domMutationDedupeIds (domChildren, doc) {
  let ids = []
  let len = domChildren.length
  for (let i=0; i<len; i++) {
    let child = domChildren[i]
    let id = child.getAttribute('data-grid-id')
    if (!id || ids.indexOf(id) !== -1) {
      id = uuid.v4()
      // Need to set it in both dom and doc for future consistency
      child.setAttribute('data-grid-id', id)
      doc.content[i].attrs.id = id
    }
    ids.push(id)
  }
}

Feels dangerous to set it both the DOM and the doc like that. Is there a cleaner way that I’m missing?

NOOOOOOOOO. Don’t mess with the DOM, and definitely don’t mutate ProseMirror document nodes like that. Go through the transform interface if you want to change something about the document.

Hehe… thought that it felt dangerous.

Tried to do it the right way, as a transform, but something isn’t working. Is setNodeType with the same type and different attrs the right method? How should I be getting the pos?

// Deduplicate ids
function makeTransformDedupeIds (doc) {
  let ids = []
  let transform = new Transform(doc);
  for (let i = doc.content.iter(), child; child = i.next().value;) {
    let id = child.attrs.id
    if (!id || ids.indexOf(id) !== -1) {
      id = uuid.v4()
      transform.setNodeType(new Pos([i.pos], 0), child.type, {id})
    }
    ids.push(id)
  }
  return transform
}

Another issue that I’m going to run into… I’m doing this on change (to translate to our format and save), and applying the transform triggers another change event. Is there a way make the apply silent?

Yes, and you should define the way you find the nodes in a way that gives you the pos. If you’re just scanning the whole document, doing so with the nodesBetween method should work well (just pass null, null for the positions so that it covers the whole document)

No, because often code listening for these events (change, transform, etc) will depend on receiving the event for every change. You could listen for transform events and only trigger your code if you see a Step that could actually introduce a duplicate or missing id (this could also very much narrow down the work you have to do).

Why are you doing this on every change? Why not just when exporting the modified document out of the editor?

I only need to check the top-level children / content, so using doc.content.iter() works, right?

I’m still missing something in applying the transform. The transform object looks correct, but after calling pm.apply(transform) the doc and DOM don’t seem to change.

Edit - transform.steps[0] looks correct, but transform.doc doesn’t have expected id changes.

B/c that’s how we have been doing it ;). I’m open to other ideas. We’re autosaving… that’s debounced, but when we get an update from the server we want to merge it with the last change event… this has bitten us when updates come while typing, and there was some hidden debouncing of change events that would cause you to lose some keystrokes and cursor positioning. That’s cruel.

Edit - that merging was happening above the editor component, but maybe should happen within… then we could debounce the autosave event, and only do this id deduping when needed.