Implementing annotations for ideographic languages with the HTML <ruby> tag

Hello everyone,

Thanks for this exceptional library.

I’ve been playing around: it’s great!

Here’s my question:

I’m trying to handle the <ruby> tag (see here for more information).

I first handled it successfully as a node, but now I think a mark would be better.

I have done most of the work, but I’d like to finish it properly.

I have a structure like:

 <ruby lang="ja">ソムヌス<rt lang="la">Somnus</rt></ruby>

with a schema like:

rubylang: {
      attrs: {
        lang: {default: null},
        rt: {default: null},
        rtlang: {default: null},
      },
      parseDOM: [
        {tag: "ruby[lang]", getAttrs(dom) {
          const rt = dom.querySelector("rt[lang]");
          return {lang: dom.lang, rt: rt.innerHTML, rtlang: rt.lang};
        }},
        {tag: "rt", ignore: true},
      ],
      toDOM(node) {
        let {lang, rt, rtlang} = node.attrs;
        return ["ruby", {lang}, ["rb", 0], ["rt", {rtlang}, rt]];
      },
    },

Here is the output:

{
            "type": "text",
            "marks": [
              {
                "type": "rubylang",
                "attrs": {
                  "lang": "ja",
                  "rt": "Somnus",
                  "rtlang": "la"
                }
              }
            ],
            "text": "ソムヌス"
},

I am very happy with this!

My question is about the toDOM method.

With the above, I get a structure like:

<ruby lang="ja"><rb>ソムヌス</rb><rt rtlang="la">Somnus</rt></ruby>

I would like to get rid of those <rb></rb> tags as they don’t mean anything.

I have read the docs as carefully as I could but I can’t really understand them in detail.

Thanks in advance for your help ☆

You can do something like this:

toDOM: () => [
        'ruby',
        { lang: 'jojo' },
        ['span', 0],
        ['rt', { rtlang: 'ji' }, '1'],
      ]

But as you can tell, ProseMirror follows the rule that a content hole must be the only child of its parent node, so you will have to wrap the 0 in a span or something more appropriate.
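
To make the rule concrete, here is a minimal sketch of a standalone check that walks an array-style output spec and reports whether every content hole (the 0) is the only child of its parent. The function name holeIsOnlyChild is hypothetical and is not part of ProseMirror; the sketch only mirrors the rule quoted above.

```javascript
// Hypothetical helper, not a ProseMirror API: checks the "content hole must be
// the only child of its parent" rule on an array-style output spec.
function holeIsOnlyChild(spec) {
  if (!Array.isArray(spec)) return true; // strings are plain text leaves
  const second = spec[1];
  // The second element is an attribute object when it is a plain object.
  const hasAttrs = second != null && typeof second === "object" && !Array.isArray(second);
  const children = spec.slice(hasAttrs ? 2 : 1);
  for (const child of children) {
    if (child === 0) {
      if (children.length !== 1) return false; // the hole has siblings
    } else if (!holeIsOnlyChild(child)) {
      return false;
    }
  }
  return true;
}

// The hole is wrapped, so it is alone inside <rb>.
const wrapped = ["ruby", {lang: "ja"}, ["rb", 0], ["rt", {rtlang: "la"}, "Somnus"]];
// The hole sits next to <rt>, which is what the rule forbids.
const unwrapped = ["ruby", {lang: "ja"}, 0, ["rt", {rtlang: "la"}, "Somnus"]];

console.log(holeIsOnlyChild(wrapped));   // true
console.log(holeIsOnlyChild(unwrapped)); // false
```

This is why the spec with ["rb", 0] or ["span", 0] is accepted, while placing the 0 directly next to the rt element is not.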

Thank you very much for your reply but this is not a solution.

You’re just replacing the <rb> tag with a <span>!

I’d like to have:

<ruby lang="ja">ソムヌス<rt rtlang="la">Somnus</rt></ruby>

There must be a way to achieve that.

OK, I just realized that I have another problem.

Please see the update below.

It reflects exactly the problem I’d like to solve.

Sorry about that.

Here’s my question:

I’m trying to handle the <ruby> tag (see here for more information). I first handled it successfully as a node, but now I think a mark would be better. I have done most of the work, but I’d like to finish it properly.

I have a structure like:

 <ruby lang="ja">ルネ<rp>(</rp><rt lang="fr">René</rt><rp>)</rp>=<rp>(</rp><rt lang="fr">-</rt><rp>)</rp>アントワーヌ<rp>(</rp><rt lang="fr">Antoine</rt><rp>)</rp></ruby>

with a schema like:

rubylang: {
      attrs: {
        lang: {default: null},
        rt: {default: null},
        rtlang: {default: null},
      },
      parseDOM: [
        {tag: "ruby[lang]", getAttrs(dom) {
          const rt = dom.querySelector("rt[lang]");
          return {lang: dom.lang, rt: rt.innerHTML, rtlang: rt.lang};
        }},
        {tag: "rt", ignore: true},
      ],
      toDOM(node) {
        let {lang, rt, rtlang} = node.attrs;
        return ["ruby", {lang}, ["rb", 0], ["rt", {rtlang}, rt]];
      },
    },

Here is the output:

 {
        "type": "text",
        "marks": [
          {
            "type": "rubylang",
            "attrs": {
              "lang": "ja",
              "rt": "René",
              "rtlang": "fr"
            }
          }
        ],
        "text": "ルネ=アントワーヌ"
},

I am not happy with this!

I would like something like:

 {
        "type": "text",
        "marks": [
          {
            "type": "rubylang",
            "attrs": {
              "lang": "ja",
              "rt": "René",
              "rtlang": "fr"
            }
          }
        ],
        "text": "ルネ"
},
 {
        "type": "text",
        "marks": [
          {
            "type": "rubylang",
            "attrs": {
              "lang": "ja",
              "rt": "-",
              "rtlang": "fr"
            }
          }
        ],
        "text": "="
},
 {
        "type": "text",
        "marks": [
          {
            "type": "rubylang",
            "attrs": {
              "lang": "ja",
              "rt": "Antoine",
              "rtlang": "fr"
            }
          }
        ],
        "text": "アントワーヌ"
},

With the above, I get a structure like:

<ruby lang="ja"><rb>ルネ=アントワーヌ</rb><rt rtlang="fr">René</rt></ruby>

I would like:

  1. to get rid of those <rb></rb> tags as they don’t mean anything,
  2. to get a recursive structure.

This would give:

<ruby lang="ja">ルネ<rt rtlang="fr">René</rt>=<rt rtlang="fr">-</rt>アントワーヌ<rt rtlang="fr">Antoine</rt></ruby>
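
For illustration, here is a minimal sketch of how text nodes in the desired JSON shape could be turned into that flat markup. It is a standalone string serializer written for this example, not ProseMirror's DOMSerializer, and the names escapeHTML and serializeRuby are hypothetical. It keeps the rtlang attribute name from the target markup above and merges consecutive nodes that share the same lang into one <ruby> element.

```javascript
// Escape characters that are special in HTML text and attribute values.
function escapeHTML(s) {
  return String(s)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// nodes: text nodes in the JSON shape shown above (hypothetical helper).
function serializeRuby(nodes) {
  let html = "";
  let open = false;   // is a <ruby> element currently open?
  let openLang = null; // lang of the currently open <ruby>
  for (const node of nodes) {
    const mark = (node.marks || []).find(m => m.type === "rubylang");
    // Close the open <ruby> when the mark ends or its lang changes.
    if (open && (!mark || mark.attrs.lang !== openLang)) {
      html += "</ruby>";
      open = false;
    }
    if (!mark) {
      html += escapeHTML(node.text);
      continue;
    }
    if (!open) {
      html += `<ruby lang="${escapeHTML(mark.attrs.lang)}">`;
      open = true;
      openLang = mark.attrs.lang;
    }
    // Base text first, then its annotation.
    html += `${escapeHTML(node.text)}<rt rtlang="${escapeHTML(mark.attrs.rtlang)}">${escapeHTML(mark.attrs.rt)}</rt>`;
  }
  if (open) html += "</ruby>";
  return html;
}

const rubyMark = (rt) => ({type: "rubylang", attrs: {lang: "ja", rt, rtlang: "fr"}});
const exampleNodes = [
  {type: "text", marks: [rubyMark("René")], text: "ルネ"},
  {type: "text", marks: [rubyMark("-")], text: "="},
  {type: "text", marks: [rubyMark("Antoine")], text: "アントワーヌ"},
];
console.log(serializeRuby(exampleNodes));
```

Doing the merge at serialization time means the document itself can keep one mark per base/annotation pair.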

I have read the docs as carefully as I could but I can’t really understand them in detail.

Thanks in advance for your help ☆

I don’t understand what you mean by recursive structure (I’m not seeing a nested <ruby> tag in your example) and you indeed can’t mix node content with other elements at a single level.

You could try making <rt> its own node, allowed as content of <ruby>, but getting the editing experience right for such a setup will require a bunch of scripting (to make sure editing on the node boundaries does the right thing and to guide users towards a valid structure).
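
A minimal sketch of what such a setup could look like at the schema level, assuming <rt> is an inline node that is only allowed inside <ruby>. The names rubyNodes, ruby and rt are illustrative, and the sketch covers only the node specs, not the editing behavior mentioned above.

```javascript
// Illustrative node specs (plain objects) that could be merged into the
// `nodes` part of a schema spec. Names and content expressions are assumptions.
const rubyNodes = {
  ruby: {
    inline: true,
    group: "inline",
    // Base text interleaved with rt annotation nodes.
    content: "(text | rt)*",
    attrs: {lang: {default: null}},
    parseDOM: [{tag: "ruby", getAttrs: dom => ({lang: dom.getAttribute("lang")})}],
    toDOM: node => ["ruby", {lang: node.attrs.lang}, 0],
  },
  rt: {
    inline: true,
    // No group: rt is only reachable through the content expression of ruby.
    content: "text*",
    attrs: {lang: {default: null}},
    parseDOM: [{tag: "rt", getAttrs: dom => ({lang: dom.getAttribute("lang")})}],
    toDOM: node => ["rt", {lang: node.attrs.lang}, 0],
  },
};

// Each toDOM has the content hole as the only child of its element.
console.log(JSON.stringify(rubyNodes.rt.toDOM({attrs: {lang: "fr"}})));
```

Because each node has its own content hole, no wrapper element is needed around the base text in this variant.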

Thank you very much, marijn, for taking the time to answer ☆

I don’t understand what you mean by recursive structure

By recursive structure, I mean that the innerHTML of <ruby> is fundamentally a repetition of text followed by its <rt></rt> annotation, like:

text1<rt>annotation1</rt>text2<rt>annotation2</rt> ... etc. ... etc. ... infinite

“Recursive” here means that the sequence can be arbitrarily long while always following the same pattern.

You could try making <rt> its own node

I did!

I have succeeded in making what I wanted as a node.

But I think it is not appropriate, both for semantic reasons and for verbosity:

<rt> elements are annotations, as they are mostly indications of pronunciation.

[In Japanese, as in other ideographic languages, readers might not know how to pronounce complex characters, so annotations are welcome.]

[Also, when writers literally transcribe the pronunciation of an uncommon word from another language into Japanese, readers might not figure out its original spelling, so annotations help disambiguate the original term it refers to. That is the case in the present example.]

As such, the annotation is not part of the text itself, in the same way that a link address is not part of its anchor text.

If I am not mistaken, in ProseMirror, links are treated as marks, aren’t they?

That is why I chose to move from a node implementation to a mark implementation.

I ignore <rt> tags [see: ignore: true]:

I want them to be attributes, like link addresses are for anchors.

From a size point of view, in a text that may have lots of <ruby> tags like this one, marks are smaller and easier to read.

They can also be coupled nicely with links or other marks.

To give a more precise idea, maybe <ruby> tags can be described as being structured like a kind of table.

As a solution, one may think of the <ruby> tag above as a recursion of 3 <ruby> tags:

<ruby lang="ja">ルネ<rt rtlang="fr">René</rt></ruby>
<ruby lang="ja">=<rt rtlang="fr">-</rt></ruby>
<ruby lang="ja">アントワーヌ<rt rtlang="fr">Antoine</rt></ruby>

from which the redundant </ruby><ruby lang="ja"> in the middle has been omitted.

Such a <ruby> tag would have to produce the output shown in my previous post [repeated below]:

{
    "type": "text",
    "marks": [
      {
        "type": "rubylang",
        "attrs": {
          "lang": "ja",
          "rt": "René",
          "rtlang": "fr"
        }
      }
    ],
    "text": "ルネ"
},
{
    "type": "text",
    "marks": [
      {
        "type": "rubylang",
        "attrs": {
          "lang": "ja",
          "rt": "-",
          "rtlang": "fr"
        }
      }
    ],
    "text": "="
},
{
    "type": "text",
    "marks": [
      {
        "type": "rubylang",
        "attrs": {
          "lang": "ja",
          "rt": "Antoine",
          "rtlang": "fr"
        }
      }
    ],
    "text": "アントワーヌ"
},

What I tried to do:

I tried to cut the text of the <ruby> tag’s innerHTML before the first <rt> tag to get [text1, annotation1], return it, then recurse on [text2, annotation2], and so on.

The problem is that I would have to recurse on <ruby>, which is not possible.

Therefore, I should probably:

  1. keep calling the mark rubylang, but
  2. target <rt> tags,
  3. store the parsed annotation text as a mark attribute called rt,
  4. find the text that immediately precedes the <rt> tag, starting after the previous <rt> tag or the opening <ruby> tag,
  5. take this text as the base text that the mark applies to.

What about that?
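
The pairing described in steps 2 to 5 can be sketched outside ProseMirror, on a plain list of the children of one <ruby> element. This is only a sketch under assumptions: the function name pairRuby and the {kind, text, lang} child objects are hypothetical, and wiring the result into parseDOM is a separate question that this example does not answer.

```javascript
// Hypothetical helper: pairs each run of base text with the <rt> that follows it.
// children: the child nodes of one <ruby>, flattened to plain objects:
//   {kind: "text", text}, {kind: "rt", text, lang}, {kind: "rp"}
function pairRuby(lang, children) {
  const out = [];
  let base = ""; // text collected since the previous <rt> (or the opening <ruby>)
  for (const child of children) {
    if (child.kind === "rp") continue; // fallback parentheses carry no content
    if (child.kind === "text") {
      base += child.text;
      continue;
    }
    // child.kind === "rt": annotate the collected base text, if there is any.
    if (base !== "") {
      out.push({
        type: "text",
        marks: [{type: "rubylang", attrs: {lang, rt: child.text, rtlang: child.lang || null}}],
        text: base,
      });
    }
    base = "";
  }
  // Trailing text without an annotation is kept as an unmarked text node.
  if (base !== "") out.push({type: "text", text: base});
  return out;
}

const rubyChildren = [
  {kind: "text", text: "ルネ"}, {kind: "rp"}, {kind: "rt", text: "René", lang: "fr"}, {kind: "rp"},
  {kind: "text", text: "="}, {kind: "rp"}, {kind: "rt", text: "-", lang: "fr"}, {kind: "rp"},
  {kind: "text", text: "アントワーヌ"}, {kind: "rp"}, {kind: "rt", text: "Antoine", lang: "fr"}, {kind: "rp"},
];
console.log(JSON.stringify(pairRuby("ja", rubyChildren), null, 2));
```

With the example children, this yields three marked text nodes, one per base/annotation pair, in the shape shown earlier.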

I don’t really see how to do that right now, which is why I am asking for help.

The solution must lie somewhere in the parent-child relationship between <ruby> and <rt>.

I have trouble figuring out from the docs what the syntax for toDOM can be; it would be clearer with examples that return strings and DOM objects rather than arrays [I could not find any].

As for editing, I am thinking of an implementation similar to what exists for links.

I’m not seeing a nested <ruby> tag in your example

That’s true, it’s not nested (title of this thread changed to disambiguate).

Right now I have the following schema:

<ruby> tags are nodes:

rubylang: {
  content: "inline*",
  group: "inline",
  inline: true,
  attrs: {lang: {default: null}},
  parseDOM: [{tag: "ruby[lang]", getAttrs(dom) { return {lang: dom.lang} }}, {tag: "rp", ignore: true}],
  toDOM(node) { let {lang} = node.attrs; return ["ruby", {lang}, 0] },
},

<rt> tags are marks:

rtlang: {
  attrs: {lang: {default: null}},
  inclusive: false,
  parseDOM: [{tag: "rt[lang]", getAttrs(dom) { return {lang: dom.lang} }}],
  toDOM(node) { let {lang} = node.attrs; return ["rt", {lang}, 0] },
},

[Please note the inclusive: false: from reading the docs I don’t understand exactly why I need it, but I considered that rt marks are similar to link marks and should therefore have it as well. I’d like to understand this point if possible.]
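
On the inclusive question: as far as the ProseMirror reference describes it, inclusive controls whether a mark is still active when the cursor sits at its end, so with inclusive: false, text typed right after an <rt> annotation does not become part of the annotation. The sketch below is a simplified model of that behavior, not ProseMirror's actual code; the function name marksForTypedText and the plain-object shapes are assumptions made for illustration, and marks are compared by type only.

```javascript
// Simplified model (hypothetical, not ProseMirror code): which marks would text
// typed at the boundary between `before` and `after` inherit?
function marksForTypedText(before, after, markSpecs) {
  const afterTypes = new Set(((after && after.marks) || []).map(m => m.type));
  return ((before && before.marks) || []).filter(
    // A non-inclusive mark only continues if the following text also carries it.
    m => markSpecs[m.type].inclusive !== false || afterTypes.has(m.type)
  );
}

const annotation = {text: "René", marks: [{type: "rtlang", attrs: {lang: "fr"}}]};
const nextBase = {text: "="};

// With inclusive: false, typing after "René" starts unmarked text.
console.log(marksForTypedText(annotation, nextBase, {rtlang: {inclusive: false}}).length); // 0
// With the default (inclusive), the typed text would extend the rt mark.
console.log(marksForTypedText(annotation, nextBase, {rtlang: {}}).length); // 1
```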

Here is the output I get:

{
        "type": "rubylang",
        "attrs": {
          "lang": "ja"
        },
        "content": [
          {
            "type": "text",
            "text": "ルネ"
          },
          {
            "type": "text",
            "marks": [
              {
                "type": "rtlang",
                "attrs": {
                  "lang": "fr"
                }
              }
            ],
            "text": "René"
          },
          {
            "type": "text",
            "text": "="
          },
          {
            "type": "text",
            "marks": [
              {
                "type": "rtlang",
                "attrs": {
                  "lang": "fr"
                }
              }
            ],
            "text": "-"
          },
          {
            "type": "text",
            "text": "アントワーヌ"
          },
          {
            "type": "text",
            "marks": [
              {
                "type": "rtlang",
                "attrs": {
                  "lang": "fr"
                }
              }
            ],
            "text": "Antoine"
          }
        ]
      },

I would like to write that I am satisfied … but I can’t help looking at the <a> anchor link example, so I would like to achieve a structure like this:

{
"type": "text",
"marks": [
  {
    "type": "rubylang",
    "attrs": {
      "lang": "ja",
      "rt": "René",
      "rtlang": "fr"
    }
  }
],
"text": "ルネ"
 },
{
"type": "text",
"marks": [
  {
    "type": "rubylang",
    "attrs": {
      "lang": "ja",
      "rt": "-",
      "rtlang": "fr"
    }
  }
],
"text": "="
},
{
"type": "text",
"marks": [
  {
    "type": "rubylang",
    "attrs": {
      "lang": "ja",
      "rt": "Antoine",
      "rtlang": "fr"
    }
  }
],
"text": "アントワーヌ"
},

This would be so easy by considering a recursive structure!

<ruby lang="ja">ルネ<rt rtlang="fr">René</rt></ruby>
<ruby lang="ja">=<rt rtlang="fr">-</rt></ruby>
<ruby lang="ja">アントワーヌ<rt rtlang="fr">Antoine</rt></ruby>

the <ruby> tag structure itself being just an abbreviated form of the above.

So if there is a way to implement that, I’d like to understand it and implement it!

The texts I’d like to process are full of such <ruby> tags…