Implementing annotations for ideographic languages with the HTML <ruby> tag

tidiview · January 5, 2021, 11:09pm

Hello everyone,

Thanks for this exceptional library.

I’ve been playing around: it’s great!

Here’s my question:

I’m trying to deal with the <ruby> tag (see here for more information).

I first treated them successfully as nodes but now I think marks would be better.

I have done most of the work, but I’d like to finish it properly.

I have a structure like:

 <ruby lang="ja">ソムヌス<rt lang="la">Somnus</rt></ruby>

with a schema like:

rubylang: {
      attrs: {
        lang: {default: null},
        rt: {default: null},
        rtlang: {default: null},
      },
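      parseDOM: [
        {tag: "ruby[lang]", getAttrs(dom) {
          // Only the first <rt lang> is read; guard against a <ruby> that has none.
          const rt = dom.querySelector("rt[lang]")
          return {lang: dom.lang, rt: rt ? rt.innerHTML : null, rtlang: rt ? rt.lang : null}
        }},
        {tag: "rt", ignore: true},
      ],
      // 0 is the content hole; it has to be the only child of its parent, hence the <rb> wrapper.
      toDOM(mark) { const {lang, rt, rtlang} = mark.attrs; return ["ruby", {lang}, ["rb", 0], ["rt", {rtlang}, rt]] },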
    },

Here is the output:

{
            "type": "text",
            "marks": [
              {
                "type": "rubylang",
                "attrs": {
                  "lang": "ja",
                  "rt": "Somnus",
                  "rtlang": "la"
                }
              }
            ],
            "text": "ソムヌス"
},

I am very happy with this!

My question is about the toDOM method.

With the above, I get a structure like:

<ruby lang="ja"><rb>ソムヌス</rb><rt rtlang="la">Somnus</rt></ruby>

I would like to get rid of those <rb></rb> tags as they don’t mean anything.

I have read the docs as carefully as I could but I can’t really understand them in detail.

Thanks in advance for your help ☆

kepta · January 6, 2021, 9:08am

You can do something like this:

toDOM: () => [
        'ruby',
        { lang: 'jojo' },
        ['span', 0],
        ['rt', { rtlang: 'ji' }, '1'],
      ]

But as you can tell, ProseMirror enforces the rule "Content hole must be the only child of its parent node", so you will have to wrap the 0 in a span or something more appropriate.
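
To make that rule concrete, here is a minimal sketch in plain JavaScript of how an array-style output spec maps to markup. It is a simplified stand-in written for this thread, not ProseMirror's own serializer: the helper name specToHtml and the choice to build an HTML string instead of DOM nodes are assumptions for illustration, and no HTML escaping is done. It only mimics the check that rejects a content hole that has siblings.

```javascript
// Hypothetical helper: turns an array-style output spec into an HTML string.
// `content` is the already-rendered inner content that fills the hole (0).
function specToHtml(spec, content) {
  if (typeof spec === "string") return spec; // plain text child
  const [tag, ...rest] = spec;
  // An optional attribute object may follow the tag name.
  let attrs = {};
  if (rest.length && typeof rest[0] === "object" && rest[0] !== null && !Array.isArray(rest[0])) {
    attrs = rest.shift();
  }
  const attrText = Object.entries(attrs)
    .filter(([, value]) => value != null)
    .map(([name, value]) => ` ${name}="${value}"`)
    .join("");
  const inner = rest.map((child) => {
    if (child === 0) {
      // Mimics the rule quoted above: the hole cannot have siblings.
      if (rest.length > 1) {
        throw new RangeError("Content hole must be the only child of its parent node");
      }
      return content;
    }
    return specToHtml(child, content);
  }).join("");
  return `<${tag}${attrText}>${inner}</${tag}>`;
}

// The spec from the first post: the hole is alone inside <rb>, so it renders.
const wrapped = specToHtml(
  ["ruby", { lang: "ja" }, ["rb", 0], ["rt", { rtlang: "la" }, "Somnus"]],
  "ソムヌス"
);

// The spec one would like to write: the hole sits next to <rt>, so it is rejected.
let holeError = null;
try {
  specToHtml(["ruby", { lang: "ja" }, 0, ["rt", { rtlang: "la" }, "Somnus"]], "ソムヌス");
} catch (error) {
  holeError = error;
}
```

In this sketch the first call yields the <rb>-wrapped markup and the second one fails, which is the same trade-off described above: the base text needs a wrapper element of its own as long as <rt> is emitted by the same toDOM.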

tidiview · January 6, 2021, 9:55am

Thank you very much for your reply but this is not a solution.

You’re just replacing the <rb> tag with a <span> one!

I’d like to have:

<ruby lang="ja">ソムヌス<rt rtlang="la">Somnus</rt></ruby>

There must be a way to achieve that.

tidiview · January 6, 2021, 10:06am

OK I just realized that I have another problem.

Please see the updated post below.

It reflects exactly the problem I’d like to solve.

Sorry for that.

tidiview · January 6, 2021, 10:21am

Hello everyone,

Thanks for this exceptional library.

I’ve been playing around: it’s great!

Here’s my question:

I’m trying to deal with the <ruby> tag (see here for more information). I first treated them successfully as nodes, but now I think marks would be better. I have done most of the work, but I’d like to finish it properly.

I have a structure like:

 <ruby lang="ja">ルネ<rp>(</rp><rt lang="fr">René</rt><rp>)</rp>＝<rp>(</rp><rt lang="fr">-</rt><rp>)</rp>アントワーヌ<rp>(</rp><rt lang="fr">Antoine</rt><rp>)</rp></ruby>

with a schema like:

rubylang: {
      attrs: {
        lang: {default: null},
        rt: {default: null},
        rtlang: {default: null},
      },
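      parseDOM: [
        {tag: "ruby[lang]", getAttrs(dom) {
          // Only the first <rt lang> is read, so later annotations in the same <ruby> are lost.
          const rt = dom.querySelector("rt[lang]")
          return {lang: dom.lang, rt: rt ? rt.innerHTML : null, rtlang: rt ? rt.lang : null}
        }},
        {tag: "rt", ignore: true},
      ],
      // 0 is the content hole; it has to be the only child of its parent, hence the <rb> wrapper.
      toDOM(mark) { const {lang, rt, rtlang} = mark.attrs; return ["ruby", {lang}, ["rb", 0], ["rt", {rtlang}, rt]] },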
    },

Here is the output:

 {
        "type": "text",
        "marks": [
          {
            "type": "rubylang",
            "attrs": {
              "lang": "ja",
              "rt": "René",
              "rtlang": "fr"
            }
          }
        ],
        "text": "ルネ＝アントワーヌ"
},

I am not happy with this!

I would like something like:

 {
        "type": "text",
        "marks": [
          {
            "type": "rubylang",
            "attrs": {
              "lang": "ja",
              "rt": "René",
              "rtlang": "fr"
            }
          }
        ],
        "text": "ルネ"
},
 {
        "type": "text",
        "marks": [
          {
            "type": "rubylang",
            "attrs": {
              "lang": "ja",
              "rt": "-",
              "rtlang": "fr"
            }
          }
        ],
        "text": "＝"
},
 {
        "type": "text",
        "marks": [
          {
            "type": "rubylang",
            "attrs": {
              "lang": "ja",
              "rt": "Antoine",
              "rtlang": "fr"
            }
          }
        ],
        "text": "アントワーヌ"
},

with the above I get a structure like:

<ruby lang="ja"><rb>ルネ＝アントワーヌ</rb><rt rtlang="fr">René</rt></ruby>

I would like:

to get rid of those <rb></rb> tags as they don’t mean anything,
to get a recursive structure.

This would give:

<ruby lang="ja">ルネ<rt rtlang="fr">René</rt>＝<rt rtlang="fr">-</rt>アントワーヌ<rt rtlang="fr">Antoine</rt></ruby>

I have read the docs as carefully as I could but I can’t really understand them in detail.

Thanks in advance for your help ☆

marijn · January 6, 2021, 10:59am

I don’t understand what you mean by recursive structure (I’m not seeing a nested <ruby> tag in your example) and you indeed can’t mix node content with other elements at a single level.

You could try making <rt> its own node, allowed as content of <ruby>, but getting the editing experience right for such a setup will require a bunch of scripting (to make sure editing on the node boundaries does the right thing and to guide users towards a valid structure).
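
For reference, that suggestion could be sketched roughly as follows. This is only an illustration of the shape such node specs might take: the node names ruby and rt, the content expressions, and the decision to keep rt out of the inline group are assumptions, and the editing behaviour at node boundaries would still need the extra scripting mentioned above.

```javascript
// Sketch only: node specs in the shape ProseMirror's Schema constructor accepts.
// The node names `ruby` and `rt` and the content expressions are assumptions.
const rubyNodeSpecs = {
  ruby: {
    inline: true,
    group: "inline",
    content: "(text | rt)*", // base text interleaved with annotation nodes
    attrs: { lang: { default: null } },
    parseDOM: [{ tag: "ruby", getAttrs: (dom) => ({ lang: dom.getAttribute("lang") }) }],
    toDOM: (node) => ["ruby", { lang: node.attrs.lang }, 0],
  },
  rt: {
    inline: true, // not in the "inline" group, so it is only allowed inside ruby
    content: "text*",
    attrs: { lang: { default: null } },
    parseDOM: [{ tag: "rt", getAttrs: (dom) => ({ lang: dom.getAttribute("lang") }) }],
    toDOM: (node) => ["rt", { lang: node.attrs.lang }, 0],
  },
};
```

Because each node renders only its own element with a single content hole, no <rb> wrapper is needed here; the cost is that the annotation becomes document content instead of an attribute.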

tidiview · January 6, 2021, 1:45pm

Thank you very much marijn for taking time for an answer ☆

I don’t understand what you mean by recursive structure

By recursive structure, I mean that the innerHTML of <ruby> is fundamentally a recursion of text followed by its <rt></rt> annotation tags like:

text1<rt>annotation1</rt>text2<rt>annotation2</rt> ... etc. ... etc. ... infinite

recursive here means “it is possibly infinite” but following the same structure.

You could try making <rt> its own node

I did!

I have succeeded in making what I wanted as a node.

But I think it is not appropriate, both for semantic reasons and for verbosity:

<rt> elements are annotations, mostly indications of pronunciation.

[In Japanese, as in other ideographic languages, readers might not know how to pronounce complex characters, so annotations are welcome.]

[Also, when writers transcribe the pronunciation of an uncommon word from another language literally into Japanese, readers might not be able to work out its original spelling, so annotations help to disambiguate the original term it refers to. That is the case in the present example.]

As such, the annotation is not part of the text itself, in the same way that a link address is not part of its anchor text.

If I am not mistaken, in ProseMirror, links are treated as marks, aren’t they?

That is why I chose to move from a node implementation to a mark implementation.

I ignore <rt> tags [see: ignore: true]:

I want them to be attributes, as link addresses are for anchors.

From a size point of view, in a text that may have lots of <ruby> tags like this one, marks are smaller and easier to read.

They can also be coupled nicely with links or other marks.

To give a more precise idea, <ruby> tags can perhaps be described as being structured like a kind of table.

As a solution, one may think of the <ruby> tag as a recursion of 3 <ruby> tags:

<ruby lang="ja">ルネ<rt rtlang="fr">René</rt></ruby>
<ruby lang="ja">＝<rt rtlang="fr">-</rt></ruby>
<ruby lang="ja">アントワーヌ<rt rtlang="fr">Antoine</rt></ruby>

from which the redundant </ruby><ruby lang="ja"> in the middle has been omitted.
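
One way to act on this idea, sketched here under stated assumptions, is to expand the markup before it ever reaches the parser, so that every <ruby> holds exactly one base/annotation pair. The helper name splitRuby is hypothetical, and the regular expressions assume flat markup (base text plus <rt> and <rp> only, no nested elements); a DOM-based version would be more robust.

```javascript
// Hypothetical preprocessing helper: rewrites one multi-pair <ruby> element into
// a sequence of single-pair <ruby> elements, using plain string handling.
function splitRuby(html) {
  return html.replace(/<ruby([^>]*)>([\s\S]*?)<\/ruby>/g, (whole, rubyAttrs, inner) => {
    // Drop the <rp> fallback parentheses; they carry no content of their own.
    const cleaned = inner.replace(/<rp[^>]*>[\s\S]*?<\/rp>/g, "");
    // Each match is some base text followed by its <rt ...>annotation</rt>.
    const pairPattern = /([^<]*)(<rt[^>]*>[\s\S]*?<\/rt>)/g;
    const pieces = [];
    let consumed = 0;
    let match;
    while ((match = pairPattern.exec(cleaned)) !== null) {
      pieces.push(`<ruby${rubyAttrs}>${match[1]}${match[2]}</ruby>`);
      consumed = pairPattern.lastIndex;
    }
    // Leave the element untouched if no base/annotation pair was found.
    if (pieces.length === 0) return whole;
    // Base text after the last <rt> has no annotation, so it stays outside <ruby>.
    return pieces.join("") + cleaned.slice(consumed);
  });
}

const expanded = splitRuby(
  '<ruby lang="ja">ルネ<rp>(</rp><rt lang="fr">René</rt><rp>)</rp>＝<rp>(</rp><rt lang="fr">-</rt><rp>)</rp></ruby>'
);
```

With the input expanded this way, a parse rule that reads a single <rt> per <ruby> would see each pair separately.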

A <ruby> tag would have to produce the output shown in my previous post [repeated below]:

{
    "type": "text",
    "marks": [
      {
        "type": "rubylang",
        "attrs": {
          "lang": "ja",
          "rt": "René",
          "rtlang": "fr"
        }
      }
    ],
    "text": "ルネ"
},
{
    "type": "text",
    "marks": [
      {
        "type": "rubylang",
        "attrs": {
          "lang": "ja",
          "rt": "-",
          "rtlang": "fr"
        }
      }
    ],
    "text": "＝"
},
{
    "type": "text",
    "marks": [
      {
        "type": "rubylang",
        "attrs": {
          "lang": "ja",
          "rt": "Antoine",
          "rtlang": "fr"
        }
      }
    ],
    "text": "アントワーヌ"
},

What I tried to do:

I tried to trim the text of the <ruby> tag’s innerHTML before the first <rt> tag to get [text1, annotation1], return it, then recurse on [text2, annotation2], and so on.

The problem is that I would have to recurse on <ruby>, which is not possible.

Therefore, I should probably:

call the mark rubylang but
target <rt> tags,
set the parsed text as a mark attribute called rt,
find the text immediately before the <rt> tag, after the preceding <ruby> or <rt> tag,
take this text and use it as the base text for the mark.
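
The steps above can be sketched independently of the parser, as a walk over the children of one <ruby> element. The input shape (a flat list of { kind, text, lang } entries with <rp> already dropped) and the function name are hypothetical; the output follows the JSON shown earlier in this post.

```javascript
// Hypothetical input: the children of one <ruby>, flattened to
// { kind: "text", text } and { kind: "rt", text, lang } entries.
function rubyChildrenToTextNodes(rubyLang, children) {
  const nodes = [];
  let pendingBase = ""; // text seen since the previous <rt> (or since <ruby> opened)
  for (const child of children) {
    if (child.kind === "text") {
      pendingBase += child.text;
    } else if (child.kind === "rt") {
      // The <rt> closes the current pair: its text becomes the mark's `rt` attribute.
      // An <rt> with no base text before it is skipped, since text nodes cannot be empty.
      if (pendingBase) {
        nodes.push({
          type: "text",
          marks: [{ type: "rubylang", attrs: { lang: rubyLang, rt: child.text, rtlang: child.lang } }],
          text: pendingBase,
        });
      }
      pendingBase = "";
    }
  }
  // Base text with no annotation after it is kept as an unmarked text node.
  if (pendingBase) nodes.push({ type: "text", text: pendingBase });
  return nodes;
}

const pairedNodes = rubyChildrenToTextNodes("ja", [
  { kind: "text", text: "ルネ" },
  { kind: "rt", text: "René", lang: "fr" },
  { kind: "text", text: "＝" },
  { kind: "rt", text: "-", lang: "fr" },
  { kind: "text", text: "アントワーヌ" },
  { kind: "rt", text: "Antoine", lang: "fr" },
]);
```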

What about that?

I don’t really see how I can do that right now, which is why I am asking for help.

The solution must be somewhere in the parent-child relationship between <ruby> and <rt>.

I have trouble figuring out from the docs what the syntax for toDOM can be; it would be clearer with examples that use strings and DOM objects rather than arrays [I could not find any].

About editing, I am thinking of an implementation similar to what exists for links.

tidiview · January 6, 2021, 6:46pm

I’m not seeing a nested <ruby> tag in your example

That’s true, it’s not nested (title of this thread changed to disambiguate).

Right now I have the following schema:

<ruby> tags are nodes.

rubylang: {
  content: "inline*",
  group: "inline",
  inline: true,
  attrs: {lang: {default: null}},
  parseDOM: [{tag: "ruby[lang]", getAttrs(dom) { return {lang: dom.lang} }}, {tag: "rp", ignore: true}],
  toDOM(node) { let {lang} = node.attrs; return ["ruby", {lang}, 0] },
},

<rt> tags are marks

rtlang: {
  attrs: {lang: {default: null}},
  inclusive: false,
  parseDOM: [{tag: "rt[lang]", getAttrs(dom) { return {lang: dom.lang} }}],
  toDOM(node) { let {lang} = node.attrs; return ["rt", {lang}, 0] },
},

[Please note the inclusive: false. From reading the docs I don’t understand exactly why I need it, but I considered rt marks to be similar to link marks, so they should have it too. I’d like to understand this point if possible.]
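
On the inclusive point: as far as the ProseMirror reference describes it, inclusive controls whether a mark stays active when the cursor sits at its end, and it defaults to true. Link-like marks usually set it to false so that text typed right after the marked range is not pulled into it. The sketch below is a simplified plain-JavaScript model of that behaviour, not the library's code; the function names and data shapes are hypothetical, and it ignores the special case at the start of a parent node.

```javascript
// Two marks count as the same when type and attributes match.
function sameMark(a, b) {
  return a.type === b.type && JSON.stringify(a.attrs) === JSON.stringify(b.attrs);
}

// Simplified model: which marks would text typed at a boundary inherit?
// `before` / `after` are the mark lists of the text on each side of the cursor;
// `specs` maps a mark name to its spec (only `inclusive` is looked at).
function marksForTypedText(before, after, specs) {
  return before.filter((mark) => {
    const inclusive = specs[mark.type].inclusive !== false; // defaults to true
    // A non-inclusive mark only continues if the text after the cursor has it too.
    const continues = after.some((other) => sameMark(mark, other));
    return inclusive || continues;
  });
}

const rtMark = { type: "rtlang", attrs: { lang: "fr" } };
// Cursor right after "René", followed by unmarked base text.
const typedAfterRt = marksForTypedText([rtMark], [], { rtlang: { inclusive: false } });
const typedAfterRtInclusive = marksForTypedText([rtMark], [], { rtlang: {} });
```

In this model, typing at the end of an annotation with inclusive: false produces unmarked base text, while the default would keep extending the annotation.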

Here is the output I get:

{
        "type": "rubylang",
        "attrs": {
          "lang": "ja"
        },
        "content": [
          {
            "type": "text",
            "text": "ルネ"
          },
          {
            "type": "text",
            "marks": [
              {
                "type": "rtlang",
                "attrs": {
                  "lang": "fr"
                }
              }
            ],
            "text": "René"
          },
          {
            "type": "text",
            "text": "＝"
          },
          {
            "type": "text",
            "marks": [
              {
                "type": "rtlang",
                "attrs": {
                  "lang": "fr"
                }
              }
            ],
            "text": "-"
          },
          {
            "type": "text",
            "text": "アントワーヌ"
          },
          {
            "type": "text",
            "marks": [
              {
                "type": "rtlang",
                "attrs": {
                  "lang": "fr"
                }
              }
            ],
            "text": "Antoine"
          }
        ]
      },

I would like to say I am satisfied … but I can’t help looking at the <a> anchor link example, so I would like to achieve a structure like this:

{
    "type": "text",
    "marks": [
      {
        "type": "rubylang",
        "attrs": {
          "lang": "ja",
          "rt": "René",
          "rtlang": "fr"
        }
      }
    ],
    "text": "ルネ"
},
{
    "type": "text",
    "marks": [
      {
        "type": "rubylang",
        "attrs": {
          "lang": "ja",
          "rt": "-",
          "rtlang": "fr"
        }
      }
    ],
    "text": "＝"
},
{
    "type": "text",
    "marks": [
      {
        "type": "rubylang",
        "attrs": {
          "lang": "ja",
          "rt": "Antoine",
          "rtlang": "fr"
        }
      }
    ],
    "text": "アントワーヌ"
},

This would be so easy if one considered a recursive structure!

<ruby lang="ja">ルネ<rt rtlang="fr">René</rt></ruby>
<ruby lang="ja">＝<rt rtlang="fr">-</rt></ruby>
<ruby lang="ja">アントワーヌ<rt rtlang="fr">Antoine</rt></ruby>

the <ruby> tag structure itself being just an abbreviation of the above.

So if there is a way to implement that, I’d like to understand how to implement it!

The texts I’d like to consume are full of such <ruby> tags…
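
For the output direction, here is a minimal sketch of how the mark-based JSON above could be folded back into the abbreviated single-<ruby> form. It works on the plain JSON, outside of any toDOM; the function name textNodesToRubyHtml is hypothetical, the rtlang attribute name follows the markup shown in this thread, and no HTML escaping is done.

```javascript
// Hypothetical serializer: consecutive text nodes whose rubylang marks share
// `lang` are emitted inside one <ruby> element, each followed by its <rt>.
function textNodesToRubyHtml(nodes) {
  let html = "";
  let open = false; // whether a <ruby> element is currently open
  let openLang = null; // lang of the open <ruby>
  for (const node of nodes) {
    const mark = (node.marks || []).find((m) => m.type === "rubylang");
    // Close the current <ruby> when the run ends or the language changes.
    if (open && (!mark || mark.attrs.lang !== openLang)) {
      html += "</ruby>";
      open = false;
    }
    if (!mark) {
      html += node.text; // plain text stays outside <ruby>
      continue;
    }
    if (!open) {
      html += `<ruby lang="${mark.attrs.lang}">`;
      open = true;
      openLang = mark.attrs.lang;
    }
    html += `${node.text}<rt rtlang="${mark.attrs.rtlang}">${mark.attrs.rt}</rt>`;
  }
  if (open) html += "</ruby>";
  return html;
}

const rubyMark = (rt) => ({ type: "rubylang", attrs: { lang: "ja", rt, rtlang: "fr" } });
const mergedHtml = textNodesToRubyHtml([
  { type: "text", marks: [rubyMark("René")], text: "ルネ" },
  { type: "text", marks: [rubyMark("-")], text: "＝" },
  { type: "text", marks: [rubyMark("Antoine")], text: "アントワーヌ" },
]);
```

The design choice here is to keep one mark per base/annotation pair in the document and to treat the merged <ruby> element purely as a serialization detail.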