This issue is carried over from an unanswered issue at w3c/ilreq#31
In the following i distinguish between consonant clusters and conjuncts, where the latter involves special shaping and the hiding of the VIRAMA, because that's where the difference lies afaict.
Section 2 of the ilreq doc, Indic orthographic syllable boundaries, contains a set of ABNF rules for indicating syllable boundaries, which are referred to for many applications, such as vertical text, line wrapping, initial-letter styling, etc. The examples include Tamil. However, with the exception of க்ஷ, ஶ்ரீ, and ஸ்ரீ, modern consonant clusters in Tamil don't form conjuncts in the same way as, say, Devanagari or Bengali. Instead, Tamil simply applies a pulli (virama) dot above the consonant without a following vowel, eg. கேட்டுக்.
Given a Tamil word such as யாவற்றையும் (yāvaṟṟaiyum), should the break points for text segmentation, line breaking, drop letter (if the cluster appeared at the start of the text), letter spacing in horizontal text, and vertical text representation conform to this:
A
or this?
B
The latter is what the ilreq document currently suggests.
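To make the two candidate behaviours concrete, here is a rough Python sketch. It is not the ilreq ABNF and not the UAX#29 algorithm, just an illustration under simplifying assumptions: the function `segment` and its `join_clusters` flag are hypothetical names, consonants are approximated by the code point range U+0B95..U+0BB9, and the two options are assumed to differ only in whether a consonant carrying a visible pulli is kept together with the following consonant.

```python
import unicodedata

PULLI = "\u0bcd"  # TAMIL SIGN VIRAMA


def is_mark(ch: str) -> bool:
    # Combining marks (vowel signs, pulli) attach to the preceding base.
    return unicodedata.category(ch) in ("Mn", "Mc")


def is_tamil_consonant(ch: str) -> bool:
    # Assumption: approximate the consonants by a code point range.
    return "\u0b95" <= ch <= "\u0bb9"


def segment(text: str, join_clusters: bool) -> list[str]:
    """Split text into units.

    join_clusters=False: every base letter starts a new unit, so a
    consonant with a pulli stands alone.
    join_clusters=True: a consonant following a pulli stays in the same
    unit, so a whole consonant cluster is one unit.
    """
    units: list[str] = []
    for ch in text:
        if units and is_mark(ch):
            units[-1] += ch
        elif (join_clusters and units
              and units[-1].endswith(PULLI)
              and is_tamil_consonant(ch)):
            units[-1] += ch
        else:
            units.append(ch)
    return units


# yāvaṟṟaiyum, spelled out as code points to avoid copy/paste damage
word = "\u0baf\u0bbe\u0bb5\u0bb1\u0bcd\u0bb1\u0bc8\u0baf\u0bc1\u0bae\u0bcd"

print(" | ".join(segment(word, join_clusters=False)))  # 6 units
print(" | ".join(segment(word, join_clusters=True)))   # 5 units
```

With `join_clusters=False` the doubled ṟṟ is split after the pulli. With `join_clusters=True` it stays in one unit with its vowel sign, which is the cluster-as-unit behaviour. Neither is meant as a recommendation; the sketch only shows where the two answers diverge.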
A similar question arises when fonts don't produce certain conjuncts in other scripts, for one reason or another, or where a ZWNJ is added to prevent a conjunct forming. Where are the break points for the following? Are they:
C
or
D
Given that for a more typical rendering of the text the break points, as described in the ilreq doc, would be:
E
UAX#29 currently doesn't produce E for Devanagari (which is what the ilreq doc requires). It produces something more like C. But UAX#29 is about to change, so that by default a whole consonant cluster will be seen as a unit (ie. E). The effect of that upcoming change is not completely clear, however, for scripts like Tamil, or Devanagari when the virama is showing. I'm looking for someone to provide expert advice for what would be expected in those situations.
A reliance on the shape of the text is not described in the ilreq document, which i think is problematic. (It's also problematic for the general concept of grapheme clusters in Unicode, which should count as one unit the whole of a conjunct sequence such as ஶ்ரீ but not a Tamil consonant cluster such as த்தை.)