Expansion: IPA and Combining Diacritcals to cover more Latin-based African languages

Bug #670758 reported by Denis Moyogo Jacquerye
This bug affects 1 person
Affects: fonts-ubuntu (Ubuntu)
Status: Confirmed
Importance: Wishlist
Assigned to: Unassigned
Milestone: none

Bug Description

Many characters used in the orthographies of African languages are currently missing from the Ubuntu font family.
Some are actually present in uppercase, for example Ɛ U+0190 is present but ɛ U+025B is missing.
Combining diacritics (U+0300, etc. and U+1DC4...1DC7) are also needed as many African languages use them.

For the combining diacritics, OpenType GPOS features need to be added for correct positioning.

Examples:
ɛ U+025B and ɔ U+0254 are used in the official alphabets of Benin, Burkina Faso, Mali, Tchad, and Cameroon, as well as other alphabets.
U+0300 is used in the Pan-Nigerian alphabet (in Yoruba ẹ́ is used), or in any language using accented characters that are not encoded in Unicode in precomposed form.
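
The point about accented characters lacking a precomposed form can be checked with Unicode NFC normalization. The following is only a minimal sketch, assuming Python's standard unicodedata module; the helper name has_precomposed_form is hypothetical and not part of any font tool.

```python
import unicodedata


def has_precomposed_form(base: str, marks: str) -> bool:
    """Return True if NFC composes base + combining marks into one code point."""
    composed = unicodedata.normalize("NFC", base + marks)
    return len(composed) == 1


# e + combining acute (U+0301) composes to a single precomposed letter.
print(has_precomposed_form("e", "\u0301"))
# Open e (U+025B) + combining acute has no precomposed form.
print(has_precomposed_form("\u025b", "\u0301"))
# e with dot below (U+1EB9) + combining acute, as in Yoruba, has none either.
print(has_precomposed_form("\u1eb9", "\u0301"))
```

Sequences for which this returns False can only be rendered well if the font carries the combining mark itself and positions it correctly over the base glyph.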

The ANLoc (African localisation) project has a list of characters used in orthographies of African languages:
http://anloc-fonts.git.sourceforge.net/git/gitweb.cgi?p=anloc-fonts/anloc-fonts;a=blob_plain;f=data/charlist.txt;hb=HEAD
and a list of accented characters (not available as precomposed characters, i.e. using combining diacritics)
http://anloc-fonts.git.sourceforge.net/git/gitweb.cgi?p=anloc-fonts/anloc-fonts;a=blob_plain;f=data/comblist.txt;hb=HEAD

Paul Sladen (sladen) wrote :

This bug is quite broad and so quite hard to deal with. Of course the desire for the big goal (support for everything ever) is laudable, but to actually make that happen we'll have to break it down and tackle it in much smaller bite-size chunks.

From an end-users' point of view, the desire will be to add the glyphs that a particular script/language requires, but from an implementors' viewpoint, Dalton Maag have a preference for ensuring that whole Unicode blocks are included in one go in order to ensure harmony across the typeface.

I'll go through the codepoints that you've specifically raised and the blocks that they are in. Currently the UFF includes Latin A+B, but not some of the other blocks:

  Ɛ (U+0190), Latin Extended B: done
  ɔ, ɛ (U+0254, U+025B), IPA Extensions
  x̀..xͯ (U+0300..U+036F), Combining Diacritical Marks
  x᷄..x᷇ (U+1DC4..U+1DC7), Combining Diacritical Marks Supplement

Gaining coverage by way of doing full-blocks means:

  IPA Extensions: 95 glyphs, ~45 straight composites of existing Latin/Greek
  Combining Diacritical Marks: 111 glyphs, ~100 straight composites, placement work
  Combining Diacritical Marks Supplement: 43 glyphs

For the latter, 36 codepoints are already grabbed by the Ubuntu Font Family (drawn as the unknown numbered square glyph), but do not appear to have a glyph associated. See the coverage map from Unicode for these:

  http://unicode.org/charts/PDF/U1DC0.pdf

tags: added: uff-diacritical uff-ipa uff-latin
summary: - Add support for African languages
+ Expansion: IPA and Combining Diacritcals to cover more Latin-based
+ African languages
Changed in ubuntu-font-family:
importance: Undecided → Wishlist
milestone: none → later
status: New → Incomplete
Denis Moyogo Jacquerye (moyogo) wrote :

Working by blocks is one way to prioritize, like any other.
However, getting only one case of a letter is problematic, especially for harmony purposes.

In any case, there are only 61 African Latin characters missing from Ubuntu, which seems far fewer than completing the Unicode blocks that include them would require. Of those, a few glyphs are already present and some differ only slightly from existing ones (reversed, or with a hook).
Many of the Latin characters in the blocks shared with African Latin characters come from historical orthographies.

What's the plan for combining diacritics positioning?

How do I go about contributing glyphs or positioning features?

Bruno Maag (bruno-daltonmaag) wrote :

There is a reason for updating the fonts in entire language, script and/or Unicode blocks: it allows a controlled and harmonised update of the font family. Remember that it's not only one font that needs this done; in the next issue of the system there will be 13. By doing it in agreed blocks we can also clearly communicate what languages and scripts are supported in the font. Adding just a number of characters within a block will eventually lead to a chaotic descent of the font development.

There are also technical reasons for ensuring the updates are harmonised and controlled. More complex scripts such as Arabic and Hebrew contain intricate mark positionings in the font, all laboriously put together in VOLT. Unfortunately, when the OpenType specifications were developed (going as far back as 1995, or even further, btw) no one anticipated the advent of fonts containing a multitude of scripts, and certainly not an environment where the fonts would be updated over a period of time by a variety of people, such as in the Ubuntu project. So, all mark positionings are glyph-order dependent, which means that if you were to add just one Latin glyph before the Arabic/Hebrew (or other) block, the entire functionality falls apart. The same goes for adding the positionings as you suggest above, if you plan to add them via GSUB/GPOS features.

I understand that the lack of some glyphs is frustrating, and I wish we could have given you a font that contains everything imaginable - way over 100,000 glyphs in the font - right from the start. But that would have taken quite some more time, even on just one font style. So, until we have worked out a priority list for adding further language support I beg you to be patient. We're doing everything we can to make sure as much as possible is covered by the time spring comes.

Denis Moyogo Jacquerye (moyogo) wrote :

Don't get me wrong. It's great to have those letters even if it means having some of those now and some later.
I'm not complaining that you should complete them faster.

The characters list I provided is just there so you can work on glyphs that are actually used in languages' orthographies, as opposed to others that are there for historical purposes.
Just like it might be useful to prioritize current IPA characters before obsolete ones.

This could be one way to prioritize or harmonize glyph designs. Working by Unicode blocks works too, it's just an arbitrary order. Had every uppercase and lowercase pair been in the same block, it wouldn't be an issue.

As for positioning features, I don't think OpenType features have to be added in a specific order; it is your tools that force you to. Using FontForge in a collaborative font project, OpenType features can be added in any order, and characters can be added in any order too; the tool handles it.

If I understand correctly there is no way I can contribute OpenType features.

Mark Shuttleworth (sabdfl) wrote : Re: [Bug 670758] Re: Expansion: IPA and Combining Diacritcals to cover more Latin-based African languages

Denis, if you would like to spin up an effort to find a typographer who
could collaborate on the whole block, we can accommodate that. But Bruno
is right - we don't want bits and pieces, we want to do it in concrete
chunks with (semi)-professional local insight.

Mark

Ben Laenen (benlaenen) wrote :

Prioritizing to get certain languages into the font at once makes sense. Prioritizing to get only full Unicode blocks does not.

These blocks are often just a collection of different sets of characters that are unrelated to each other. Suppose someone wants to write in the Dan language. This language uses the letter U+A78D, which happens to sit in the Latin-D block. So, you are saying that this letter can only get in if you also add things like the Egyptological, Mayanist and Medievalist additions, or old Latvian letters, since they are also found in Latin-D?

I understand that you don't want bits and pieces, but Denis isn't asking for that. Adding bits and pieces would be like adding three Arabic letters and no more. Of course that's never useful. But we are talking about adding a few extra Latin letters that would make the font available to millions of people (just add ɛ and ɔ for example and you have support for Lingala, spoken by 10 million people). And these letters need to be harmonized with the Latin letters already in the font, not with the rest of their respective blocks. In fact, one would say that these letters need to be harmonized with the other letters in their block, as much as with all other letters of the same script found in all other blocks (e.g. that hook found on letters in Latin-B can best look the same as that hook in some letters in Latin-C).

About "Adding just a number of characters within a block will eventually lead to a chaotic descent of the font development.": I can't really believe that a professional font foundry would have trouble communicating internally to every developer which letters are going into the font.

As for the external communication to let everyone know what you support: your average user will likely understand the phrase "most African languages in Latin script are supported" better than "has support for Unicode block Latin-C". Just adding full blocks makes it easier to give a summary of what you have of course, but you don't need to tell exactly what you support in your summaries anyway. You can have more extensive information somewhere else. At DejaVu Fonts we have scripts that generate lists like http://dejavu.svn.sourceforge.net/viewvc/*checkout*/dejavu/trunk/dejavu-fonts/langcover.txt for example where you can immediately see if the language of your interest is supported or not.

Mark Shuttleworth (sabdfl) wrote :

Hmm... Ben that was well articulated, thank you. Bruno, I think it
warrants diving into the necessary tool support for this, because
engaging with communities will mean engaging with language and
culture-defined subsets of the blocks.

Mark

David Marshall (dave-daltonmaag) wrote :

I can see and appreciate the arguments in both directions for this. I
think the problem we're going to face - whichever approach we choose -
is keeping the different fonts in synch for character coverage,
especially as we're effectively aiming to crowd-source our glyph designs.

In many cases, however, "filling" a Unicode block once you have a few
key glyph designs is going to be very straightforward as there's a hell
of a lot of composites out there.

Dave

Paul Sladen (sladen) wrote :

At the moment in the Ubuntu Font Family 0.69 release we are already shipping with a number of incomplete/partial Unicode blocks, where the current glyphset covered is only a subset of the Unicode listing for the full block:

  Alphabetic Presentation Forms
  Currency Symbols
  Cyrillic
  General Punctuation
  Greek and Coptic
  Superscripts and Subscripts

I'm inclined to believe that we don't want to set a hard rule either way (purely blocks, or purely scripts), but, just like in the case of the music notes for DVD subtitles (bug #655350) we work out on a case-by-case basis what is a /sensible/ set, depending on what the interactions are.

We want full, coherent contributions, rather than odds-and-sods, but at the same time, needing to cover 256 glyphs when instead 50 could be prioritised is something I don't think we want to turn down.

As Ben notes, the selling point is ticking off coverage of use-cases (scripts, or geographic regions), and these (except for the complex scripts) don't tend to match up one-for-one with blocks.

Denis Moyogo Jacquerye (moyogo) wrote :

For the DejaVu font project, the only restriction regarding coverage of additions is that one can only add a character if also adding its equivalent in another (if it is encoded), and doing so in all styles of the same typeface.
So there are never issues like having Ɛ but not ɛ.

We don't care about Unicode blocks, they are arbitrary once you go beyond scripts, but rather we care that characters are usable when added, not when the whole script is done.

If the policy for Ubuntu fonts is to have full Unicode blocks, that's fine. But if you don't follow it yourselves, I don't suppose I'd have to either.

Denis Moyogo Jacquerye (moyogo) wrote :

I missed a word, please read:
For the DejaVu font project, the only restriction regarding coverage of additions is that one can only add a character if also adding its equivalent in another case (if it is encoded), and doing so in all styles of the same typeface.
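
The case-pairing rule described here can be expressed as a simple check over a coverage set. This is only a sketch, assuming Python's built-in case mappings; the function name missing_case_partners and the sample coverage set are hypothetical and do not come from the DejaVu or Ubuntu build scripts.

```python
def missing_case_partners(covered):
    """Return sorted code points whose case counterpart is covered but which are not."""
    missing = set()
    for cp in covered:
        ch = chr(cp)
        for partner in (ch.upper(), ch.lower()):
            # Skip characters without a distinct single-character counterpart.
            if len(partner) == 1 and partner != ch and ord(partner) not in covered:
                missing.add(ord(partner))
    return sorted(missing)


# Hypothetical coverage: A/a complete, capital open E and open O without lowercase.
covered = {0x0041, 0x0061, 0x0190, 0x0186}
print([f"U+{cp:04X}" for cp in missing_case_partners(covered)])
```

With this sample set the check reports U+0254 and U+025B, both of which sit in IPA Extensions while their capitals sit in Latin Extended-B, which is exactly the situation the rule is meant to avoid.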

Bruno Maag (bruno-daltonmaag) wrote :

Like David, I can totally appreciate arguments from all sides and see their merit. I understand that some glyphs which are part of the current glyphsets are in Unicode blocks other than Lat, Grk or Cyr, and are partial only. My reasoning for Unicode blocks is that it *is* a defined standard but at the same time I am happy enough to agree to a different mechanism of extending the Ubuntu font family, such as doing it per script system, just as long as we follow complete character sets.

Ben, it seems you have misunderstood my meaning about 'descending into chaos'. The problem is not with us, the professional font foundry - the problem lies in the fact that in future the fonts will be extended by the community. To ensure that the fonts remain a coherent and usable tool for the future, their development has to be carefully co-ordinated. Remember, please, that the font suite consists not only of four font styles but of 13, all of which will eventually be available, if I am not mistaken. Whilst we will still be involved in the future, our involvement will be more in a guiding and advisory capacity.

I am currently working on a proposal for how to extend the fonts: which scripts first, where the resources are, etc. In parallel we are now putting a spec together which we hope to discuss with the libre folks in regards to updating FontForge to assist with this huge project.

David Marshall (dave-daltonmaag) wrote : Re: [Bug 670758] Re: Expansion: IPA and Combining Diacritcals to cover more Latin-based African languages

I should probably be clear that I'm not arguing that we should avoid
adding key individual characters to the fonts - but that if we are
making strategic additions, we should bite the bullet and add whole
Unicode ranges, especially in those ranges which mainly involve Latin
composites.

Dave

Denis Moyogo Jacquerye (moyogo) wrote :

The bug title has been changed to "Expansion: IPA and Combining Diacritcals to cover more Latin-based African languages", however this is inaccurate: the required glyphs are in IPA, Latin Extended-C and Latin Extended-D, and composites can be found in Latin Extended Additional.

Come to think of it, combining diacritics placement should be a separate improvement bug that this one depends on, considering other languages or scripts need the feature (нога́ in Russian for example).

Paul Sladen (sladen) wrote :

Denis: Yup, sure, lots of other things need expansion... my impression with the original description is that this was /focused/ specifically on $some written African languages; and the codepoints specifically pointed to (but which aren't done yet) are in the IPA and Combining Diacritical blocks.

Denis Moyogo Jacquerye (moyogo) wrote :

Bruno,
The characters listed come from decrees setting national alphabets (Benin, Burkina Faso, Chad, Mali, Nigeria, Senegal) and orthography standards set by national linguists' organizations (Cameroon, Congo-Kinshasa), or by pan-African linguists (African Alphabet and African Reference Alphabet). Other sources include Hartell's Alphabets of Africa, many SIL alphabetization books, dictionaries, language learning books or proposals to encode in Unicode.

Unicode blocks are not character sets; things like MES-1, MES-2 and MES-3B are. Character sets are subsets of Unicode and very often have characters in more than one Unicode block. The fact that uppercase and lowercase of the same letter can be in different blocks should make that obvious. However, it is true that some character sets have been encoded in Unicode as blocks.
I'm not arguing one shouldn't work by block, I'm just arguing it's not the most practical approach from a language coverage point of view.

Paul,
Yes, most non-composite characters are in the IPA block (27), and diacritics in the Combining diacriticals block (16).
But you can find in the Latin Extended-C block:
U+2C64 LATIN CAPITAL LETTER R WITH TAIL used in Sudan
U+2C6D LATIN CAPITAL LETTER ALPHA used in Cameroon
U+2C72 LATIN CAPITAL LETTER W WITH HOOK and U+2C73 LATIN SMALL LETTER W WITH HOOK used in Burkina Faso

in the Latin Extended-D block:
U+A789 MODIFIER LETTER COLON used in Congo-Kinshasa, Kenya and Côte d’Ivoire
U+A78A MODIFIER LETTER SHORT EQUALS SIGN used in Congo-Kinshasa
U+A78D LATIN CAPITAL LETTER TURNED H used in Liberia

In the Combining Diacritical Marks Supplement block:
U+1DC6 COMBINING MACRON-GRAVE and U+1DC7 COMBINING ACUTE-MACRON used in Nigeria

In the Spacing Modifier Letters block:
U+02D7 MODIFIER LETTER MINUS SIGN and U+02EE MODIFIER LETTER DOUBLE APOSTROPHE used in Côte d’Ivoire

In Latin Extended Additional block:
The 60 precomposed characters used in Nigeria, South Africa and others.
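
The individual code points named in the list above can be cross-checked against the Unicode character database. This is a minimal sketch, assuming Python's standard unicodedata module; the helper name describe is hypothetical, and the names returned depend on the Unicode version bundled with the Python build.

```python
import unicodedata

# Code points cited above from Latin Extended-C, Latin Extended-D,
# Combining Diacritical Marks Supplement and Spacing Modifier Letters.
CODE_POINTS = [
    0x2C64, 0x2C6D, 0x2C72, 0x2C73,
    0xA789, 0xA78A, 0xA78D,
    0x1DC6, 0x1DC7,
    0x02D7, 0x02EE,
]


def describe(cp: int) -> str:
    """Format a code point as 'U+XXXX NAME', with a fallback if it is unnamed."""
    name = unicodedata.name(chr(cp), "<no name in this Unicode version>")
    return f"U+{cp:04X} {name}"


for cp in CODE_POINTS:
    print(describe(cp))
```

A listing like this makes it easy to confirm that a character list and a font's coverage report are talking about the same code points.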

Matthew Paul Thomas (mpt) wrote :

I've found that bug reports are least confusing when they concentrate on describing problems to be solved, rather than pieces of work to be done.

For example, "Can't print all of pan-Nigerian in Ubuntu font" is a problem to be solved. "Cover more Latin-based African languages" is not. If you focus too much on the work rather than the problem, you can easily end up doing work that isn't necessary (for example here, having the introduction of contemporary characters bogged down by historical characters), discouraging contributors, or not being sure when to mark the bug as fixed.

It may be a good idea to have consistency requirements like the one Denis mentions, requiring all variants of an individual character (e.g. upper-case and lower-case) at the same time, or requiring all weights of a character at the same time. But that is something to be documented elsewhere, not something to be tracked by bugs reported by people who don't necessarily know those requirements.

So, I think this bug report would be most effective if it was split up into one per language. I've started by reporting bug 1396511 on Lingala.

affects: ubuntu-font-family → fonts-ubuntu (Ubuntu)
Changed in fonts-ubuntu (Ubuntu):
milestone: later → none
status: Incomplete → Confirmed