Hi everybody,
Another interesting perspective on categorizing African orthographies is offered by Conrad Taylor. See http://www.ideography.co.uk/library/afrolingua.html
In my own work in language technology, however, I do have problems with the lack of unique code points for high- and low-tone subdotted vowels. This creates ambiguity because they can be achieved in more than one way: by subdotting a tone-marked vowel, or by tone-marking a subdotted vowel. Both look exactly the same to a human reader, but it takes extra lines of code for a computer to see both as the same. It starts getting distracting when you consider that this has to be catered for in both lower and upper case.
The response to requests for these code points is that such English digraphs as `sh` do not have code points. This totally misses the point because `sh` and `hs` do not look alike in any way. If we accept that Africans need to do more than read texts produced on a computer, if we accept that Africans need to take full advantage of developments in language technology, then Unicode should concede these code points to the relevant languages.
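To make the ambiguity concrete, here is a minimal sketch in Python of the kind of extra code I mean. It assumes the standard `unicodedata` module and uses a subdotted o with a high tone as the example; the helper name `same_text` is only for illustration. The two spellings differ as raw strings and only compare equal after both are normalized.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after Unicode canonical normalization (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Lower case: tone-marking a subdotted vowel vs. subdotting a tone-marked vowel.
subdotted_then_tone = "\u1ecd\u0301"  # o with dot below + combining acute
tone_then_subdotted = "\u00f3\u0323"  # o with acute + combining dot below

# Upper case: the same two routes.
upper_subdotted_then_tone = "\u1ecc\u0301"  # O with dot below + combining acute
upper_tone_then_subdotted = "\u00d3\u0323"  # O with acute + combining dot below

# The raw strings differ even though they render identically.
print(subdotted_then_tone == tone_then_subdotted)

# After normalization they compare equal, in both cases.
print(same_text(subdotted_then_tone, tone_then_subdotted))
print(same_text(upper_subdotted_then_tone, upper_tone_then_subdotted))
```

Every comparison, search and sort has to go through a step like this, which is exactly the extra work a single code point would remove.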
Tunde
-----------------------------------------------------------------------------------------------
Tunde Adegbola (Ph.D.)
Executive Director
African Languages Technology Initiative
(Alt-I ... Inserting African issues into the agenda of the knowledge age)
President
Tiwa Systems Ltd.
11 Oluyole Way, New Bodija Ibadan, Nigeria.
+234 8034019398
------------------------------------------------------------------------------------------------
> Date: Thu, 20 Dec 2007 00:26:33 -0500
> From: charles.riley@xxxxxxxx
> To: dzo@xxxxxxxxxxxx
> CC: a12n-collaboration@xxxxxxxxxxxx; dwanders@xxxxxxxxx; africa@xxxxxxxxxxx; lisam@xxxxxxxxxx
> Subject: [A12n-Collab] Re: [africa] 5 categories of African orthographies (Latin)
>
> To the question of how well the Unicode standard supports Latin-based African orthographies--
>
> Jim Agenbroad, retired from the Library of Congress, drew up a nine-page report a few years ago comparing the completeness of the Unicode repertoire back to Latin-based character repertoires in use in Africa. I think the main gaps as of that time were: no precomposed r with middle stroke (Kanuri), no s with a palatal hook as used in some of the languages of Sudan, and gaps in capitals for some extended latin characters. Since then, both the s with the palatal hook and the Kanuri r have been brought in, and many of the capitals (glottal stop, etc.) have been filled in the Unicode standard.
>
> The main reference source he used was Rhonda Hartell's 1993 "Alphabets des Langues Africaines"; I think there were some remaining questions as to whether that was a comprehensive source, but it was a great start.
>
> The implementation question is a separate issue, although Agenbroad's report hinted at this as well, noting that it's problematic that the extended latin is scattered throughout in different parts of the standard. Grouping access to subsets of extended latin characters, by means of language-specific tools, largely still has yet to happen. Some of the limiting factors include demand and expertise--according to the Ethnologue, the median number of speakers per language in Africa is around 25,000. Native American communities, where extended-latin orthographies are also heavily used, are faced with similar constraints.
>
> As to the categories outlined below, Cat 1 and Cat 2 are pretty well taken care of, and this includes several of the more widely spoken languages: Kiswahili, Zulu, Tswana, Kinyarwanda, Sotho, Igbo, Xhosa, Tsonga, Northern Sotho, Ndebele (South Africa) and Ndebele (Zimbabwe), Shona, Kirundi, Oromo (in Qubee script) and the Latin-based orthography for Somali, among others.
>
> Cat 3, as outlined below, would in principle include Venda but not Chinyanja or Yoruba. The extra characters that Venda uses have all been encoded as precomposed characters (l, n, d, and t with circumflex below, including capitals)--but these have not been widely implemented in system font support, so often the combining sequence is used. The w with circumflex for Chinyanja and the double diacritic behavior seen in Yoruba put them in Cat 4. Kanuri, like Venda, is Cat 3 when going by the Unicode Standard, but Cat 4 in practice when it comes to implementation, as its precomposed r with stroke (024C & 024D) is not widely supported in system fonts.
>
> Other good test cases include Serer (availability of 01AC & 01AD) and the Khoesan languages (availability of specialized click consonants): although their characters are fully encoded into the standard, they may still be de facto in Cat 5 as long as fonts and input are lacking or not easily accessible.
>
> There's a document posted at SIL's site, "SIL Corporate PUA Character Assignments":
>
> http://tinyurl.com/yoerxb (downloads as a 1.1MB ZIP file)
>
> This would be one source to consider in determining which languages still fall under Category 5, with extra Latin-based characters not yet encoded.
>
> Happy holidays!
>
> Chuck Riley
>
> Quoting Don Osborn <dzo@xxxxxxxxxxxx>:
>
> > At various points we and others have discussed categories of issues with regard to handling orthographies of African languages, and how well Unicode and its implementations support these. In particular, the numerous Latin-based orthographies tend to fall into 4 or 5 categories. I'm not aware of any formal terminology or frame of reference to easily refer to these so would like to propose the following categories of Latin-based orthographies. I've been thinking along these lines, with that thinking prompted in part by a re-reading of an old article by Laurent Bourbeau and François Pinard. I'd responded recently to a question by Lisa Moore and Debbie Anderson about how many African languages are supported by Unicode by breaking the orthography issues down and turning the question around - how many are *not* supported by Unicode (the answer is very few, but that support for orthographies using combining diacritics is still an issue).
> >
> > This formulation below was refined in an IM chat I had with Andrew Cunningham yesterday, with thanks to him for feedback:
> >
> > Category 1 orthographies: ASCII - all characters and combinations covered by the ASCII character set
> >
> > Category 2 orthographies: Latin-1, meaning all characters and combinations covered by characters in ISO/IEC 8859-1 / Windows 1252
> >
> > Category 3 orthographies: Extended-Latin with no combining diacritics, meaning that the orthographies are covered by the Latin ranges of Unicode without need to use combining diacritics. Here there may be issues with systems for input and available fonts coverage.
> >
> > Category 4 orthographies: Latin as complex script, meaning the orthographies are covered by Extended-Latin with use of combining diacritics. Here there are issues with input, fonts and rendering that are not encountered in the above.
> >
> > Category 5 orthographies: Orthographies not fully supported by Unicode, which at this point would mean a missing character (these are probably very few).
> >
> > Obviously this is somewhat simplified, since for example a few orthographies might be both 3 & 4, meaning that they have diacritics for indicating tone, but that these are not always used - so for some uses only the support issues for category 3 are needed to get going, but category 4 issues are there for diacritic support. (Manding languages, or at least Bambara in Mali, would be an example). So in some cases one might say that a language is category 3 for full, ordinary usage and category 4 for optimal coverage.
> >
> > Also there are also degrees of complexity. Wolof for instance is category 3 because of its use of the "eng" letter (upper & lower case). On the other hand, Hausa and Fula use several characters in the extended ranges. Nevertheless, category would be determined for what is necessary for full support.
> >
> > For many uses, categories 1 & 2 are practically the same, but the latter does occasionally pose some issues wrt display and input (and in some cases still programming??).
> >
> > For reference, Category 2 would also include French orthography and category 1, English. Vietnamese would I think be category 4.
> >
> > All the above of course refers only to Latin-based orthographies. There are issues with orthographies based on Arabic script and Ge'ez/Ethiopic abugida, which have extended ranges of their own, as well as with dicritic support in N'Ko and issues regarding possible additional characters in Tifinagh. Not to mention some lesser-used writing systems that are not yet in Unicode.
> >
> > Anyway, I put this forth for discussion. Whatever the case I think it is useful to have a common typology for referring to the orthographies of African languages, and to begin to classify these according to that for further support, enabling, and localization issues.
> >
> > Don Osborn
> > Bisharat.net
> > PanAfriL10n.org
>
> _______________________________________________
> A12n-collaboration mailing list
> A12n-collaboration@xxxxxxxxxxxx
> http://lists.kabissa.org/mailman/listinfo/a12n-collaboration