a12n-collaboration Mailing List Archive: [A12n-Collab] Re: [africa] 5 categories of African orthographies (Latin)
On the question of how well the Unicode Standard supports Latin-based African orthographies: Jim Agenbroad, retired from the Library of Congress, drew up a nine-page report a few years ago comparing the Unicode repertoire against the Latin-based character repertoires in use in Africa. I think the main gaps as of that time were: no precomposed r with middle stroke (Kanuri), no s with a palatal hook as used in some of the languages of Sudan, and gaps in capitals for some extended Latin characters. Since then, both the s with the palatal hook and the Kanuri r have been brought in, and many of the capitals (glottal stop, etc.) have been filled in. The main reference source he used was Rhonda Hartell's 1993 "Alphabets des Langues Africaines"; I think there were some remaining questions as to whether that was a comprehensive source, but it was a great start.

The implementation question is a separate issue, although Agenbroad's report hinted at this as well, noting that it is problematic that the extended Latin characters are scattered across different parts of the standard. Grouping access to subsets of extended Latin characters, by means of language-specific tools, has largely yet to happen. Some of the limiting factors include demand and expertise: according to the Ethnologue, the median number of speakers per language in Africa is around 25,000. Native American communities, where extended-Latin orthographies are also heavily used, face similar constraints.

As to the categories outlined below, Cat 1 and Cat 2 are pretty well taken care of, and this includes several of the more widely spoken languages: Kiswahili, Zulu, Tswana, Kinyarwanda, Sotho, Igbo, Xhosa, Tsonga, Northern Sotho, Ndebele (South Africa) and Ndebele (Zimbabwe), Shona, Kirundi, Oromo (in Qubee script) and the Latin-based orthography for Somali, among others.

Cat 3, as outlined below, would in principle include Venda but not Chinyanja or Yoruba.
The extra characters that Venda uses have all been encoded as precomposed characters (l, n, d, and t with circumflex below, including capitals), but these are not widely supported in system fonts, so the combining sequence is often used instead. The w with circumflex for Chinyanja and the double diacritic behavior seen in Yoruba put them in Cat 4. Kanuri, like Venda, is Cat 3 when going by the Unicode Standard, but Cat 4 in practice when it comes to implementation, as its precomposed r with stroke (U+024C and U+024D) is not widely supported in system fonts.

Other good test cases include Serer (availability of U+01AC and U+01AD) and the Khoesan languages (availability of specialized click consonants): although their characters are fully encoded in the standard, they may still be de facto in Cat 5 as long as fonts and input methods are lacking or not easily accessible.

There's a document posted at SIL's site, "SIL Corporate PUA Character Assignments": http://tinyurl.com/yoerxb (downloads as a 1.1MB ZIP file). This would be one source to consider in determining which languages still fall under Category 5, with extra Latin-based characters not yet encoded.

Happy holidays!

Chuck Riley

Quoting Don Osborn <dzo@xxxxxxxxxxxx>:

At various points we and others have discussed categories of issues with regard to handling orthographies of African languages, and how well Unicode and its implementations support these. In particular, the numerous Latin-based orthographies tend to fall into 4 or 5 categories. I'm not aware of any formal terminology or frame of reference for referring to these, so I would like to propose the following categories of Latin-based orthographies. I've been thinking along these lines, prompted in part by re-reading an old article by Laurent Bourbeau and François Pinard.
I'd responded recently to a question by Lisa Moore and Debbie Anderson about how many African languages are supported by Unicode by breaking the orthography issues down and turning the question around: how many are *not* supported by Unicode? (The answer is very few, but support for orthographies using combining diacritics is still an issue.) The formulation below was refined in an IM chat I had with Andrew Cunningham yesterday, with thanks to him for feedback:

Category 1 orthographies: ASCII - all characters and combinations covered by the ASCII character set.

Category 2 orthographies: Latin-1, meaning all characters and combinations covered by characters in ISO/IEC 8859-1 / Windows 1252.

Category 3 orthographies: Extended-Latin with no combining diacritics, meaning that the orthographies are covered by the Latin ranges of Unicode without need to use combining diacritics. Here there may be issues with input systems and font coverage.

Category 4 orthographies: Latin as complex script, meaning the orthographies are covered by Extended-Latin with use of combining diacritics. Here there are issues with input, fonts and rendering that are not encountered in the above.

Category 5 orthographies: Orthographies not fully supported by Unicode, which at this point would mean a missing character (these are probably very few).

Obviously this is somewhat simplified, since for example a few orthographies might be both 3 & 4, meaning that they have diacritics for indicating tone, but that these are not always used. For some uses, resolving the category 3 support issues is enough to get going, but the category 4 issues remain for diacritic support. (Manding languages, or at least Bambara in Mali, would be an example.) So in some cases one might say that a language is category 3 for full, ordinary usage and category 4 for optimal coverage.

There are also degrees of complexity.
Wolof for instance is category 3 because of its use of the "eng" letter (upper & lower case). On the other hand, Hausa and Fula use several characters in the extended ranges. Nevertheless, the category would be determined by what is necessary for full support.

For many uses, categories 1 & 2 are practically the same, but the latter does occasionally pose some issues with respect to display and input (and in some cases still programming??). For reference, category 2 would also include French orthography, and category 1 English. Vietnamese would, I think, be category 4.

All the above of course refers only to Latin-based orthographies. There are issues with orthographies based on Arabic script and the Ge'ez/Ethiopic abugida, which have extended ranges of their own, as well as with diacritic support in N'Ko and issues regarding possible additional characters in Tifinagh. Not to mention some lesser-used writing systems that are not yet in Unicode.

Anyway, I put this forth for discussion. Whatever the case, I think it is useful to have a common typology for referring to the orthographies of African languages, and to begin to classify these accordingly for further support, enabling, and localization work.

Don Osborn
Bisharat.net
PanAfriL10n.org
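
Editorial sketch (not part of either message): the first four categories proposed above can be approximated mechanically from a list of an orthography's letters. The Python below is a minimal illustration under stated assumptions, not an established tool. The function name `orthography_category` and the sample strings are hypothetical, Windows-1252 is used as the Category 2 test, and a letter counts toward Category 4 only if a combining mark remains after NFC composition.

```python
import unicodedata


def orthography_category(alphabet: str) -> int:
    """Return 1, 2, 3 or 4 for the letters in `alphabet` (hypothetical helper)."""
    # Compose wherever Unicode has a precomposed form, so that only marks
    # which genuinely need a combining sequence are left over.
    text = unicodedata.normalize("NFC", alphabet)

    # Category 4: at least one combining mark survives composition.
    if any(unicodedata.category(ch).startswith("M") for ch in text):
        return 4
    # Category 1: everything is plain ASCII.
    if text.isascii():
        return 1
    # Category 2: assumption - Windows-1252 stands in for "Latin-1".
    try:
        text.encode("cp1252")
        return 2
    except UnicodeEncodeError:
        # Category 3: encoded in Unicode, beyond Latin-1, no combining marks.
        return 3


# Illustrative letter samples only, not complete alphabets.
samples = {
    "Kiswahili": "abcdefghijklmnoprstuvwyz",
    "French (accented letters)": "\u00e9\u00e8\u00e0\u00e7",
    "Wolof eng": "\u014a\u014b",
    "Venda d + circumflex below": "d\u032d",
    "Yoruba e + dot below + acute": "e\u0323\u0301",
}
for name, letters in samples.items():
    print(name, orthography_category(letters))
```

This checks the encoding side only. It cannot detect Category 5, since a character missing from Unicode cannot appear in the input string, and it says nothing about whether fonts or keyboards for a given letter are actually available, which is the practical concern raised in the reply.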
Last Updated: Fri Dec 21 08:20:25 2007
a12n-collaboration is hosted on Kabissa - Space for Change in Africa