Mailing List Hosted on Kabissa - Space for Change in Africa

a12n-collaboration Mailing List Archive: [A12n-Collab] Re: [africa] 5 categories of African orthographies (Latin)


  • Subject: [A12n-Collab] Re: [africa] 5 categories of African orthographies (Latin)
  • From: charles.riley@xxxxxxxx
  • Date: Thu, 20 Dec 2007 00:26:33 -0500

To the question of how well the Unicode standard supports Latin-based African
orthographies--

Jim Agenbroad, retired from the Library of Congress, drew up a nine-page report
a few years ago comparing the Unicode repertoire against the Latin-based
character repertoires in use in Africa.  I think the main gaps as of that time
were:  no precomposed r with middle stroke (Kanuri), no s with a palatal hook as
used in some of the languages of Sudan, and gaps in capitals for some extended
Latin characters.  Since then, both the s with the palatal hook and the Kanuri r
have been brought in, and many of the missing capitals (glottal stop, etc.) have
been added to the Unicode standard.

The main reference source he used was Rhonda Hartell's 1993 "Alphabets des
Langues Africaines"; I think there were some remaining questions as to whether
that was a comprehensive source, but it was a great start.

The implementation question is a separate issue, although Agenbroad's report
hinted at this as well, noting that it's problematic that the extended Latin
characters are scattered across different parts of the standard.  Grouping
subsets of extended Latin characters for easy access, by means of
language-specific tools, has largely yet to happen.  Some of the limiting
factors include demand and expertise--according to the Ethnologue, the median
number of speakers per language in Africa is around 25,000.  Native American
communities, where extended-Latin orthographies are also heavily used, face
similar constraints.

As to the categories outlined below, Cat 1 and Cat 2 are pretty well taken care
of, and this includes several of the more widely spoken languages:  Kiswahili,
Zulu, Tswana, Kinyarwanda, Sotho, Igbo, Xhosa, Tsonga, Northern Sotho, Ndebele
(South Africa) and Ndebele (Zimbabwe), Shona, Kirundi, Oromo (in Qubee script)
and the Latin-based orthography for Somali, among others.

Cat 3, as outlined below, would in principle include Venda but not Chinyanja or
Yoruba.  The extra characters that Venda uses have all been encoded as
precomposed characters (l, n, d, and t with circumflex below, including
capitals)--but these are not widely supported in system fonts, so the combining
sequence is often used instead.  The w with circumflex for Chinyanja and the
double diacritic behavior seen in Yoruba put them in Cat 4.  Kanuri, like Venda,
is Cat 3 when going by the Unicode Standard, but Cat 4 in practice when it comes
to implementation, as its precomposed r with stroke (U+024C & U+024D) is not
widely supported in system fonts.
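
To illustrate the precomposed-versus-combining point for Venda and Kanuri, here
is a minimal sketch using Python's standard unicodedata module.  The helper
names (precompose, decompose) are just for this example.  It assumes the
combining sequences are written with U+032D (combining circumflex accent
below), and it says nothing about whether a given font can display the result.

```python
import unicodedata

CIRCUMFLEX_BELOW = "\u032D"  # COMBINING CIRCUMFLEX ACCENT BELOW


def precompose(text):
    """Normalize to NFC, which prefers precomposed characters where they exist."""
    return unicodedata.normalize("NFC", text)


def decompose(text):
    """Normalize to NFD, which splits precomposed characters into base + marks."""
    return unicodedata.normalize("NFD", text)


# Venda letters typed as combining sequences: base letter + circumflex below.
venda_sequences = [base + CIRCUMFLEX_BELOW for base in "dlntDLNT"]

for seq in venda_sequences:
    composed = precompose(seq)
    # Each two-code-point sequence collapses to one precomposed code point.
    print(f"{len(seq)} code points -> U+{ord(composed):04X} "
          f"{unicodedata.name(composed)}")

# The Kanuri r with stroke is encoded as an atomic letter with no canonical
# decomposition, so NFD leaves it unchanged.
kanuri_r = "\u024D"
print(unicodedata.name(kanuri_r), decompose(kanuri_r) == kanuri_r)
```

Normalizing to NFC before storing or comparing text means that a Venda word
typed with combining sequences and the same word typed with precomposed
characters end up identical, whichever form the input method produced.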

Other good test cases include Serer (availability of U+01AC & U+01AD) and the
Khoesan languages (availability of specialized click consonants): although
their characters are fully encoded in the standard, they may still be de facto
in Cat 5 as long as fonts and input methods are lacking or not easily
accessible.

There's a document posted at SIL's site, "SIL Corporate PUA Character
Assignments":

http://tinyurl.com/yoerxb (downloads as a 1.1MB ZIP file)

This would be one source to consider in determining which languages still fall
under Category 5, with extra Latin-based characters not yet encoded.

Happy holidays!

Chuck Riley





Quoting Don Osborn <dzo@xxxxxxxxxxxx>:

At various points we and others have discussed categories of issues with
regard to handling orthographies of African languages, and how well Unicode
and its implementations support these. In particular, the numerous
Latin-based orthographies tend to fall into 4 or 5 categories. I'm not aware
of any formal terminology or frame of reference to easily refer to these so
would like to propose the following categories of Latin-based orthographies.
I've been thinking along these lines, with that thinking prompted in part by
a re-reading of an old article by Laurent Bourbeau and François Pinard. I'd
responded recently to a question by Lisa Moore and Debbie Anderson about how
many African languages are supported by Unicode by breaking the orthography
issues down and turning the question around - how many are *not* supported
by Unicode (the answer is very few, though support for orthographies using
combining diacritics is still an issue).



This formulation below was refined in an IM chat I had with Andrew
Cunningham yesterday, with thanks to him for feedback:



Category 1 orthographies:  ASCII - all characters and combinations covered
by the ASCII character set



Category 2 orthographies:   Latin-1, meaning all characters and combinations
covered by characters in ISO/IEC 8859-1 / Windows 1252



Category 3 orthographies:  Extended-Latin with no combining diacritics,
meaning that the orthographies are covered by the Latin ranges of Unicode
without need to use combining diacritics. Here there may be issues with
input systems and available font coverage.



Category 4 orthographies:  Latin as complex script, meaning the
orthographies are covered by Extended-Latin with use of combining
diacritics. Here there are issues with input, fonts and rendering that are
not encountered in the above.



Category 5 orthographies:  Orthographies not fully supported by Unicode,
which at this point would mean a missing character (these are probably very
few).
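
As a rough illustration, the four encoded categories above could be tested
mechanically against the letters of an orthography.  The sketch below uses only
Python's standard library; the function name orthography_category and the
sample letters are assumptions made up for this example, not an established
tool.  Category 5 cannot be detected this way, since a character missing from
Unicode cannot appear in a string in the first place.

```python
import unicodedata


def _encodable(ch, codec):
    """Return True if the single character ch can be encoded in codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def orthography_category(letters):
    """Return the highest category (1-4) needed by any letter.

    letters is an iterable of strings, one per letter of the orthography.
    Each letter is normalized to NFC first, so combining marks only remain
    where no precomposed character exists.
    """
    category = 1
    for letter in letters:
        for ch in unicodedata.normalize("NFC", letter):
            if _encodable(ch, "ascii"):
                needed = 1
            elif _encodable(ch, "latin-1") or _encodable(ch, "cp1252"):
                needed = 2
            elif unicodedata.category(ch).startswith("M"):
                needed = 4  # a combining mark survived NFC
            else:
                needed = 3  # extended Latin, single code point
            category = max(category, needed)
    return category


print(orthography_category("abcdefghijklmnoprstuvwyz"))   # plain ASCII letters
print(orthography_category(["\u00E9", "\u00E7"]))         # e-acute, c-cedilla
print(orthography_category(["\u014B", "\u014A"]))         # eng, as in Wolof
print(orthography_category(["e\u0323\u0301"]))            # e + dot below + acute
```

This only looks at what the standard encodes.  It does not reflect font or
input support, which can differ from the category on paper.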





Obviously this is somewhat simplified, since a few orthographies might be
both 3 & 4: they have diacritics for indicating tone, but these are not
always used. For some uses only the category 3 support issues need to be
resolved to get going, while the category 4 issues remain for diacritic
support (Manding languages, or at least Bambara in Mali, would be an
example). So in some cases one might say that a language is category 3 for
full, ordinary usage and category 4 for optimal coverage.



There are also degrees of complexity. Wolof for instance is category 3
because of its use of the "eng" letter (upper & lower case). On the other
hand, Hausa and Fula use several characters in the extended ranges.
Nevertheless, the category would be determined by what is necessary for
full support.



For many uses, categories 1 & 2 are practically the same, but the latter
does occasionally pose some issues wrt display and input (and in some cases
still programming??).



For reference, Category 2 would also include French orthography and category
1, English. Vietnamese would I think be category 4.



All the above of course refers only to Latin-based orthographies. There are
issues with orthographies based on Arabic script and Ge'ez/Ethiopic abugida,
which have extended ranges of their own, as well as with diacritic support in
N'Ko and issues regarding possible additional characters in Tifinagh. Not to
mention some lesser-used writing systems that are not yet in Unicode.



Anyway, I put this forth for discussion. Whatever the case, I think it is
useful to have a common typology for referring to the orthographies of
African languages, and to begin classifying them accordingly in order to
address support, enabling, and localization issues.



Don Osborn

Bisharat.net

PanAfriL10n.org











Last Updated: Fri Dec 21 08:20:25 2007
