At various points we and others have discussed categories of
issues with regard to handling orthographies of African languages, and how well
Unicode and its implementations support these. In particular, the numerous Latin-based
orthographies tend to fall into 4 or 5 categories. I'm not aware of any formal
terminology or frame of reference for referring to these, so I would like to propose
the following categories of Latin-based orthographies. My thinking along
these lines was prompted in part by re-reading an old article
by Laurent Bourbeau and François Pinard. I also responded recently to a question
from Lisa Moore and Debbie Anderson about how many African languages are
supported by Unicode by breaking the orthography issues down and turning the
question around: how many are *not* supported by Unicode? (The answer is very
few, but support for orthographies using combining diacritics is still an
issue.)
The formulation below was refined in an IM chat I had with
Andrew Cunningham yesterday, with thanks to him for feedback:
Category 1 orthographies: ASCII - all characters and
combinations covered by the ASCII character set
Category 2 orthographies: Latin-1, meaning all characters
and combinations covered by characters in ISO/IEC 8859-1 / Windows-1252
Category 3 orthographies: Extended-Latin with no combining
diacritics, meaning that the orthographies are covered by the Latin ranges of
Unicode without any need for combining diacritics. Here there may be issues with
input systems and font coverage.
Category 4 orthographies: Latin as complex script, meaning
the orthographies are covered by Extended-Latin with use of combining
diacritics. Here there are issues with input, fonts and rendering that are not
encountered in the above.
Category 5 orthographies: Orthographies not fully supported
by Unicode, which at this point would mean a missing character (these are
probably very few).
Obviously this is somewhat simplified. For example, a
few orthographies might be both 3 & 4: they have diacritics
for indicating tone, but these are not always used. For some uses, then, only
the category 3 support issues need to be solved to get going, while the category 4
issues remain for diacritic support. (Manding languages, or at least Bambara in
Mali, would be an example.) So in some cases one might say that a language is
category 3 for full, ordinary usage and category 4 for optimal coverage.
There are also degrees of complexity. Wolof, for
instance, is category 3 because of its use of the "eng" letter (upper
& lower case). On the other hand, Hausa and Fula use several characters in
the extended ranges. Nevertheless, the category would be determined by what is
necessary for full support.
For many uses, categories 1 & 2 are practically the
same, but the latter does occasionally pose some issues with respect to display
and input (and, in some cases, perhaps still programming?).
For reference, category 2 would also include French orthography,
and category 1 English. Vietnamese would, I think, be category 4.
All the above of course refers only to Latin-based
orthographies. There are issues with orthographies based on Arabic script and
the Ge'ez/Ethiopic abugida, which have extended ranges of their own, as well as
with diacritic support in N'Ko and issues regarding possible additional
characters in Tifinagh, not to mention some lesser-used writing systems that
are not yet in Unicode.
Anyway, I put this forth for discussion. Whatever the case, I
think it would be useful to have a common typology for referring to the
orthographies of African languages, and to begin classifying them accordingly
to inform further support, enabling, and localization work.
Don Osborn
Bisharat.net
PanAfriL10n.org