Mailing List Hosted on Kabissa - Space for Change in Africa

a12n-collaboration Mailing List Archive: Re: [A12n-Collab] Re: [africa] 5 categories of African orthographies (Latin)

  • Subject: Re: [A12n-Collab] Re: [africa] 5 categories of African orthographies (Latin)
  • From: "Denis Jacquerye" <moyogo@xxxxxxxxx>
  • Date: Sat, 22 Dec 2007 13:16:16 +0100
On Dec 22, 2007 3:15 AM, John Hudson <tiro@xxxxxxxx> wrote:
> Tunde Adegbola wrote:
>
> > In my own work in language technology however, I do have problems with
> > the lack of unique code points for high/low tone sub dotted vowels.
> > This presents ambiguity because they can be achieved in more than one
> > way; by subdotting a tone-marked vowel, or by tone-marking a subdotted
> > vowel.  Both look exactly the same to a human reader but require extra
> > lines of code for a computer to see both as the same.  It starts getting
> > distracting when you consider that this has to be catered for in both
> > lower and upper cases.
>
> As Andrew Cunningham pointed out, handling different character sequences for
> the same typeform is not very difficult, and it is for precisely this reason
> that normalisation exists and is well defined in the Unicode standard. This is
> an issue that affects any situation in which more than one mark is applied to
> a base letter, not just some African orthographies, and since it is a given
> that any combining mark characters may be combined in any quantity with any
> base characters, encoding precomposed combinations not only is not a viable
> option but simply shifts the normalisation issue into a comparison of
> precomposed and decomposed strings instead of comparison of variant decomposed
> strings.

The main reason people advocate adding precomposed characters is that
precomposed characters from legacy encodings are very well supported
compared to composed characters (a base letter plus combining marks).
It's rather paradoxical that a simple solution, such as normalization on
input or before comparison, is not widely used.  Too few applications,
on the desktop or online, do it. Unicode support does not, in general,
imply normalization.
It's rather disappointing and discouraging to see companies with i18n
teams doing great work but totally failing in that respect. Category 4
orthographies (with composed characters) face numerous basic issues that
Category 3 or 2 orthographies don't, not only at input and display but
also at the data handling level. There really needs to be greater
awareness of the need for normalization.
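
To make the "normalization before comparison" point concrete, here is a
minimal sketch in Python using the standard library's unicodedata module.
The helper name same_text and the choice of NFC are assumptions made for
illustration only, not something proposed in this thread; NFD would serve
equally well for comparison. The example uses e with dot below and acute,
which has no single precomposed code point.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Four canonically equivalent ways to encode "e with dot below and acute".
variants = [
    "e\u0323\u0301",  # e + combining dot below + combining acute
    "e\u0301\u0323",  # e + combining acute + combining dot below
    "\u1EB9\u0301",   # precomposed e-with-dot-below + combining acute
    "\u00E9\u0323",   # precomposed e-with-acute + combining dot below
]

# Without normalization the four strings are all different code point sequences.
raw_distinct = len(set(variants))

# After normalization they collapse to a single sequence.
normalized_distinct = len({unicodedata.normalize("NFC", v) for v in variants})

print(raw_distinct, normalized_distinct)
print(same_text(variants[0], variants[3]))
```

The same call covers upper case without extra code, since normalization
is defined per character sequence and not per orthography.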

Happy holidays,
-- 
Denis Moyogo Jacquerye --- http://home.sus.mcgill.ca/~moyogo
Nkótá ya Kongó míbalé --- http://info-langues-congo.1sd.org/
DejaVu fonts --- http://dejavu.sourceforge.net/
Unicode (UTF-8)
Last Updated: Sun Dec 23 06:56:00 2007
