Normalization Forms for Accented Characters in Java « Intelligrape Groovy & Grails Blogs

Normalization Forms for Accented Characters in Java


Text normalization is the process of standardizing text to a known form so that it can be searched, indexed and otherwise analyzed reliably. When working with large quantities of text we often encounter accented characters such as é and â. Unicode provides more than one way to encode such characters. For example, é can be written as the single precomposed code point \u00E9, or as the letter e followed by the combining acute accent, that is, e + \u0301.
The é looks the same in both representations and means the same thing, but to a Java program they are different sequences of characters, so the two strings do not compare as equal. Clearly we need to normalize these different representations to one fixed standard.
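
To make the mismatch concrete, here is a minimal sketch that compares the two encodings of é directly. The class and constant names are made up for illustration; only standard String methods are used.

```java
public class AccentEqualityDemo {
    // The letter as a single precomposed code point.
    static final String PRECOMPOSED = "\u00E9";
    // The same letter as a base "e" followed by a combining acute accent.
    static final String DECOMPOSED = "e\u0301";

    public static void main(String[] args) {
        // Both strings render as the same glyph, but they hold different chars.
        System.out.println(PRECOMPOSED.equals(DECOMPOSED)); // false
        System.out.println(PRECOMPOSED.length());           // 1
        System.out.println(DECOMPOSED.length());            // 2
    }
}
```

Anything that relies on equals or hashCode (a HashMap key or a search index term, for instance) would treat these two strings as different words.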

This is where the java.text.Normalizer class comes to our rescue. All we need to do is normalize the text to one of these four normalization forms:

  1. NFC – Canonical Decomposition, followed by Canonical Composition.
  2. NFD – Canonical Decomposition
  3. NFKC – Compatibility Decomposition, followed by Canonical Composition
  4. NFKD – Compatibility Decomposition
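
As a rough sketch of how these forms are selected in code, the example below passes each constant of Normalizer.Form to Normalizer.normalize and checks that the two encodings of é become equal afterwards. The class name and the equalAfter helper are hypothetical; Normalizer.normalize and Normalizer.Form are part of the JDK.

```java
import java.text.Normalizer;

public class NormalizerFormsDemo {
    // Hypothetical helper: compare two strings after normalizing both to the same form.
    static boolean equalAfter(String a, String b, Normalizer.Form form) {
        return Normalizer.normalize(a, form).equals(Normalizer.normalize(b, form));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9"; // single code point
        String decomposed = "e\u0301"; // base letter + combining acute accent

        // Normalizer.Form has exactly the four constants NFC, NFD, NFKC, NFKD.
        for (Normalizer.Form form : Normalizer.Form.values()) {
            String normalized = Normalizer.normalize(decomposed, form);
            System.out.println(form + ": length=" + normalized.length()
                    + ", equal=" + equalAfter(precomposed, decomposed, form));
        }
    }
}
```

Whichever form is chosen, the important point is to apply the same form to both sides of a comparison, for example to the indexed text and to the search query.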



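Canonical decomposition means taking a character and decomposing it into its canonically equivalent component characters, for example a base letter followed by its combining marks, with the marks arranged in a specific order.

Compatibility decomposition goes further: in addition to the canonical mappings, it replaces compatibility characters, such as ligatures, with their plain equivalents, and again arranges combining marks in a specific order. Some formatting distinctions can be lost in the process.

Canonical composition means recombining decomposed sequences into precomposed characters wherever a canonically equivalent precomposed character exists.

Canonical equivalence means that characters or character sequences represent the same abstract character and have the same appearance and meaning when printed or displayed.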
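
The difference between canonical and compatibility decomposition can be sketched with the "fi" ligature (U+FB01), which is a compatibility character chosen here purely as an illustration. The class and helper names are hypothetical.

```java
import java.text.Normalizer;

public class CompatibilityDemo {
    // LATIN SMALL LIGATURE FI: has a compatibility mapping but no canonical one.
    static final String LIGATURE_FI = "\uFB01";

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        // Canonical decomposition leaves the ligature untouched.
        System.out.println(nfd(LIGATURE_FI).equals(LIGATURE_FI)); // true
        // Compatibility decomposition replaces it with the two plain letters.
        System.out.println(nfkd(LIGATURE_FI).equals("fi"));       // true
        // For a purely canonical case such as the accented e, both forms agree.
        System.out.println(nfd("\u00E9").equals(nfkd("\u00E9"))); // true
    }
}
```

The compatibility forms are therefore the more aggressive choice: handy for matching and search, but the original distinction (here, that a ligature was used) cannot be recovered afterwards.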


To summarize this in an example:

Consider the Angstrom sign “Å” (U+212B) and the Swedish letter “Å” (U+00C5). Both are expanded by NFD (or NFKD) into “A” followed by the combining ring above (U+0041 and U+030A), which is then recomposed by NFC (or NFKC) into the Swedish letter “Å” (U+00C5). The Swedish letter “Å” is canonically equivalent to the Angstrom sign “Å”: they are printed and displayed exactly the same, even though they are different code points.
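
The Angstrom example can be checked with a short sketch like the one below. The class name and the codePoints helper are made up for illustration, and the code assumes Java 8 or later for the stream calls.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

public class AngstromDemo {
    // Hypothetical helper: render a string as a list of U+XXXX code points.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String angstromSign = "\u212B";  // ANGSTROM SIGN
        String swedishLetter = "\u00C5"; // LATIN CAPITAL LETTER A WITH RING ABOVE

        for (String s : new String[] { angstromSign, swedishLetter }) {
            String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
            String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
            // Both inputs print: U+0041 U+030A -> U+00C5
            System.out.println(codePoints(nfd) + " -> " + codePoints(nfc));
        }
    }
}
```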

Now we know how to normalize Unicode characters to a standard form wherever required.



Sachin Anand

@babasachinanand

sachin[at]intelligrape[dot]com

This entry was posted on June 29th, 2012 at 6:30 pm and is filed under Java tools.
