Text normalization is the process of standardizing text to a particular form so that it can be searched, indexed and otherwise analyzed. When working with large quantities of text we often encounter accented characters such as é and â. Unicode provides more than one way to represent such characters. For example, é can be the single code point \u00E9 (a precomposed character), or it can be the letter e followed by the combining acute accent, that is, e + \u0301.
Both representations look the same and mean the same thing, but to a Java program they are different character sequences, so the two strings do not compare as equal. Clearly we need to normalize these different representations to one standard form.
It is here that the java.text.Normalizer class comes to our rescue. All we need to do is normalize the text to one of these four normalization forms:
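To see the mismatch concretely, here is a minimal sketch that compares the two representations of é as plain Java strings. The class and constant names are made up for illustration.

```java
public class AccentEquality {
    // Precomposed form: a single code point, U+00E9.
    static final String COMPOSED = "\u00E9";
    // Decomposed form: the letter 'e' followed by the combining acute accent, U+0301.
    static final String DECOMPOSED = "e\u0301";

    public static void main(String[] args) {
        // The strings render identically but hold different char sequences.
        System.out.println(COMPOSED.equals(DECOMPOSED)); // false
        System.out.println(COMPOSED.length());           // 1
        System.out.println(DECOMPOSED.length());         // 2
    }
}
```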
- NFC – Canonical Decomposition, followed by Canonical Composition.
- NFD – Canonical Decomposition
- NFKC – Compatibility Decomposition, followed by Canonical Composition
- NFKD – Compatibility Decomposition
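Canonical decomposition means taking a character and decomposing it into its canonically equivalent component characters (for example, é into e and the combining acute accent), with combining marks arranged in a specific order.
Compatibility decomposition goes further: it also replaces compatibility characters, such as the ligature ﬁ (U+FB01), with their plain equivalents (f and i), which can change the visual appearance.
Canonical composition means recombining decomposed sequences into precomposed characters based on their canonical equivalence.
Canonical equivalence, in turn, means that the sequences have the same appearance and meaning when printed or displayed.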
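The difference between the four forms is easier to see on actual code points. The following is a rough sketch that runs one input through every form with Normalizer.normalize. The class name, the sample input and the codePoints helper are assumptions made for this illustration and are not part of the JDK.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NormalizationForms {
    // Hypothetical helper: renders a string as space-separated code points,
    // for example "U+0065 U+0301".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample input: precomposed e-acute (U+00E9) followed by the "fi" ligature (U+FB01).
        String input = "\u00E9\uFB01";
        for (Form form : Form.values()) {
            String normalized = Normalizer.normalize(input, form);
            System.out.println(form + ": " + codePoints(normalized));
        }
    }
}
```

In this sample the canonical forms (NFC, NFD) only compose or decompose the accented letter and leave the ligature alone, while the compatibility forms (NFKC, NFKD) also replace the ligature with the plain letters f and i.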
To summarize all this in an example:
Consider the Angstrom sign “Å” (U+212B) and the Swedish letter “Å” (U+00C5). Both are expanded by NFD (or NFKD) into “A” and the combining ring above (U+0041 and U+030A), which NFC (or NFKC) then recomposes into the Swedish letter “Å” (U+00C5). The Swedish letter “Å” is canonically equivalent to the Angstrom sign “Å”: they are different code points, but they are printed and displayed exactly the same.
Now we know how to normalize Unicode characters to a standard form wherever required.
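The Angstrom example can be checked with a short program. This is only a sketch: the class name, the constants and the canonicallyEqual helper are invented for illustration, and comparing NFC-normalized strings is just one reasonable way to test canonical equivalence.

```java
import java.text.Normalizer;

public class AngstromExample {
    static final String ANGSTROM_SIGN = "\u212B"; // Angstrom sign
    static final String LETTER_A_RING = "\u00C5"; // Latin capital letter A with ring above

    // Hypothetical helper: two strings are treated as canonically equal
    // when their NFC forms are identical.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        // Raw comparison: different code points, so not equal.
        System.out.println(ANGSTROM_SIGN.equals(LETTER_A_RING)); // false
        // After normalization both become U+00C5.
        System.out.println(canonicallyEqual(ANGSTROM_SIGN, LETTER_A_RING)); // true
        // NFD expands the sign into 'A' plus the combining ring above (U+030A).
        String decomposed = Normalizer.normalize(ANGSTROM_SIGN, Normalizer.Form.NFD);
        System.out.println(decomposed.equals("A\u030A")); // true
        // The Angstrom sign itself is not already in NFC.
        System.out.println(Normalizer.isNormalized(ANGSTROM_SIGN, Normalizer.Form.NFC)); // false
    }
}
```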
Sachin Anand
@babasachinanand
sachin[at]intelligrape[dot]com
