Perl Unicode Cookbook: Unicode Normalization

℞ 27: Unicode normalization

Prescription one reminded you to always decompose and recompose Unicode data at the boundaries of your application. Unicode::Normalize can do much more for you. It supports all four Unicode Normalization Forms: NFD, NFC, NFKD, and NFKC.

Normalization, of course, takes Unicode data of arbitrary forms and canonicalizes it to a standard representation. Where a composite character may be built from multiple code points, normalized decomposition arranges those code points in a canonical order. Normalized composition combines those code points into a single precomposed character, where possible. Without this normalization, you can imagine the difficulty of determining whether one string is logically equivalent to another.
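To make the equivalence problem concrete, here is a minimal sketch that compares two spellings of the same letter. The variable names are illustrative only; the single assumption is that both strings hold decoded characters rather than raw bytes.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Two spellings of the same letter: one precomposed code point,
# or a base letter followed by a combining mark.
my $precomposed = "\x{E9}";      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $decomposed  = "e\x{301}";    # U+0065 + U+0301 COMBINING ACUTE ACCENT

# A raw comparison sees two different code point sequences.
my $raw_equal = $precomposed eq $decomposed ? 1 : 0;

# Once both sides are in the same normalization form, they compare equal.
my $nfd_equal = NFD($precomposed) eq NFD($decomposed) ? 1 : 0;
my $nfc_equal = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

printf "raw: %d, NFD: %d, NFC: %d\n", $raw_equal, $nfd_equal, $nfc_equal;
```

Which form you pick matters less than applying the same form to both sides of the comparison.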

Typically, you should render your data into NFD (the canonical decomposition form) on input and NFC (canonical decomposition followed by canonical composition) on output. Using NFKC or NFKD functions improves recall on searches, assuming you’ve already done the same normalization to the text to be searched.
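The advice above might look like the following sketch. The sample strings and variable names are made up for illustration, and a real program would also decode its input and encode its output at the same boundaries.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD NFKD);

# Input boundary: decompose whatever arrives.
my $input        = NFD("caf\x{E9}");    # "cafe" followed by U+0301
my $input_length = length $input;       # counts code points

# Output boundary: recompose before the data leaves.
my $output        = NFC($input);        # "caf" followed by U+00E9
my $output_length = length $output;

# Searching: U+FB01 LATIN SMALL LIGATURE FI has only a compatibility
# decomposition, so NFD leaves it alone while NFKD folds it to "fi".
my $text  = "\x{FB01}nd";               # "find" spelled with the ligature
my $query = "find";
my $found_canonical     = index(NFD($text),  NFD($query))  >= 0 ? 1 : 0;
my $found_compatibility = index(NFKD($text), NFKD($query)) >= 0 ? 1 : 0;

printf "input: %d, output: %d, canonical hit: %d, compatibility hit: %d\n",
    $input_length, $output_length, $found_canonical, $found_compatibility;
```

Compatibility forms discard distinctions such as ligatures, so they suit search keys better than data you intend to store or display.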

Note that this normalization is about much more than just splitting or joining pre-combined compatibility glyphs; it also reorders marks according to their canonical combining classes and weeds out singletons.

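 use Unicode::Normalize;
 # $orig is assumed to hold decoded characters, not raw bytes
 my $nfd  = NFD($orig);    # canonical decomposition
 my $nfc  = NFC($orig);    # canonical decomposition, then canonical composition
 my $nfkd = NFKD($orig);   # compatibility decomposition
 my $nfkc = NFKC($orig);   # compatibility decomposition, then canonical composition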
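The reordering and singleton behavior mentioned above can be seen in a short sketch. The code points are chosen purely as examples, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Reordering: U+0301 COMBINING ACUTE ACCENT has canonical combining class 230,
# U+0323 COMBINING DOT BELOW has class 220, so NFD sorts the dot first.
my $acute_first = "a\x{301}\x{323}";
my $dot_first   = "a\x{323}\x{301}";
my $reordered   = NFD($acute_first) eq $dot_first ? 1 : 0;

# Singleton: U+212B ANGSTROM SIGN decomposes to the single code point U+00C5,
# so NFC never produces U+212B and returns U+00C5 instead.
my $angstrom_nfc = sprintf "U+%04X", ord(NFC("\x{212B}"));

# NFD goes one step further and splits U+00C5 into "A" plus U+030A.
my $angstrom_nfd = join " ",
    map { sprintf("U+%04X", ord $_) } split //, NFD("\x{212B}");

print "reordered: $reordered\n";
print "NFC of U+212B: $angstrom_nfc\n";
print "NFD of U+212B: $angstrom_nfd\n";
```

Because of this, two strings that differ only in mark order or in their use of a singleton become identical after normalization.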

Previous: ℞ 26: Custom Character Properties

Series Index: The Standard Preamble

Next: ℞ 28: Convert non-ASCII Unicode Numerics
