With the advent of the Unicode standard, this all seems to be a thing of the past. Unicode assigns a unique number (a code point) to every character and expands the number of characters that can be encoded by using multiple bytes per character. Its encoding forms include 16-bit (2-byte, UTF-16) and 32-bit (4-byte, UTF-32) units, which enables us to store and exchange over 95,000 characters, including the Latin, Greek and Cyrillic alphabets, the Hebrew and Arabic scripts, Japanese Hiragana, Chinese ideograms and the Kangxi radicals.
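As a small illustration (a Python sketch, not tied to any particular product), the snippet below prints the code point of one character from several of the scripts mentioned above, together with the number of bytes it occupies in the UTF-8, UTF-16 and UTF-32 encoding forms:

    # One character from each of several scripts: Latin, Greek, Hebrew,
    # Hiragana and a Chinese ideogram.
    for ch in ["A", "Ω", "א", "ひ", "漢"]:
        print(
            f"U+{ord(ch):04X} {ch}: "
            f"utf-8={len(ch.encode('utf-8'))} bytes, "
            f"utf-16={len(ch.encode('utf-16-le'))} bytes, "
            f"utf-32={len(ch.encode('utf-32-le'))} bytes"
        )

Every character gets exactly one code point, but the number of bytes needed to store it depends on the encoding form chosen.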
In complex international business environments, however, where customer data integration, master data management, compliance with laws and regulations, and operational excellence play an important role, Unicode is nothing more than a commodity. The real challenge in processing multilingual data lies in the application of robust transliteration and transcription, normalization, and intelligent comparison methods. Comparing characters from different writing systems is not only of historical interest. The discovery of the Rosetta Stone shows the importance of transliteration long before the term existed. The stone, created in 196 B.C. and carrying the same passage in three scripts, gave historians the key to two previously undecipherable Egyptian scripts (hieroglyphic and Demotic) by letting them measure the similarity against the known classical Greek text.
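To make the normalization step concrete, here is a minimal Python sketch using the standard unicodedata module; the name "Müller" is purely illustrative. The same visible string can be stored precomposed or as a base letter plus a combining accent, and only after normalization do the two compare as equal:

    import unicodedata

    # The same visible name, stored in two different ways:
    composed = "M\u00FCller"        # precomposed u-umlaut (U+00FC)
    decomposed = "Mu\u0308ller"     # 'u' followed by combining diaeresis (U+0308)

    print(composed == decomposed)   # False: byte-for-byte the strings differ
    print(unicodedata.normalize("NFC", composed)
          == unicodedata.normalize("NFC", decomposed))   # True after normalization

Without such normalization, two records for the very same customer may silently fail to match.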
Comparing data recorded in different writing systems poses many challenges. Naturally, a Unicode-enabled environment is necessary to represent the data, but that is not where the real difficulties lie. Assessing the degree of similarity between records in Latin script and records in non-Latin script involves a much higher degree of complexity, as the rough sketch below suggests.
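The following Python sketch is explicitly not a production matching engine: it transliterates a Cyrillic record with a hypothetical, heavily simplified character table, strips diacritics and case from both sides, and scores the result with the standard difflib module. Real systems would rely on standardized transliteration schemes and far more sophisticated comparison logic:

    import difflib
    import unicodedata

    # Hypothetical, heavily simplified transliteration table, for illustration
    # only; real systems use standardized schemes (e.g. ISO 9) or dedicated
    # transliteration libraries.
    CYRILLIC_TO_LATIN = str.maketrans({
        "М": "M", "ю": "iu", "л": "l", "е": "e", "р": "r",
    })

    def canonical(text):
        # Decompose, drop combining marks (accents) and fold case before comparing.
        decomposed = unicodedata.normalize("NFKD", text)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        return stripped.casefold()

    latin_record = "Müller"        # illustrative customer record in Latin script
    cyrillic_record = "Мюллер"     # illustrative record for the same name in Cyrillic
    transliterated = cyrillic_record.translate(CYRILLIC_TO_LATIN)

    score = difflib.SequenceMatcher(None, canonical(latin_record),
                                    canonical(transliterated)).ratio()
    print(f"similarity of {latin_record} and {cyrillic_record}: {score:.2f}")

Only after transliteration and normalization does a similarity score between the two records become meaningful at all; choosing the right transliteration scheme and the right matching threshold is where the real work begins.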
Interested? Get in touch with me and I'll send you a factsheet on the processing of non-Latin characters in an international business environment.