Overview of Unicode in SAP


Fundamentally, computers store letters and other characters by assigning a number to each one.
Unicode provides a unique number (a code point) for every character, no matter what the platform, the program, or the language is.
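As a minimal illustration, assuming Python 3 (the text above names no programming language), the built-in ord() and chr() functions expose the code point behind a character. The helper name code_point_label and the sample characters are chosen only for this sketch.

```python
# Each character maps to exactly one code point, regardless of platform or program.
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX notation for a single character."""
    return f"U+{ord(ch):04X}"

# Sample characters from different scripts (illustrative choice).
samples = ["A", "ä", "한", "€"]
labels = {ch: code_point_label(ch) for ch in samples}

for ch, label in labels.items():
    print(ch, label)

# chr() is the inverse of ord(): it turns a code point back into a character.
round_trip = chr(0x20AC)
```

The same number always identifies the same character, which is what makes it safe to exchange text between systems that agree on Unicode.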

Unicode is a universally encoded character set that can store information from any language. The Unicode standard primarily encodes scripts rather than languages. A script is a set of symbols that several languages have historically shared. In many cases one script serves to write dozens of languages (e.g. the Latin script). In other cases one script corresponds to a single language (e.g. Hangul). The standard also includes punctuation marks, diacritics, mathematical symbols, technical symbols, musical symbols, arrows, etc. In all, the Unicode Standard comprises more than 95,000 characters, ideograph sets and symbols (version 4.0).

The Unicode Standard is a character coding system designed to support the worldwide interchange, processing and display of written text of the diverse languages and technical disciplines of the modern world. In addition, it supports classical and historical texts of many written languages.

Is Unicode the last character set? It is an open character set, which means that it keeps growing as less frequently used characters are added. The standard assigns numbers from 0 to 0x10FFFF, which gives more than a million possible code points. About 5% of this space is in use, 5% is in preparation, about 13% is reserved for private use, and about 2% is reserved and must not be used. The remaining 75% is open for future use and is not expected to fill up: at last there is a character set with plenty of space.
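The size of that number range can be checked with a few lines of arithmetic. This is a small sketch in Python 3; the division into "planes" of 65,536 code points is standard Unicode terminology that the text above does not mention, and the variable names are illustrative.

```python
MAX_CODE_POINT = 0x10FFFF

# Code points run from 0 to 0x10FFFF inclusive, hence the + 1.
total = MAX_CODE_POINT + 1
plane_size = 0x10000              # 65,536 code points per plane
planes = total // plane_size

print(f"{total:,} code points in {planes} planes")

# Python's chr() accepts the whole range and rejects anything beyond it.
last = chr(MAX_CODE_POINT)
try:
    chr(MAX_CODE_POINT + 1)
    beyond_is_valid = True
except ValueError:
    beyond_is_valid = False
```

The total comes to 1,114,112, which is the "more than a million" figure quoted above.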

What Is the Need for Unicode?

Hundreds of encodings have been developed, each for a small group of languages or a special purpose. For many of these encodings and their names there is no single, authoritative source of precise definitions. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English, no single encoding was adequate for all the letters, punctuation, and technical symbols in common use. The different code pages are also incompatible, because these encoding systems conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially a server) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data runs the risk of corruption. Programs are written either to handle one encoding at a time and switch between them, or to convert between external and internal encodings.
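The conflicts described above are easy to reproduce. The sketch below uses Python 3 and its standard codecs; the specific code pages (Latin-1, ISO 8859-5, CP437, Shift JIS) are examples picked for illustration and are not named in the text.

```python
# Same number, two different characters: byte 0xE4 in two code pages.
raw = b"\xe4"
as_latin1 = raw.decode("latin-1")       # Western European code page
as_cyrillic = raw.decode("iso8859_5")   # Cyrillic code page
print(as_latin1, as_cyrillic)

# Same character, two different numbers: "ä" in two code pages.
a_umlaut_latin1 = "ä".encode("latin-1")
a_umlaut_cp437 = "ä".encode("cp437")
print(a_umlaut_latin1, a_umlaut_cp437)

# Corruption: bytes written in one encoding and read back in another.
original = "日本語"
stored = original.encode("shift_jis")
misread = stored.decode("latin-1")      # wrong code page: no error, wrong text
recovered = stored.decode("shift_jis")  # right code page: text is intact
print(repr(misread), recovered)
```

Note that the wrong decode raises no error. The bytes are valid in both code pages, so the corruption is silent, which is what makes mixed-encoding data risky.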

Pre-Unicode Solutions by SAP

Single Code Page System: A system using one standard code page, which can support a specific set of languages.

Blended Code Page System (Release 3.0D): Multi-byte blended code pages contain characters from several standard code pages. Blended code pages are not standard code pages, but SAP-customized code pages that were devised to support a larger number of language combinations in a single code page.
a) Ambiguous Blended Code Page System: Two characters can share the same code point.
b) Unambiguous Blended Code Page System: Each code point refers exactly to one character.
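The difference between the two variants can be expressed as a simple check on a code page table. The following Python 3 sketch uses made-up toy tables; the function name, the table layout and the assignments are hypothetical and do not reproduce any real SAP blended code page.

```python
from collections import defaultdict

def ambiguous_code_points(assignments):
    """Return the code points that are assigned to more than one character.

    `assignments` is an iterable of (code_point, character) pairs.
    """
    chars_by_code = defaultdict(set)
    for code, ch in assignments:
        chars_by_code[code].add(ch)
    return {code: chars for code, chars in chars_by_code.items() if len(chars) > 1}

# Hypothetical toy tables, for illustration only.
ambiguous_page = [(0x41, "A"), (0xE4, "ä"), (0xE4, "ф")]    # 0xE4 is shared
unambiguous_page = [(0x41, "A"), (0xE4, "ä"), (0xF4, "ф")]  # one character each

print(ambiguous_code_points(ambiguous_page))
print(ambiguous_code_points(unambiguous_page))
```

An empty result means every code point refers to exactly one character, which is the unambiguous case.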

MDMP System Configuration (Release 3.1I): MDMP stands for Multiple Display/Multiple Processing. An MDMP system uses more than one system code page on the application server. This allows languages to be used together in one system even though their characters are not in the same code page.

It is possible for a user to log on in German and then manipulate the character set and font settings in order to enter what appear to be Japanese characters. These characters will not be stored correctly in the database, and the data will be corrupt.
A user who wants to enter Japanese must log on in Japanese. To ensure that no data corruption occurs, the following restrictions must be observed:
a) Global data must contain only 7-bit ASCII characters, which are in all code pages.
b) Users may use only the characters of their logon language or 7-bit ASCII.
c) Batch processes must be assigned the correct user ID and language.
d) EBCDIC code pages are not supported.
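The character restrictions can be sketched as a validation step. This is a Python 3 illustration under stated assumptions: the mapping from logon language to a Python codec is hypothetical and only stands in for the SAP code pages of a real MDMP system, and both function names are invented for the example.

```python
# Hypothetical mapping from logon language to a Python codec that stands in
# for the code page; real MDMP systems use SAP code pages not modeled here.
LOGON_CODEC = {"DE": "latin-1", "JA": "shift_jis"}

def is_safe_input(text: str, logon_language: str) -> bool:
    """True if the text is 7-bit ASCII or fits the logon language's code page."""
    if text.isascii():
        return True
    try:
        text.encode(LOGON_CODEC[logon_language])
        return True
    except UnicodeEncodeError:
        return False

def is_safe_global_data(text: str) -> bool:
    """Global data must be 7-bit ASCII, which every code page contains."""
    return text.isascii()

print(is_safe_input("Müller", "DE"))    # German text, German logon
print(is_safe_input("日本語", "DE"))     # Japanese text, German logon
print(is_safe_input("日本語", "JA"))     # Japanese text, Japanese logon
print(is_safe_global_data("Müller"))    # non-ASCII in global data
```

A check like this only catches characters that cannot be represented at all. It cannot detect the case described above, where manipulated display settings make bytes from one code page look like characters from another.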


Why Did SAP Adopt Unicode?

Globalization = Internationalization + Localization. The Unicode Standard has already been adopted by industry leaders such as Apple, HP, IBM, JustSystem, Microsoft, Oracle, Sun, Sybase, Unisys and many others. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0 and WML. It is the official way to implement ISO/IEC 10646.

Unicode allows text data from different languages to be stored in one repository. It enables a single set of source code to process data in virtually all languages. It simplifies adding new language support to an e-business application, since character processing and storage remain unchanged. The result is lower implementation cost, faster time to market and better customer satisfaction.

Copyright: ERPDB : Do not Copy