NAME

NLS - Native Language Support Overview

DESCRIPTION

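Native Language Support (NLS) provides commands for a single worldwide operating system base. An internationalized system has no built-in assumptions or dependencies on language-specific or cultural-specific conventions such as character classification, collation order, monetary and numeric formatting, date and time formatting, and the text of messages.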

All information pertaining to cultural conventions and language is obtained at program run time.

"Internationalization" (often abbreviated "i18n") refers to the process by which system software is developed to support multiple cultural-specific and language-specific conventions. This is a generalization process, by which the system is freed from its dependence on English strings and other English-specific conventions. "Localization" (often abbreviated "l10n") refers to the process by which the user environment is customized to handle input and output in a way appropriate for a specific language and set of cultural conventions. This is a specialization process, by which generic methods already implemented in an internationalized system are used in specific ways. The formal description of the cultural conventions for a country, together with all associated translations into the native language, is called the "locale".

NetBSD provides extensive support to programmers and system developers to enable internationalized software to be developed. NetBSD also supplies a large variety of locales for system localization.

Localization of Information

All locale information is accessible to programs at run time so that data is processed and displayed correctly for specific cultural conventions and language.

A locale is divided into categories. A category is a group of language-specific and culture-specific conventions as outlined in the list above. NetBSD supports the following six standard categories; LC_MESSAGES is specified by POSIX and the other five by ISO C:

LC_COLLATE
string-collation order information
LC_CTYPE
character classification, case conversion, and other character attributes
LC_MESSAGES
the format for affirmative and negative responses
LC_MONETARY
rules and symbols for formatting monetary numeric information
LC_NUMERIC
rules and symbols for formatting nonmonetary numeric information
LC_TIME
rules and symbols for formatting time and date information
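
Each category can be set and queried independently of the others. The following is a minimal sketch, assuming the standard setlocale(3) and localeconv(3) interfaces; the function name decimal_point_for is hypothetical and is not part of NetBSD. It changes only the LC_NUMERIC category and returns the decimal-point string that the chosen locale defines.

```c
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: switch only the LC_NUMERIC category to the
 * named locale and return its decimal-point string.  Returns NULL
 * if the locale is not available.  All other categories keep their
 * current setting.
 */
const char *
decimal_point_for(const char *locale_name)
{
    if (setlocale(LC_NUMERIC, locale_name) == NULL)
        return NULL;
    return localeconv()->decimal_point;
}
```

Calling decimal_point_for("C") yields "." because the C locale always uses a period as its radix character. The result for any other locale depends on which locales are installed.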

Localization of the system is achieved by setting appropriate values in environment variables to identify which locale should be used. The environment variables have the same names as their respective locale categories. Additionally, the LANG, LC_ALL, and NLSPATH environment variables are used. The NLSPATH environment variable specifies a colon-separated list of directory names where the message catalog files of the NLS database are located. The LC_ALL and LANG environment variables also determine the current locale.

The value of each of these environment variables is a string of the form:

        language[_territory][.codeset][@modifier]

Valid values for the language field come from the ISO 639 standard, which defines two-character codes for many languages. Some common language codes are:


Language Name       Code    Language Family

ABKHAZIAN           AB      IBERO-CAUCASIAN
AFAN (OROMO)        OM      HAMITIC
AFAR                AA      HAMITIC
AFRIKAANS           AF      GERMANIC
ALBANIAN            SQ      INDO-EUROPEAN (OTHER)
AMHARIC             AM      SEMITIC
ARABIC              AR      SEMITIC
ARMENIAN            HY      INDO-EUROPEAN (OTHER)
ASSAMESE            AS      INDIAN
AYMARA              AY      AMERINDIAN
AZERBAIJANI         AZ      TURKIC/ALTAIC
BASHKIR             BA      TURKIC/ALTAIC
BASQUE              EU      BASQUE
BENGALI             BN      INDIAN
BHUTANI             DZ      ASIAN
BIHARI              BH      INDIAN
BISLAMA             BI
BRETON              BR      CELTIC
BULGARIAN           BG      SLAVIC
BURMESE             MY      ASIAN
BYELORUSSIAN        BE      SLAVIC
CAMBODIAN           KM      ASIAN
CATALAN             CA      ROMANCE
CHINESE             ZH      ASIAN
CORSICAN            CO      ROMANCE
CROATIAN            HR      SLAVIC
CZECH               CS      SLAVIC
DANISH              DA      GERMANIC
DUTCH               NL      GERMANIC
ENGLISH             EN      GERMANIC
ESPERANTO           EO      INTERNATIONAL AUX.
ESTONIAN            ET      FINNO-UGRIC
FAROESE             FO      GERMANIC
FIJI                FJ      OCEANIC/INDONESIAN
FINNISH             FI      FINNO-UGRIC
FRENCH              FR      ROMANCE
FRISIAN             FY      GERMANIC
GALICIAN            GL      ROMANCE
GEORGIAN            KA      IBERO-CAUCASIAN
GERMAN              DE      GERMANIC
GREEK               EL      LATIN/GREEK
GREENLANDIC         KL      ESKIMO
GUARANI             GN      AMERINDIAN
GUJARATI            GU      INDIAN
HAUSA               HA      NEGRO-AFRICAN
HEBREW              HE      SEMITIC
HINDI               HI      INDIAN
HUNGARIAN           HU      FINNO-UGRIC
ICELANDIC           IS      GERMANIC
INDONESIAN          ID      OCEANIC/INDONESIAN
INTERLINGUA         IA      INTERNATIONAL AUX.
INTERLINGUE         IE      INTERNATIONAL AUX.
INUKTITUT           IU
INUPIAK             IK      ESKIMO
IRISH               GA      CELTIC
ITALIAN             IT      ROMANCE
JAPANESE            JA      ASIAN
JAVANESE            JV      OCEANIC/INDONESIAN
KANNADA             KN      DRAVIDIAN
KASHMIRI            KS      INDIAN
KAZAKH              KK      TURKIC/ALTAIC
KINYARWANDA         RW      NEGRO-AFRICAN
KIRGHIZ             KY      TURKIC/ALTAIC
KURUNDI             RN      NEGRO-AFRICAN
KOREAN              KO      ASIAN
KURDISH             KU      IRANIAN
LAOTHIAN            LO      ASIAN
LATIN               LA      LATIN/GREEK
LATVIAN             LV      BALTIC
LINGALA             LN      NEGRO-AFRICAN
LITHUANIAN          LT      BALTIC
MACEDONIAN          MK      SLAVIC
MALAGASY            MG      OCEANIC/INDONESIAN
MALAY               MS      OCEANIC/INDONESIAN
MALAYALAM           ML      DRAVIDIAN
MALTESE             MT      SEMITIC
MAORI               MI      OCEANIC/INDONESIAN
MARATHI             MR      INDIAN
MOLDAVIAN           MO      ROMANCE
MONGOLIAN           MN
NAURU               NA
NEPALI              NE      INDIAN
NORWEGIAN           NO      GERMANIC
OCCITAN             OC      ROMANCE
ORIYA               OR      INDIAN
PASHTO              PS      IRANIAN
PERSIAN (farsi)     FA      IRANIAN
POLISH              PL      SLAVIC
PORTUGUESE          PT      ROMANCE
PUNJABI             PA      INDIAN
QUECHUA             QU      AMERINDIAN
RHAETO-ROMANCE      RM      ROMANCE
ROMANIAN            RO      ROMANCE
RUSSIAN             RU      SLAVIC
SAMOAN              SM      OCEANIC/INDONESIAN
SANGHO              SG      NEGRO-AFRICAN
SANSKRIT            SA      INDIAN
SCOTS GAELIC        GD      CELTIC
SERBIAN             SR      SLAVIC
SERBO-CROATIAN      SH      SLAVIC
SESOTHO             ST      NEGRO-AFRICAN
SETSWANA            TN      NEGRO-AFRICAN
SHONA               SN      NEGRO-AFRICAN
SINDHI              SD      INDIAN
SINGHALESE          SI      INDIAN
SISWATI             SS      NEGRO-AFRICAN
SLOVAK              SK      SLAVIC
SLOVENIAN           SL      SLAVIC
SOMALI              SO      HAMITIC
SPANISH             ES      ROMANCE
SUNDANESE           SU      OCEANIC/INDONESIAN
SWAHILI             SW      NEGRO-AFRICAN
SWEDISH             SV      GERMANIC
TAGALOG             TL      OCEANIC/INDONESIAN
TAJIK               TG      IRANIAN
TAMIL               TA      DRAVIDIAN
TATAR               TT      TURKIC/ALTAIC
TELUGU              TE      DRAVIDIAN
THAI                TH      ASIAN
TIBETAN             BO      ASIAN
TIGRINYA            TI      SEMITIC
TONGA               TO      OCEANIC/INDONESIAN
TSONGA              TS      NEGRO-AFRICAN
TURKISH             TR      TURKIC/ALTAIC
TURKMEN             TK      TURKIC/ALTAIC
TWI                 TW      NEGRO-AFRICAN
UIGUR               UG
UKRAINIAN           UK      SLAVIC
URDU                UR      INDIAN
UZBEK               UZ      TURKIC/ALTAIC
VIETNAMESE          VI      ASIAN
VOLAPUK             VO      INTERNATIONAL AUX.
WELSH               CY      CELTIC
WOLOF               WO      NEGRO-AFRICAN
XHOSA               XH      NEGRO-AFRICAN
YIDDISH             YI      GERMANIC
YORUBA              YO      NEGRO-AFRICAN
ZHUANG              ZA
ZULU                ZU      NEGRO-AFRICAN

For example, the locale for the Danish language spoken in Denmark using the ISO 8859-1 character set is da_DK.ISO8859-1. The da stands for the Danish language and the DK stands for Denmark. The short form of da_DK is sufficient to indicate this locale.
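
To illustrate the language[_territory][.codeset][@modifier] format, the following sketch splits a locale name into its four fields. The struct locale_parts type and the parse_locale_name function are hypothetical names introduced for this example only; the C library does its own parsing internally and does not expose such an interface.

```c
#include <string.h>

/* Hypothetical container for the four fields of a locale name. */
struct locale_parts {
    char language[16];
    char territory[16];
    char codeset[32];
    char modifier[32];
};

/* Copy at most dstsize-1 bytes of s[0..len) into dst and terminate it. */
static void
copy_field(char *dst, size_t dstsize, const char *s, size_t len)
{
    if (len >= dstsize)
        len = dstsize - 1;
    memcpy(dst, s, len);
    dst[len] = '\0';
}

/*
 * Split name into language, territory, codeset, and modifier.
 * Optional fields that are absent are left as empty strings.
 */
void
parse_locale_name(const char *name, struct locale_parts *p)
{
    size_t n;

    memset(p, 0, sizeof(*p));

    /* The language runs up to the first '_', '.', or '@'. */
    n = strcspn(name, "_.@");
    copy_field(p->language, sizeof(p->language), name, n);
    name += n;

    if (*name == '_') {
        name++;
        n = strcspn(name, ".@");
        copy_field(p->territory, sizeof(p->territory), name, n);
        name += n;
    }
    if (*name == '.') {
        name++;
        n = strcspn(name, "@");
        copy_field(p->codeset, sizeof(p->codeset), name, n);
        name += n;
    }
    if (*name == '@')
        copy_field(p->modifier, sizeof(p->modifier), name + 1,
            strlen(name + 1));
}
```

For da_DK.ISO8859-1 this yields the language da, the territory DK, the codeset ISO8859-1, and an empty modifier.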

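The environment variable settings are queried in the following order of priority. If LC_ALL is set, it selects the locale for all categories. Otherwise, each category uses the locale named by its own environment variable. If that variable is not set, the value of LANG is used. If none of these variables is set, the category defaults to the C locale.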
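
As a rough illustration of this priority order, the sketch below follows the precedence that POSIX defines: LC_ALL first, then the category's own variable, then LANG, and finally the C locale. The function name effective_locale is hypothetical. In practice, setlocale(3) performs this lookup itself when it is passed an empty locale string, so programs rarely need to do it by hand.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: return the locale name that the environment
 * selects for one category.  category_var is the name of that
 * category's environment variable, for example "LC_TIME".
 * A variable that is set to the empty string is treated as unset.
 */
const char *
effective_locale(const char *category_var)
{
    const char *v;

    if ((v = getenv("LC_ALL")) != NULL && *v != '\0')
        return v;
    if ((v = getenv(category_var)) != NULL && *v != '\0')
        return v;
    if ((v = getenv("LANG")) != NULL && *v != '\0')
        return v;
    return "C";
}
```

With LANG=en_US.UTF-8 and LC_TIME=da_DK in the environment, effective_locale("LC_TIME") returns da_DK while effective_locale("LC_NUMERIC") returns en_US.UTF-8.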

Character Sets

A character is any symbol used for the organization, control, or representation of data. A group of such symbols used to describe a particular language make up a character set. It is the encoding values in a character set that provide the interface between the system and its input and output devices.

The following character sets are supported in NetBSD:

ASCII
The American Standard Code for Information Interchange (ASCII) standard specifies 128 Roman characters and control codes, encoded in a 7-bit character encoding scheme.

ISO 8859 family
Industry-standard character sets specified by the ISO/IEC 8859 standard. The standard is divided into 15 numbered parts, each covering a group of scripts with broad similarities. Examples include Western European, Central European, Arabic, Cyrillic, Hebrew, Greek, and Turkish. The character sets use an 8-bit character encoding scheme that is compatible with the ASCII character set.

Unicode
The Unicode character set is the full set of known abstract characters of all real-world scripts. It can be used in environments where multiple scripts must be processed simultaneously. The first 128 code points of Unicode match ASCII, and the first 256 match ISO 8859-1 (Western European). Several character encoding schemes are available for Unicode, including UTF-8, UTF-16, and UTF-32, all of which are multi-byte encodings. UTF-8 is a variable-width encoding built from 8-bit units and is compatible with ASCII. UTF-16 is a variable-width encoding built from 16-bit units. UTF-32 is a fixed-width encoding that uses one 32-bit unit per character.
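
The variable-width nature of UTF-8 can be seen in a small encoder. This is a sketch of the standard UTF-8 bit layout, not code taken from NetBSD; the function name utf8_encode is hypothetical. Code points below 0x80 are emitted as a single byte identical to ASCII, which is what makes the encoding ASCII-compatible.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode one Unicode code point as UTF-8.
 * out must have room for at least 4 bytes.  Returns the number of
 * bytes written, or 0 if cp is a surrogate or lies above U+10FFFF.
 */
size_t
utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        /* 0xxxxxxx: same byte value as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;   /* surrogates are not valid characters */
        /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+0041 (A) encodes to the single byte 0x41, U+00E6 (the Danish letter ae) to the two bytes 0xC3 0xA6, and U+20AC (the euro sign) to the three bytes 0xE2 0x82 0xAC.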

Font Sets

A font set contains the glyphs to be displayed on the screen for a corresponding character in a character set. A display must support a suitable font to display a character set. If suitable fonts are available to the X server, then X clients can include support for different character sets. xterm(1) includes support for Unicode with UTF-8 encoding. xfd(1) is useful for displaying all the characters in an X font.

The NetBSD wscons(4) console provides support for loading fonts using the wsfontload(8) utility. Currently, only fonts for the ISO8859-1 family of character sets are supported.

Internationalization for Programmers

To facilitate translations of messages into various languages and to make the translated messages available to the program based on a user's locale, it is necessary to keep messages separate from the programs and provide them in the form of message catalogs that a program can access at run time.

Access to locale information is provided through the setlocale(3) and nl_langinfo(3) interfaces. See their respective man pages for further information.
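
A minimal sketch of these two interfaces follows. The function name show_locale_info is hypothetical. Passing an empty string as the locale name asks setlocale(3) to take the locale from the environment variables described earlier. The nl_langinfo(3) items used here (CODESET, RADIXCHAR, and DAY_1) are defined by POSIX in <langinfo.h>.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/*
 * Hypothetical helper: select a locale for all categories and print
 * a few of its properties.  Pass "" to use the locale named by the
 * environment.  Returns 0 on success, or -1 if the requested locale
 * is not available.
 */
int
show_locale_info(const char *locale_name)
{
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;

    /* A NULL locale argument queries the setting without changing it. */
    printf("LC_CTYPE locale: %s\n", setlocale(LC_CTYPE, NULL));
    printf("codeset:         %s\n", nl_langinfo(CODESET));
    printf("radix character: %s\n", nl_langinfo(RADIXCHAR));
    printf("first day name:  %s\n", nl_langinfo(DAY_1));
    return 0;
}
```

A program would typically call show_locale_info("") once at startup. Until setlocale(3) is called, every program runs in the C locale regardless of the environment.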

Message source files containing application messages are created by the programmer and converted to message catalogs. These catalogs are used by the application to retrieve and display messages, as needed.

NetBSD supports two message catalog interfaces: the X/Open catgets(3) interface and the Uniforum gettext(3) interface. The catgets(3) interface has the advantage that it belongs to a standard which is well supported. Unfortunately, the interface is complicated to use and maintenance of the catalogs is difficult. The implementation also does not support different character sets. The gettext(3) interface has not been standardized yet, but it is supported by an increasing number of systems. It also provides many additional tools which make programming and catalog maintenance much easier.
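
The following sketch shows the general shape of a catgets(3) lookup, assuming a catalog that was built with gencat(1). The function name get_greeting, the set and message numbers, and the default text are all hypothetical. The default string passed to catgets(3) is returned when the catalog or the message cannot be found, so the program still produces output when no translation is installed.

```c
#include <nl_types.h>
#include <stdio.h>

#define GREETING_SET 1  /* hypothetical set number in the catalog */
#define GREETING_MSG 1  /* hypothetical message number in that set */

/*
 * Hypothetical helper: copy the greeting from the named message
 * catalog into buf.  Returns 1 if the catalog was opened, or 0 if
 * the built-in default text was used because it could not be opened.
 */
int
get_greeting(const char *catalog, char *buf, size_t bufsize)
{
    const char *fallback = "Hello, world";
    const char *msg = fallback;
    nl_catd catd;
    int opened;

    /* NL_CAT_LOCALE selects the catalog according to LC_MESSAGES. */
    catd = catopen(catalog, NL_CAT_LOCALE);
    opened = (catd != (nl_catd)-1);
    if (opened)
        msg = catgets(catd, GREETING_SET, GREETING_MSG, fallback);

    /* Copy before closing: the text may live inside the catalog. */
    snprintf(buf, bufsize, "%s", msg);

    if (opened)
        catclose(catd);
    return opened;
}
```

When the catalog name contains no slash, catopen(3) searches the directories given by NLSPATH.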

Support for Multi-byte Encodings

Some character sets with multi-byte encodings may be difficult to decode, or may contain state (i.e., adjacent characters are dependent). ISO C specifies a set of functions using 'wide characters' which can handle multi-byte encodings properly. The behaviour of these functions is affected by the LC_CTYPE category of the current locale.

A wide character is specified in ISO C as being a fixed number of bits wide and is stateless. There are two types for wide characters: wchar_t and wint_t. wchar_t can contain one wide character and is used in the same way that the 'char' type is used for one character. wint_t can contain one wide character or WEOF (wide EOF).

There are functions that operate on wchar_t, and substitute for functions operating on 'char'. See wmemchr(3) and towlower(3) for details. There are some additional functions that operate on wchar_t. See wctype(3) and wctrans(3) for details.
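
As a small illustration of converting multi-byte input to wide characters, the sketch below counts characters rather than bytes by decoding one wide character at a time with the ISO C function mbrtowc(3). The function name mb_char_count is hypothetical. Decoding follows the LC_CTYPE category of the current locale, so the same byte string can give different counts under different locales.

```c
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: count the characters in the multi-byte
 * string s according to the current LC_CTYPE locale.  Returns
 * (size_t)-1 if s contains an invalid or incomplete sequence.
 */
size_t
mb_char_count(const char *s)
{
    mbstate_t st;
    size_t len = strlen(s);
    size_t count = 0;
    size_t n;
    wchar_t wc;

    /* Start in the initial conversion state. */
    memset(&st, 0, sizeof(st));

    while (len > 0) {
        n = mbrtowc(&wc, s, len, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;
        if (n == 0)
            break;  /* decoded a null wide character */
        s += n;
        len -= n;
        count++;
    }
    return count;
}
```

In the C locale each byte of an ASCII string is one character, so the count equals the result of strlen(3). After setlocale(3) selects a UTF-8 locale, a character such as the euro sign occupies three bytes but is counted once.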

Wide characters should be used for all I/O processing which may rely on locale-specific strings. The two primary issues requiring special use of wide characters are:

SEE ALSO

gencat(1), xfd(1), xterm(1), catgets(3), gettext(3), nl_langinfo(3), setlocale(3), wsfontload(8)

BUGS

This man page is incomplete.