Wednesday, June 23, 2010

BUILDING A MULTILINGUAL WEBSITE FOR INDIAN LANGUAGES


Gita Supersite is a multilingual website that supports ten Indian scripts: Devanagari, Oriya, Bengali, Punjabi (Gurmukhi), Gujarati, Telugu, Tamil, Assamese, Kannada and Malayalam. All ten are derived from the Brahmi script and share a common phonetic structure. The site uses this property both to store text and to transliterate the original Devanagari text into the other scripts. It hosts texts related to the Srimad BhagwadGita and other well-known Indian heritage works such as the Ramayana, the Brahmasutra and the Upanishads.

India officially recognized 14 languages when its Constitution was adopted; the list has since grown to 22. Apart from the Perso-Arabic scripts, the 10 scripts used for Indian languages all evolved from the ancient Brahmi script and share a common phonetic structure, which makes a common character set possible. For this reason, the Indic-script content of the website is stored in ISCII format.
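To make the idea of a shared phonetic layout concrete, here is a minimal sketch of script-to-script transliteration. It is not the site's own code. It assumes Unicode text rather than ISCII (only because Python ships with Unicode support), and it relies on the fact that the Unicode blocks for these scripts are laid out in parallel, 128 code points each. The names SCRIPT_BASE and transliterate are made up for this illustration.

```python
import unicodedata

# Start of each script's 128-code-point Unicode block.
# (Assamese is written with the Bengali block.)
SCRIPT_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}

# Danda and double danda live in the Devanagari block but are shared
# punctuation, so they are left untouched.
SHARED = {0x0964, 0x0965}


def transliterate(text: str, src: str, dst: str) -> str:
    """Shift each character from the src block to the same slot in dst."""
    src_base, dst_base = SCRIPT_BASE[src], SCRIPT_BASE[dst]
    out = []
    for ch in text:
        cp = ord(ch)
        if src_base <= cp < src_base + 0x80 and cp not in SHARED:
            candidate = chr(cp - src_base + dst_base)
            # Not every slot is assigned in every script; keep the
            # original character when the target slot is empty.
            if unicodedata.name(candidate, None) is not None:
                ch = candidate
        out.append(ch)
    return "".join(out)


# "Gita" written in Devanagari, converted to Telugu script.
print(transliterate("\u0917\u0940\u0924\u093e", "devanagari", "telugu"))
```

This plain offset shift only approximates real transliteration: some sounds exist in one script and not in another (Tamil in particular has fewer consonants), so a production converter would need per-script fallback rules.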

Indian Script Code for Information Interchange (ISCII) is a character-based encoding that defines this common phonetic character set. The Bureau of Indian Standards (BIS) adopted it in 1991 as the standard for information interchange in Indian languages. It provides a single representation for all the Indian scripts, with codes assigned in the upper ASCII region (160 - 255) for the aksharas of the language. The encoding scheme also assigns codes for the vowel signs called matras, and includes special characters (such as visarg and halant) that specify how a consonant in a syllable should be rendered. A single syllable can take anywhere from one byte to as many as 10 bytes, making it a multi-byte representation.
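The byte layout described above can be illustrated with a short sketch. It splits a byte string into plain ASCII runs and ISCII runs (bytes 160 - 255), and shows one conjunct syllable occupying several bytes. The function names are invented for this example, and the ISCII byte values in STRI are my reading of the ISCII-91 Devanagari table, so they should be checked against the standard before being relied on.

```python
from itertools import groupby


def byte_kind(b: int) -> str:
    """Classify one byte by the region of the code table it falls in."""
    if b < 0x80:
        return "ascii"   # lower half: ordinary ASCII
    if b >= 0xA0:
        return "iscii"   # 160-255: Indic characters
    return "other"       # 128-159: not used for aksharas


def segment_runs(data: bytes) -> list:
    """Group consecutive bytes of the same kind into (kind, bytes) runs."""
    return [(kind, bytes(group)) for kind, group in groupby(data, key=byte_kind)]


# The syllable "stri": sa + halant + ta + halant + ra + matra ii.
# Assumed ISCII values; one syllable, six bytes.
STRI = bytes([0xD7, 0xE8, 0xC2, 0xE8, 0xCF, 0xDC])

for kind, run in segment_runs(b"Gita " + STRI):
    print(kind, len(run), run.hex())
```

Because ASCII occupies the lower half of the table unchanged, English and Indic text can be mixed in the same byte stream without any escape mechanism.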

The most important point about ISCII is that its codes have nothing to do with fonts. A given ISCII text may be displayed using many different fonts for the same script, by mapping the ISCII codes to the glyphs in a matching font for that script. In the current version of the site, rendering in the web browser relies on technologies from C-DAC (Centre for Development of Advanced Computing): GIST, ISFOC and TTF fonts.


The data needs to be represented in a universal format that is supported by current Web technologies. The Unicode Standard is a universal character encoding scheme for representing characters as integers. It defines a consistent way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation for developing global software.
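As a small illustration of "characters as integers", the sketch below prints the code point, the official character name and the UTF-8 bytes of a Devanagari letter. The helper name describe is made up for this example; the calls it uses (ord, unicodedata.name, str.encode) are standard Python.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (code point label, Unicode name, UTF-8 bytes) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8"))


# Devanagari letter KA
print(describe("\u0915"))
```

The code point is the abstract integer assigned by the standard; UTF-8 is just one way of serializing it, and it is the form most web pages use.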

The chosen format should also be displayable with fonts that can be installed on every platform. Unicode text can be rendered with OpenType (OTF) fonts, which are designed to be cross-platform and can be used on Mac OS, Windows and UNIX systems.

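ISCII maps cleanly onto Unicode. The Unicode code charts for the Indian scripts were modeled on the ISCII layout, so the characters appear in nearly the same order, although the code values themselves differ and a few ISCII control codes have no direct counterpart. Code that maps characters from ISCII to Unicode can therefore be written, which makes it possible to reuse the data already created in ISCII.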
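A minimal sketch of such a mapping is shown here. It is only an illustration, not the converter used by the site. The table covers a handful of Devanagari characters, and the ISCII byte values in it are my reading of the ISCII-91 table, so treat them as assumptions to verify against the standard. The names ISCII_TO_UNICODE and iscii_to_unicode are invented for this example.

```python
# Partial ISCII (Devanagari) -> Unicode table. Illustrative subset only;
# the byte values are assumptions to be checked against IS 13194:1991.
ISCII_TO_UNICODE = {
    0xA4: "\u0905",  # letter A
    0xB3: "\u0915",  # letter KA
    0xCC: "\u092E",  # letter MA
    0xCF: "\u0930",  # letter RA
    0xDA: "\u093E",  # vowel sign AA (matra)
    0xE8: "\u094D",  # halant (virama)
}


def iscii_to_unicode(data: bytes) -> str:
    """Convert ISCII bytes to a Unicode string using the table above."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))                # ASCII passes through unchanged
        elif b in ISCII_TO_UNICODE:
            out.append(ISCII_TO_UNICODE[b])
        else:
            out.append("\uFFFD")              # not in this partial table
    return "".join(out)


# ra + aa + ma, i.e. the word "Rama" in Devanagari
print(iscii_to_unicode(bytes([0xCF, 0xDA, 0xCC])))
```

A lookup table is used instead of a fixed offset because the two layouts are similar but not identical. A full converter would also have to handle the ISCII codes that switch scripts within a text, and sequences whose meaning depends on neighbouring bytes, which this sketch ignores.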
