Wednesday, November 08, 2006

Character Set Encoding Detection -- Part 1

Character set encoding detection becomes necessary when you start processing non-English text.

I started working on south-east Asian languages a year ago and had to port some code. This particular code worked fine for English text and never gave any problems for some non-English European languages.

But when I started working on SEA languages, I knew beforehand that encoding issues would make our life hell, and they really did.

Most software supported only English in the old days, and that led to the common myth of 1 byte = 1 char.
It takes time to digest things like characters bigger than one byte and character streams with characters of variable length.
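
To make the variable-length idea concrete, here is a small Python sketch. It assumes UTF-8 as the encoding, and the helper name utf8_lengths and the sample characters are just illustrations I picked. It shows that three characters in one string can take one, two and three bytes respectively.

```python
def utf8_lengths(text):
    """Return the number of UTF-8 bytes used by each character of text."""
    return [len(ch.encode("utf-8")) for ch in text]


# Hypothetical sample: Latin 'a', accented 'e' (U+00E9), kanji U+65E5.
sample = "a\u00e9\u65e5"

for ch, n in zip(sample, utf8_lengths(sample)):
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")

# Character count and byte count are different things.
print(len(sample), "characters,", len(sample.encode("utf-8")), "bytes")
```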

Then comes the issue of which byte sequence is which character. Different countries follow different encodings; if one just gets some text as a stream of bytes and has no idea about the encoding, there is only a small chance that the text will be processed correctly.
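
A rough Python illustration of that problem: the same bytes give completely different text depending on which encoding the reader assumes. The helper decode_as and the sample string are my own choices for the example, not part of any library.

```python
def decode_as(raw, encoding):
    """Decode raw bytes under the assumption that they are in `encoding`."""
    return raw.decode(encoding)


# Hypothetical sample: three Japanese characters, stored as EUC-JP bytes.
original = "\u65e5\u672c\u8a9e"
raw = original.encode("euc_jp")

right = decode_as(raw, "euc_jp")    # the reader knows the encoding
wrong = decode_as(raw, "latin-1")   # the reader assumes 1 byte = 1 char

print(right == original)
print(wrong == original)
print(len(original), "characters became", len(wrong), "characters")
```

Note that the wrong guess does not raise an error here, because ISO-8859-1 accepts every byte value. The program just silently works on garbage.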

Character set and character encoding are two concepts that are generally used interchangeably, but they mean different things.

Character Set: Just a collection of characters.
eg. Kannada characters, Devanagari characters, Japanese characters, English alphabets
Character Encoding: A mapping from each character of a character set to a numerical value (and so to a byte sequence).
eg. UTF-8, UTF-16, EUC-JP, EUC-KR, ISO-8859-1 to ISO-8859-7, ISO-2022-JP
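
The difference is easy to see in Python. The sketch below takes one character and encodes it with a few of the encodings listed above; the function name encode_all and the chosen character are assumptions made for the example. The character stays the same, only the bytes change.

```python
def encode_all(ch, encodings=("utf-8", "utf-16-be", "euc_jp", "iso2022_jp")):
    """Map each encoding name to the bytes it produces for the character ch."""
    return {name: ch.encode(name) for name in encodings}


ch = "\u65e5"  # a single kanji, chosen only as an example
print(f"code point: U+{ord(ch):04X}")

for name, raw in encode_all(ch).items():
    print(f"{name:10s} {len(raw)} byte(s): {raw.hex()}")
```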

European languages have few enough characters to fit in a single byte, so most European languages use one of the ISO-8859 family of character encodings (such as ISO-8859-1).
But the languages I was dealing with, commonly referred to as CJKV (Chinese, Japanese, Korean and Vietnamese), have far more characters than a single byte can hold.
Best Reference for CJKV

There are also attempts to standardize character sets and encodings:
1. Unicode
2. Wikipedia Unicode link

It's very clear that different publishers make their own choices of character encoding. Actually, most of them are not even aware of it; they just publish in whatever encoding their tools use by default.

Now when somebody browses or crawls your page, they need to know the encoding of the text you sent in order to read it or process it programmatically. Here comes the problem of unknown encoding.
What should a browser or your program do when it faces this issue? Most browsers try to detect the encoding and use it. Detecting the encoding is not an exact science, but it works well for most pages.

In simple words, this detection is done by checking for the occurrence of certain patterns in the byte stream.
The detector which Mozilla provides works better if you set it to detect only the encodings common to your language.
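
To give a feel for the idea, here is a very naive Python sketch. It is not Mozilla's algorithm; it only checks for a byte order mark and then tries a strict decode with each candidate encoding. The function name guess_encoding and the default candidate list are assumptions made for this example.

```python
import codecs


def guess_encoding(data, candidates=("euc_jp", "euc_kr")):
    """Naive guess: BOM first, then the first candidate that decodes cleanly."""
    # A byte order mark is the most reliable pattern when it is present.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Otherwise accept the first encoding whose byte patterns are all legal.
    for name in ("ascii", "utf-8") + tuple(candidates):
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_encoding(b"plain English text"))
print(guess_encoding("\u65e5\u672c\u8a9e".encode("utf-8")))
print(guess_encoding("\u65e5\u672c\u8a9e".encode("euc_jp")))
print(guess_encoding("\ud55c\uad6d\uc5b4".encode("euc_kr"), candidates=("euc_kr",)))
```

A clean decode only proves that the bytes are legal in that encoding, not that the guess is right. Byte sequences that are legal in one EUC encoding are often legal in another, so the order of the candidates decides the answer. That is why restricting the candidates to the encodings of your language helps, and why a real detector also looks at how often particular characters occur.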

-----------
References
1. http://www.mozilla.org/projects/intl/chardet.html
2. http://sourceforge.net/projects/icu/
