Skip to main content

The Absolute Minimum Every Software Developer Absolutely, Pos...

Popularity Report

Total Popularity Score: 0

Loading...
Loading...
Loading...
Loading...
Loading...
Loading...

Rank

Bookmark History

Saved by 133 people (-38 private), first by anonymouse user on 2006-03-02


Public Comment

on 2006-07-25 by billso

Joel on Software

on 2006-09-07 by jcwinnie

Joel on Software

on 2007-06-03 by ycc2106

Nice story of encoding

on 2008-10-23 by yogsototh

At last, I've understood the difference between Unicode and Encoding.

Public Sticky notes

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Highlighted by jedatu

I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff.

Highlighted by millette

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Highlighted by babybjorn

In this article I'll fill you in on exactly what every working programmer should know. All that stuff about "plain text = ascii = characters are 8 bits" is not only wrong, it's hopelessly wrong, and if you're still programming that way, you're not much better than a medical doctor who doesn't believe in germs. Please do not write another line of code until you finish reading this article.

Highlighted by cvdlinden

Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be? Did you ever get an email from your friends in Bulgaria with the subject line "???? ?????? ??? ????"? I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they "couldn't do anything about it." Like many programmers, he just wished it would all blow over somehow.

Highlighted by aetles

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Highlighted by hunter107

Please do not write another line of code until you finish reading this article.

Highlighted by fullness

I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified.

Highlighted by fullness

All that stuff about "plain text = ascii = characters are 8 bits" is not only wrong, it's hopelessly wrong

Highlighted by cmin

The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII

Highlighted by fullness

This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare

Highlighted by fullness

Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255.

Highlighted by fullness

Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages. So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided.

Highlighted by fullness

Meanwhile, in Asia, even more crazy things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits.

Highlighted by fullness

But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.

Highlighted by fullness

Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.

Highlighted by jangondol

But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work.

Highlighted by cmin

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

Highlighted by fullness

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story.

Highlighted by fullness

Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct.

Highlighted by cmin

Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639.  This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal.

Highlighted by fullness

There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway.

Highlighted by fullness

numbers in two bytes

Highlighted by fullness

two bytes

Highlighted by fullness

Couldn't it also be

Highlighted by fullness

early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.

Highlighted by jangondol

couldn't bear the idea of doubling the amount of storage it took for strings, and anyway, there were already all these doggone documents out there using various ANSI and DBCS character sets

Highlighted by fullness

most people decided to ignore Unicode for several years and in the meantime things got worse.

Highlighted by fullness

Not every Unicode string in the wild has a byte order mark at the beginning.

Highlighted by cmin

UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

Highlighted by jangondol

This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops.

Highlighted by jangondol

English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops.

Highlighted by fullness

UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes.

Highlighted by cmin

So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it's high-endian UCS-2 or low-endian UCS-2. And there's the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.

There are actually a bunch of other ways of encoding Unicode. There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you it can still squeeze through unscathed. There's UCS-4

Highlighted by jangondol

UTF-16

Highlighted by fullness

And in fact now that you're thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box. Which did you get? -> �

Highlighted by fullness

The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits)

Highlighted by cmin

There's something called UTF-7

Highlighted by cmin

There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks.

Highlighted by fullness

any code point

Highlighted by fullness

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII.

There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

Highlighted by fullness

those unicode code points can be encoded in any old-school encoding scheme, too!

Highlighted by cmin

It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII.

There Ain't No Such Thing As Plain Text.

Highlighted by jangondol

There are over a hundred encodings and above code point 127, all bets are off.

Highlighted by jangondol

For an email message, you are expected to have a string in the header of the form

Content-Type: text/plain; charset="UTF-8"

For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself -- not in the HTML itself, but as one of the response headers that are sent before the HTML page. 

Highlighted by jangondol

The web server itself wouldn't really know what encoding each file was written in, so it couldn't send the Content-Type header.

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy...

Highlighted by jangondol

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy... how can you read the HTML file until you know what encoding it's in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it's going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.

Highlighted by fullness

What do web browsers do if they don't find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear

Highlighted by jangondol

Postel's Law about being "conservative in what you emit and liberal in what you accept" is quite frankly not a good engineering principle

Highlighted by jangondol

Internet Explorer actually does something quite interesting: it tries to guess

Highlighted by cmin

we decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type

Highlighted by jangondol

When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years.

Highlighted by jangondol

Postel's Law about being "conservative in what you emit and liberal in what you accept" is quite frankly not a good engineering principle

Highlighted by mortenpj