The Absolute Minimum Every Software Developer Absolutely, Pos...
Bookmark History
Saved by 133 people (-38 private), first by an anonymous user on 2006-03-02
- Campala on 2009-11-02 - Tags no_tag
- R_baxter on 2009-10-25 - Tags db_design_intl
- Andy-nihon on 2009-10-20 - Tags dev , i18n
- Sanlaville on 2009-10-15 - Tags unicode , encoding , utf8 , programming
- Youssefm on 2009-10-05 - Tags unicode , programming , encoding , i18n , development
Public Sticky notes
So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it's high-endian UCS-2 or low-endian UCS-2. And there's the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.
There are actually a bunch of other ways of encoding Unicode. There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you, it can still squeeze through unscathed. There's UCS-4
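To make the difference between these encodings concrete, here is a small sketch using Python's built-in codecs. The sample string is an arbitrary choice for illustration, not something from the article; the point is only that the same characters come out as different bytes under each encoding.

```python
# One short string, encoded several ways. "h\u00e9llo" is "hello" with an
# e-acute (U+00E9) in place of the "e"; the sample text is an assumption
# chosen for illustration.
text = "h\u00e9llo"

utf8 = text.encode("utf-8")          # ASCII letters stay one byte each
utf16_be = text.encode("utf-16-be")  # two bytes per character, high byte first
utf16_le = text.encode("utf-16-le")  # two bytes per character, low byte first
utf7 = text.encode("utf-7")          # every byte has the high bit clear

for name, data in [("UTF-8", utf8), ("UTF-16 BE", utf16_be),
                   ("UTF-16 LE", utf16_le), ("UTF-7", utf7)]:
    print(f"{name:10} {len(data):2} bytes  {data!r}")

# Pure English text is byte-for-byte identical in ASCII and UTF-8,
# which is why ASCII-only programs cope with it.
assert "Hello".encode("utf-8") == "Hello".encode("ascii")
```

Note that the two UTF-16 variants contain the same bytes in swapped order, which is why the byte order has to be known before decoding.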
The Single Most Important Fact About Encodings
If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII.
There Ain't No Such Thing As Plain Text.
If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
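A quick way to see why the encoding has to be known: take one sequence of bytes and decode it under different assumptions. This is a minimal sketch in Python; the sample word is an assumption picked for illustration.

```python
# The same bytes, interpreted under three different assumed encodings.
raw = "caf\u00e9".encode("utf-8")     # b'caf\xc3\xa9'

as_utf8 = raw.decode("utf-8")         # the intended text
as_latin1 = raw.decode("latin-1")     # wrong guess: two junk characters

print(as_utf8)
print(as_latin1)

# Pretending the bytes are "plain" ASCII fails outright.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print("decodes as ASCII:", ascii_ok)
```

Nothing in the bytes themselves says which reading is correct; that information has to travel alongside the string.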
For an email message, you are expected to have a string in the header of the form
Content-Type: text/plain; charset="UTF-8"
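As a rough illustration of how a program might act on that header, the sketch below uses Python's standard `email` package to read the declared charset and decode the body with it. The message itself (sender, subject, body) is invented for the example.

```python
from email import message_from_bytes, policy

# A hypothetical raw message; only the Content-Type line mirrors the text above.
raw_message = (
    b"From: sender@example.com\r\n"
    b"Subject: Greeting\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="UTF-8"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xc3\xa9\r\n"
)

msg = message_from_bytes(raw_message, policy=policy.default)

# The charset parameter of Content-Type, lowercased; None if it is absent.
charset = msg.get_content_charset()

# Raw body bytes, decoded with the declared charset.
# Falling back to ASCII when no charset is declared is an assumption of this sketch.
body = msg.get_payload(decode=True).decode(charset or "ascii")

print(charset, repr(body))
```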
For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself -- not in the HTML itself, but as one of the response headers that are sent before the HTML page.
The web server itself wouldn't really know what encoding each file was written in, so it couldn't send the Content-Type header.
It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy... how can you read the HTML file until you know what encoding it's in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it's going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.
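The "read the start as ASCII, then start over" idea can be sketched in a few lines of Python. This is a simplified illustration, not how any particular browser implements it: the function names, the 1024-byte scan window, and the `cp1252` fallback are all assumptions made for the example.

```python
import re
from typing import Optional

# Looks for charset=... inside a <meta ...> tag. This works on raw bytes
# because the tag itself uses only ASCII-range characters.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(page: bytes, scan_limit: int = 1024) -> Optional[str]:
    """Return the charset named in an early meta tag, or None if none is found."""
    match = META_CHARSET.search(page[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


def decode_page(page: bytes, fallback: str = "cp1252") -> str:
    """Decode the whole page using the sniffed charset (hypothetical fallback)."""
    charset = sniff_meta_charset(page) or fallback
    try:
        return page.decode(charset, errors="replace")
    except LookupError:
        # The page named an encoding Python does not know.
        return page.decode(fallback, errors="replace")


page = (
    b"<html>\n<head>\n"
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
    b"<title>caf\xc3\xa9</title>\n"
)
print(sniff_meta_charset(page))
print(decode_page(page))
```

The scan window is why the tag needs to appear early: anything declared beyond it is never seen by this sketch.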