Unicode and Character Sets plus MySQL Latin1 to UTF-8 Conversion


I knew most of this… but alas, not all of it… BTW, here’s a relevant ThinkGeek.com present a friend gave me:

I found this interesting article on How To Change An Early WPMU Database from latin1 to utf8 Encoding, which has a bunch of useful links related to character encoding problems, WordPress (WPMU), and MySQL & PHP.

From the article in question:

So I have an announcement to make: if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.

And one more thing:


In this article I’ll fill you in on exactly what every working programmer should know. All that stuff about “plain text = ascii = characters are 8 bits” is not only wrong, it’s hopelessly wrong, and if you’re still programming that way, you’re not much better than a medical doctor who doesn’t believe in germs. Please do not write another line of code until you finish reading this article.

And then there’s this juicy tidbit:

For a while it seemed like that might be good enough, but programmers were complaining. “Look at all those zeros!” they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn’t have minded guzzling twice the number of bytes. But those Californian wimps couldn’t bear the idea of doubling the amount of storage it took for strings, and anyway, there were already all these doggone documents out there using various ANSI and DBCS character sets and who’s going to convert them all? Moi? For this reason alone most people decided to ignore Unicode for several years and in the meantime things got worse.

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software
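
To make the “look at all those zeros” complaint concrete, here is a minimal Python sketch. It uses Python's built-in `utf-16-le` codec as a stand-in for the two-bytes-per-character storage the quote describes; the helper name `byte_sizes` is made up for this illustration and is not from Joel's article.

```python
def byte_sizes(text):
    """Return the encoded size in bytes of `text` under several encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16-le": len(text.encode("utf-16-le")),
        "utf-32-le": len(text.encode("utf-32-le")),
    }


english = "Hello"

# Every ASCII letter is followed by a zero byte in a two-byte encoding.
print(english.encode("utf-16-le"))  # b'H\x00e\x00l\x00l\x00o\x00'

# English text doubles in size under UTF-16 but stays one byte per
# character under UTF-8.
print(byte_sizes(english))

# For Japanese text the trade-off goes the other way: UTF-8 needs three
# bytes for each of these characters, UTF-16 only two.
print(byte_sizes("日本語"))
```

For plain English the UTF-8 size equals the ASCII size, which is a large part of why UTF-8 won over the people who objected to doubling their storage.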

And some more:

Turning MySQL data in latin1 to utf-8 utf8

I’ve just finished one of the most difficult and tedious problems I’ve ever solved, so I have to share the solution here in a little tutorial of how I fixed this, even though I’m sure there are better ways, this is what worked for me.

My old CD Baby MySQL database from 1998 was filled with foreign characters and was in MySQL’s default (latin1) encoding.
For years, customers and clients had been using our web interface to give us their names, addresses, song titles, bio, and many things in all kinds of alphabets.
I wanted everything to be in UTF-8. (The database, the website, the MySQL client, everything.)

When I say “foreign characters” I mean not just Greek, Icelandic, Japanese, Chinese, Korean, and others shown at Omniglot, but also the curly-quotes, ellipsis, em-dash, and things described at alistapart.
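
As a rough illustration of why these characters are a problem for a single-byte encoding, the Python sketch below checks which sample strings can be encoded at all. It relies on Python's own `latin-1` (ISO 8859-1) and `cp1252` codecs; MySQL documents its `latin1` character set as cp1252-based, so neither codec is guaranteed to match MySQL byte for byte. The helper `encodable` and the sample strings are assumptions made for this example, not part of the original tutorial.

```python
def encodable(text, encoding):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample values of the kind a customer might type in.
samples = {
    "Icelandic": "þú",
    "curly quotes": "\u201chi\u201d",  # left and right double quotation marks
    "Greek": "αβγ",
    "Japanese": "日本",
}

for label, text in samples.items():
    support = {enc: encodable(text, enc) for enc in ("latin-1", "cp1252", "utf-8")}
    print(label, support)
```

Only UTF-8 can represent all four samples. The curly quotes are the subtle case: they fit in cp1252 but not in strict ISO 8859-1.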

And from AlexKing.org comes Fixing a MySQL Character Encoding Mismatch

We ran into an interesting MySQL character encoding issue at Crowd Favorite today while working to upgrade and launch a new client site.

Here is what we were trying to do: copy the production database to the staging database so we could properly configure and test everything before pushing the new site live. Pretty simple right? It was, until we noticed a bunch of weird character encoding issues on the staging site.

It turned out that while the database tables were set to Latin-1 (latin1), the content that populated those tables was encoded as UTF-8 (utf8).
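
The symptom of that kind of mismatch is easy to reproduce outside the database. The Python sketch below is not the fix described on AlexKing.org; it only simulates UTF-8 bytes being read as Latin-1 and then reverses the mistake. The function names are invented for this example, and Python's `latin-1` codec stands in for MySQL's `latin1`, which may differ for a handful of byte values.

```python
def misread_as_latin1(text):
    """Simulate UTF-8 encoded bytes being interpreted as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Undo the misreading: recover the raw bytes, then decode them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


original = "café"
garbled = misread_as_latin1(original)

# The two UTF-8 bytes of "é" (0xC3 0xA9) show up as two Latin-1 characters.
print(garbled)  # cafÃ©

# Reversing the same two steps restores the original text.
print(repair(garbled))  # café
```

The repair only works if the garbled text has not been re-saved or truncated in between. Once the wrong interpretation is written back as new bytes, each round trip adds another layer of garbling.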