PHP really doesn’t do Unicode

I’ve heard many times that PHP doesn’t really do Unicode, or not properly. In seven years of working primarily in PHP, always in UTF-8, I’ve never really hit a problem, so I always figured it was something esoteric and unimportant for me. But already this year I’ve seen this problem twice, in different ways.

PHP strings are not Unicode aware

I’ve heard this many times, and it never really bothered me in my work. I’ve always passed data around as UTF-8, from file contents to databases to the generated HTML and any accepted form content. All of it has worked seamlessly, including so-called “special” characters (i.e. those outside the usual ASCII range). But recently, I needed to write a function to convert “fancy” quotes back into regular old ASCII quotes, to get back some product names that had been mangled by WordPress’ wptexturize() function.

As it turns out, PHP strings are implemented as an array of bytes without encoding information. Fine, whatever, very interesting… but it also means that a string has no idea which characters its bytes represent, so (in PHP 5, at least) there are no Unicode escape sequences in string literals. If I want the right double quotation mark, U+201D, I can’t write '\u201D' like I can in JavaScript.
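To make the "array of bytes" point concrete, here is a small sketch that compares the byte count and the code point count of that one character. It relies only on json_decode(), strlen(), bin2hex() and preg_match_all() with the /u modifier, and it assumes the PCRE build has UTF-8 support (the usual case) and PHP 5.4 or later.

```php
<?php
// Build U+201D (right double quotation mark) from a JSON escape.
// The single quotes matter: PHP leaves \u201D alone and JSON decodes it.
$quote = json_decode('"\u201D"');

// strlen() counts bytes, not characters: U+201D is three bytes in UTF-8.
$byteCount = strlen($quote);
$hex = bin2hex($quote);

// With the /u modifier PCRE reads the subject as UTF-8, so "." matches
// one whole code point and the match count is the character count.
$charCount = preg_match_all('/./us', $quote);

// The same character written as raw UTF-8 bytes in a double-quoted string.
$sameQuote = "\xE2\x80\x9D";

// Prints: 3 e2809d 1 true
echo $byteCount, ' ', $hex, ' ', $charCount, ' ',
    var_export($quote === $sameQuote, true), "\n";
```

So the string holds three bytes that only mean "right double quotation mark" if everyone agrees to read them as UTF-8.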

The easiest way to get what I want is to lean on the fact that JSON, at least, knows what Unicode is: put the \uXXXX escape in a JSON string and decode it with json_decode(). Hat tip to this StackOverflow answer for that trick.

FWIW, here’s that function:

/**
* revert fancy quotes to normal ones so searches can work again...
* @param string $fancy
* @return string
*/
function untexturize($fancy) {
    static $fixes = false;

    if ($fixes === false) {
        $fixes = array(
            json_decode('"\u201C"') => '"',     // left  double quotation mark
            json_decode('"\u201D"') => '"',     // right double quotation mark
            json_decode('"\u2018"') => "'",     // left  single quotation mark
            json_decode('"\u2019"') => "'",     // right single quotation mark
            json_decode('"\u2032"') => "'",     // prime (minutes, feet)
            json_decode('"\u2033"') => '"',     // double prime (seconds, inches)
            json_decode('"\u2013"') => '-',     // en dash
            json_decode('"\u2014"') => '--',    // em dash
        );
    }

    $normal = strtr($fancy, $fixes);

    return $normal;
}
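For what it’s worth, PHP 7.0 and later do have a code point escape, "\u{201D}", which only works inside double-quoted strings and heredocs. A minimal sketch of the same mapping written that way follows; the function name untexturize_php7() is made up for illustration, and the type declarations assume PHP 7.0 or later.

```php
<?php
/**
 * Hypothetical PHP 7+ variant of untexturize(); the name is illustrative only.
 * Uses "\u{...}" code point escapes, which require double-quoted strings.
 *
 * @param string $fancy
 * @return string
 */
function untexturize_php7(string $fancy): string {
    static $fixes = [
        "\u{201C}" => '"',    // left  double quotation mark
        "\u{201D}" => '"',    // right double quotation mark
        "\u{2018}" => "'",    // left  single quotation mark
        "\u{2019}" => "'",    // right single quotation mark
        "\u{2032}" => "'",    // prime (minutes, feet)
        "\u{2033}" => '"',    // double prime (seconds, inches)
        "\u{2013}" => '-',    // en dash
        "\u{2014}" => '--',   // em dash
    ];

    // strtr() with an array replaces every key with its value in one pass.
    return strtr($fancy, $fixes);
}

// Prints: "Acme" 6" ruler - blue
echo untexturize_php7("\u{201C}Acme\u{201D} 6\u{2033} ruler \u{2013} blue"), "\n";
```

On older PHP versions the json_decode() version above is still the one to use.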

PHP file functions fake Unicode support

Yes, it’s true — PHP isn’t really doing Unicode-safe filename operations. It just happens to work nicely on Linux (and I presume, Unix), but it fails miserably on Windows. Essentially, PHP uses ISO-8859-1 when talking about filenames, at least on Windows. You can pass it nice Unicode filenames, and it’ll create files with those filenames, open them happily, copy them, etc. but if you check the filesystem you’ll see weird character encoding artefacts where you’d expect nice non-ASCII characters. This StackOverflow question and its answer give some more detail.
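Those artefacts are what you get when UTF-8 bytes are read one byte per character. The sketch below reproduces the effect by hand, assuming the single-byte encoding really is ISO-8859-1 as described above (on a given Windows machine it may be a different code page). The helper latin1_bytes_as_utf8() is hypothetical and only illustrates the mangling; it is not what PHP does internally.

```php
<?php
/**
 * Hypothetical helper: treat every byte of $bytes as one ISO-8859-1
 * character and return the result encoded as UTF-8.
 */
function latin1_bytes_as_utf8($bytes) {
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);
        // ASCII bytes stay as they are; 0x80-0xFF become two-byte UTF-8.
        $out .= ($b < 0x80)
            ? chr($b)
            : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
    }
    return $out;
}

$name = "caf\xC3\xA9.txt";               // "café.txt" as UTF-8 bytes
$mangled = latin1_bytes_as_utf8($name);

echo $mangled, "\n";                     // shows cafÃ©.txt on a UTF-8 terminal
echo bin2hex($mangled), "\n";
```

The two bytes of "é" turn into two separate characters, which is the kind of filename you end up seeing in Explorer.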

How this affects you in practice is that any file with a non-ASCII filename can be accessible by name either to PHP or to everything else, but not to both. For example, a file with Chinese characters in its filename can be reachable by name through a web server (e.g. can load into a browser), or PHP can open it by name, but not both. Of course, the file itself is still there for both; they just don’t agree on what it’s called.

How I’ve seen this impact people is through website migration tools that copy files from Linux to Windows using PHP. On Linux, everything seems OK, and the migration tool happily copies the file to a Windows website, but because the file copy happens with PHP the filename becomes corrupted. The PHP code doesn’t notice because it can still access that file, but the website no longer loads the file. This is not the fault of the migration tools (I’ve seen two different ones tussle with this problem), it’s PHP not properly handling Unicode.
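One cheap safeguard, sketched here as an assumption rather than something either tool actually does, is to flag filenames containing non-ASCII bytes before copying so that someone can check them by hand. The function name find_non_ascii_names() and the sample filenames are invented for the example.

```php
<?php
/**
 * Hypothetical pre-flight check for a migration script: return the names
 * that contain at least one byte outside the 7-bit ASCII range.
 *
 * @param array $names list of filenames (byte strings)
 * @return array the subset that may be mangled on Windows
 */
function find_non_ascii_names(array $names) {
    return array_values(array_filter($names, function ($name) {
        // No /u modifier here: the pattern is matched byte by byte.
        return preg_match('/[^\x00-\x7F]/', $name) === 1;
    }));
}

$risky = find_non_ascii_names(array('logo.png', "caf\xC3\xA9.txt", 'readme.md'));
print_r($risky);
```

This doesn’t fix anything, but it at least turns a silent corruption into a list of files to look at.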

What to do about it

Well, apart from “suck it up” and wait for the mythical PHP6 to arrive, or ditch PHP in favour of another language, the best option is to muddle through with Linux/Unix hosting. PHP on Windows can be a nuisance in many different ways, mostly tolerable, but the filenames vs Unicode problem means Windows makes a very poor PHP development system. I always wondered about all of those people running XAMPP on Windows, and whether it might bite them in the butt some day, and now I know. I’m just glad I made the jump to Linux when PHP started taking over my work life.

Edit: there’s a good discussion of the challenges in building PHP6 with Unicode support on the Programmers’ StackExchange question: “Why exactly can’t PHP have full unicode support?”
