Simple URL cleanser in PHP

There are lots of ways to tidy up a URL to remove “special” characters. Some do more than others, some purport to do a lot but really don’t do much. Some, like WordPress’ sanitize_title_with_dashes, throw six or more regular expressions and a couple of multi-byte string conversions at your string but don’t handle accented characters. Here’s a simple one that does convert accented characters.

This isn’t for everyone; I wrote it for the sorts of website applications I get called on to build, where it is important for the page text to preserve all special characters but it’s OK for the page’s URL to be much simpler. It doesn’t have to allow for a range of different string types and character sets, because I always work in UTF-8. Because I can make these assumptions, it’s simple and fast (it uses only one very simple regular expression).

For conversion of non-ASCII characters including accented characters like àáâãäåèéêë etc. into their simplified basic ASCII character cousins, it uses the function iconv which is made exactly for the purpose. I’ve seen lots of examples that mess around with lookup arrays, but this function does the job right and does it quickly. You just tell it what character set your string is, what you want it to be, and how to make the transformation. NB: for iconv to know how to make that transformation, it needs to know what your locale is; if your environment doesn’t already set that, you should do so by calling setlocale.
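As a minimal sketch of that step on its own: the locale names below are assumptions (what is installed varies from system to system), and the sample text is made up for illustration. The exact transliteration depends on the iconv implementation PHP was built against; on glibc-based Linux with a suitable locale it typically prints Creme brulee, but other implementations may differ.

```php
<?php
// Assumption: one of these locale names exists on your system.
// setlocale() tries each in turn and returns false if none is available.
$locale = setlocale(LC_CTYPE, 'en_AU.UTF-8', 'en_AU.utf8', 'en_AU');

$text = 'Crème brûlée';

// //TRANSLIT asks iconv to approximate characters that ASCII cannot represent.
$ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $text);

// iconv() returns false on failure, so fall back to the original text.
echo ($ascii === false ? $text : $ascii), "\n";
```

If the output is full of question marks, the locale is the first thing to check.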

To strip unwanted punctuation, or convert it into a nice, safe minus sign '-', it uses the strtr function. This function accepts an array of replacement pairs, so it can strip or convert all the undesirable punctuation characters in one hit.
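Here is a small sketch of strtr with an array, using a cut-down map that is for illustration only (the full map is in the function further down). Mapping a character to an empty string strips it; mapping it to '-' converts it.

```php
<?php
// Illustrative map only: '' strips a character, '-' replaces it.
$map = array(
    "'" => '',
    '!' => '',
    ' ' => '-',
    '/' => '-',
);

$result = strtr("Don't stop/start now!", $map);
echo $result, "\n"; // Dont-stop-start-now
```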

Once that’s done, there is a possibility of a silly-looking string of minus signs. They can be easily reduced to a single character with a simple regular expression using preg_replace. There might still be a silly-looking minus sign at the end of the string though, so that should be removed (unless the string is now just '-', in which case we want to keep it!) and rtrim fits the bill nicely for that job. Yes, rtrim can trim more than just spaces!
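A quick sketch of those two clean-up steps in isolation, with a made-up input string:

```php
<?php
$text = 'act--now---';

// Collapse each run of two or more '-' into a single '-'.
$text = preg_replace('/-{2,}/', '-', $text);   // 'act-now-'

// rtrim()'s second argument lists the characters to strip from the end.
if ($text !== '-') {
    $text = rtrim($text, '-');                 // 'act-now'
}

echo $text, "\n";
```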

Finally, just because I’m a belt-and-braces programmer, the string is URL-encoded to preserve anything I’ve missed (although I’ve yet to see anything pop up).
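To show what that last step does and doesn’t touch, here is a tiny sketch with made-up strings: rawurlencode leaves letters, digits, '-', '_' and '.' alone and percent-encodes the rest.

```php
<?php
// An already-clean string passes through unchanged.
echo rawurlencode('John-Does-Resume'), "\n"; // John-Does-Resume

// Anything that slipped through is percent-encoded.
echo rawurlencode('50% off'), "\n";          // 50%25%20off
```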

Here’s the code:

/**
* reduce rich character set string to URL-compatible string
* @param string $text original string
* @return string
*/
function stringForURL($text) {
    // replace accented characters with unaccented characters
    $newText = iconv('UTF-8', 'ASCII//TRANSLIT', $text);

    // remove unwanted punctuation, convert some to '-'
    static $punc = array(
        // remove
        "'" => '', '"' => '', '`' => '', '=' => '', '+' => '', '*' => '', '&' => '', '^' => '', '\\' => '',
        '%' => '', '$' => '', '#' => '', '@' => '', '!' => '', '<' => '', '>' => '', '?' => '',
        // convert to minus
        '[' => '-', ']' => '-', '{' => '-', '}' => '-', '(' => '-', ')' => '-',
        ' ' => '-', ',' => '-', ';' => '-', ':' => '-', '/' => '-', '|' => '-'
    );
    $newText = strtr($newText, $punc);

    // clean up multiple '-' characters
    $newText = preg_replace('/-{2,}/', '-', $newText);

    // remove trailing '-' character if string not just '-'
    if ($newText != '-')
        $newText = rtrim($newText, '-');

    // return a URL-encoded string
    return rawurlencode($newText);
}

And here’s a simple test with just a couple of examples. Note that I’m setting the locale first!

function test() {
    setlocale(LC_CTYPE, 'en_AU');

    $tests = array(
        'quick sly fox',
        "John Doe's Résumé",
        'Whole lot of Rosé',
        "Don't wait - act now!",
        '#9 pan-head slotted stainless (80%)',
    );

    echo "<table>\n";
    echo "<tr><th>before</th><th>after</th></tr>\n";
    foreach ($tests as $test) {
        echo '<tr><td>', htmlspecialchars($test), '</td><td>', htmlspecialchars(stringForURL($test)), "</td></tr>\n";
    }
    echo "</table>\n";
}

And the result of that test:

before                               after
quick sly fox                        quick-sly-fox
John Doe's Résumé                    John-Does-Resume
Whole lot of Rosé                    Whole-lot-of-Rose
Don't wait - act now!                Dont-wait-act-now
#9 pan-head slotted stainless (80%)  9-pan-head-slotted-stainless-80

There. Job-is-done.

  • willem

    Just tried yours for the following. But it didn’t work at all. It even breaks:


    echo stringForURL("Mess'd up --text-- just (to) stress /test/ ?our! `little` clean url fun.ction!?-->")."\n";
    echo stringForURL("Perché l'erba è verde?")."\n";
    echo stringForURL("Académie_française is zo'n @ organisatie (xÇ{åç€ æ ) tekens?")."\n"; // French
    echo stringForURL("Tänk efter nu – Tänk'n vi föser dig bort")."\n"; // Swedish
    echo stringForURL("ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöùúûüýÿ")."\n";
    echo stringForURL("Custom`delimiter*example")."\n";
    echo stringForURL("My+Last_Crazy|delimiter/example")."\n";

    [edited by mod for code formatting]

    • G’day Willem,

      I tried your examples, and this is what I got as output:

      Messd-up-text-just-to-stress-test-our-little-clean-url-fun.ction
      Perche-lerba-e-verde
      Academie_francaise-is-zon-organisatie-xC-acEUR-ae-tekens
      Tank-efter-nu-forrn-vi-foser-dig-bort
      AAAAAAAECEEEEIIIINOOOOOUUUUYssaaaaaaaeceeeeiiiinooooouuuuyy
      Customdelimiterexample
      MyLast_Crazy-delimiter-example

      Interesting that iconv transliterates € as EUR, but otherwise as I’d have expected.

      I’m running PHP 5.3.8 on Linux, with no default_charset defined in php.ini but with UTF-8 as my default charset in Apache’s httpd.conf, which is how Fedora sets things up by default. Perhaps your system has a different default character setting?

      Can you please describe how it breaks for you and what output you get?

      cheers,
      Ross

    • Oh, and I nearly forgot: iconv really needs to have a locale defined or it just doesn’t work at all; perhaps having a locale defined that expects some of these characters actually stops my function working as expected for you. I’m running with en-AU, set by this call:
      setlocale(LC_ALL, 'en_AU');
      Can you tell me please what your locale is, if set?

      cheers,
      Ross

  • Thank you for the code, great work.