Simple URL cleanser in PHP

This post is more than 13 years old.

There are lots of ways to tidy up a URL to remove “special” characters. Some do more than others; some purport to do a lot but really don’t do much. Some, like WordPress’ sanitize_title_with_dashes, throw six or more regular expressions and a couple of multi-byte string conversions at your string but still don’t handle accented characters. Here’s a simple one that does convert accented characters.

This isn’t for everyone; I wrote it for the sorts of web applications I get called on to build, where it is important for the page text to preserve all special characters but it’s OK for the page’s URL to be much simpler. It doesn’t have to take into account a range of different string types and character sets, because I always work in UTF-8. Because I can make these assumptions, it’s simple and fast (it uses only one very simple regular expression).

For conversion of non-ASCII characters, including accented characters like àáâãäåèéêë, into their basic ASCII cousins, it uses the iconv function, which is made for exactly this purpose. I’ve seen lots of examples that mess around with lookup arrays, but this function does the job right and does it quickly. You just tell it what character set your string is in, what you want it to be, and how to make the transformation. NB: for iconv to know how to make that transformation, it needs to know what your locale is; if your environment doesn’t already set that, you should do so by calling setlocale.
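
As a rough sketch of that step on its own: the helper name toAscii and the list of locale names are assumptions for illustration, and which locales exist depends on your server. What iconv produces for an accented character also varies between iconv implementations, so no expected output is shown here. The sketch also guards against iconv returning false, which it does when it cannot convert the input.

```php
<?php
// Try a few UTF-8 locales; setlocale uses the first one the system has.
// These locale names are assumptions, adjust them for your environment.
setlocale(LC_CTYPE, 'en_AU.UTF-8', 'en_US.UTF-8', 'C.UTF-8');

// Hypothetical helper: transliterate UTF-8 text to plain ASCII,
// returning the input unchanged if iconv cannot convert it.
function toAscii($text) {
    // iconv returns false on failure (and raises a notice, silenced here)
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);
    return $ascii === false ? $text : $ascii;
}

echo toAscii('Whole lot of Rosé'), "\n";
```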

To strip unwanted punctuation, or convert it into a nice, safe minus sign ‘-’, it uses the strtr function. This function accepts an array to direct multiple transliterations, so it can be used to handle stripping or converting all the undesirable punctuation characters in one hit.
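
To show the array form of strtr in isolation, here is a minimal sketch. The helper name stripPunctuation and its shortened map are made up for illustration; the real function uses a longer list.

```php
<?php
// Hypothetical helper showing strtr with a replacement array.
function stripPunctuation($text) {
    // A shortened, illustrative map: keys are replaced by their values.
    static $map = array(
        "'" => '', '!' => '',               // remove
        ' ' => '-', '(' => '-', ')' => '-', // convert to minus
    );
    return strtr($text, $map);
}

echo stripPunctuation("Don't wait (act now)!"), "\n"; // Dont-wait--act-now-
```

Note the doubled and trailing minus signs in the output; strtr makes a single pass and does not tidy up after itself.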

Once that’s done, there may be a silly-looking run of minus signs. It can be easily reduced to a single character with a simple regular expression using preg_replace. There might still be a silly-looking minus sign at the end of the string though, so that should be removed (unless the string is now just ‘-’, in which case we want to keep it!) and rtrim fits the bill nicely for that job. Yes, rtrim can trim more than just spaces!
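
Those two clean-up steps can be sketched on their own like this. The helper name tidyDashes is hypothetical, used only for this example.

```php
<?php
// Hypothetical helper: collapse runs of '-' and drop any trailing '-'.
function tidyDashes($text) {
    // two or more consecutive minus signs become one
    $text = preg_replace('/-{2,}/', '-', $text);

    // keep a lone '-' as is; otherwise trim minus signs from the end
    return $text === '-' ? $text : rtrim($text, '-');
}

echo tidyDashes('Dont-wait--act-now-'), "\n"; // Dont-wait-act-now
echo tidyDashes('---'), "\n";                 // -
```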

Finally, just because I’m a belt-and-braces programmer, the string is URL-encoded to preserve anything I’ve missed (although I’ve yet to see anything pop up).
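
For a feel of what that last step does, here are a few calls to rawurlencode with made-up inputs. The accented character is written as explicit UTF-8 bytes so the result doesn’t depend on how the source file is saved.

```php
<?php
// Letters, digits and the characters - _ . ~ pass through untouched.
echo rawurlencode('pan-head_v1.0~beta'), "\n"; // pan-head_v1.0~beta

// Anything else is percent-encoded, one byte at a time.
echo rawurlencode('50% off'), "\n";            // 50%25%20off
echo rawurlencode("Ros\xC3\xA9"), "\n";        // Ros%C3%A9
```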

Here’s the code:

/**
* reduce rich character set string to URL-compatible string
* @param string $text original string
* @return string
*/
function stringForURL($text) {
    // replace accented characters with unaccented characters
    $newText = iconv('UTF-8', 'ASCII//TRANSLIT', $text);

    // remove unwanted punctuation, convert some to '-'
    static $punc = array(
        // remove
        "'" => '', '"' => '', '`' => '', '=' => '', '+' => '', '*' => '', '&' => '', '^' => '', '\\' => '',
        '%' => '', '$' => '', '#' => '', '@' => '', '!' => '', '<' => '', '>' => '', '?' => '',
        // convert to minus
        '[' => '-', ']' => '-', '{' => '-', '}' => '-', '(' => '-', ')' => '-',
        ' ' => '-', ',' => '-', ';' => '-', ':' => '-', '/' => '-', '|' => '-'
    );
    $newText = strtr($newText, $punc);

    // clean up multiple '-' characters
    $newText = preg_replace('/-{2,}/', '-', $newText);

    // remove trailing '-' character if string not just '-'
    if ($newText != '-')
        $newText = rtrim($newText, '-');

    // return a URL-encoded string
    return rawurlencode($newText);
}

And here’s a simple test with just a couple of examples. Note that I’m setting the locale first!

function test() {
    setlocale(LC_CTYPE, 'en_AU');

    $tests = array(
        'quick sly fox',
        "John Doe's Résumé",
        'Whole lot of Rosé',
        "Don't wait - act now!",
        '#9 pan-head slotted stainless (80%)',
    );

    echo "<table>\n";
    echo "<tr><th>before</th><th>after</th></tr>\n";
    foreach ($tests as $test) {
        echo '<tr><td>', htmlspecialchars($test), '</td><td>', htmlspecialchars(stringForURL($test)), "</td></tr>\n";
    }
    echo "</table>\n";
}

And the result of that test:

before                                 after
quick sly fox                          quick-sly-fox
John Doe's Résumé                      John-Does-Resume
Whole lot of Rosé                      Whole-lot-of-Rose
Don't wait - act now!                  Dont-wait-act-now
#9 pan-head slotted stainless (80%)    9-pan-head-slotted-stainless-80

There. Job-is-done.