Yet another programmer blogging about code

Filter invalid characters from XMLReader input files

I just ran into a problem reading XML data exported from a Microsoft Access database. For whatever reason, Access has written VT (vertical tab, 0x0B) characters into the XML. VT is not a legal character in XML 1.0, so PHP’s XMLReader baulks at it. To handle that on each data load without requiring users to edit their XML, I wrote a simple PHP stream filter that replaces each VT character with an LF (line feed).

Since PHP 5, it’s very easy to define your own stream filters. This lets you insert a filter into any stream-based read or write, so that you can augment, clean up, or monitor what’s flowing through the stream. The documentation for streams is rather terse, but with a bit of Googling and poking about, there’s enough information out there to supply some answers.
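To make the idea concrete before getting to my own filter, here is a minimal sketch using only things that ship with PHP: the built-in string.toupper filter and an in-memory stream. Neither has anything to do with the Access problem; they are just the smallest example I could think of for attaching a filter to a stream and removing it again.

```php
<?php
// Open a read/write in-memory stream; nothing touches the disk.
$stream = fopen('php://memory', 'w+');

// Attach the built-in upper-casing filter to the write side of the stream.
$filter = stream_filter_append($stream, 'string.toupper', STREAM_FILTER_WRITE);
fwrite($stream, "hello, filters\n");

// Remove the filter again; later writes pass through untouched.
stream_filter_remove($filter);
fwrite($stream, "plain text\n");

// Read back everything that was written.
rewind($stream);
$contents = stream_get_contents($stream);
fclose($stream);

echo $contents;
```

Only the first line goes through the filter; the second is written after the filter has been removed.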

For my task at hand, the filter is a very simple one, so I’ll just dive in and show it:

/**
 * stream filter for cleaning up invalid data exported by Microsoft Access
 */
class MSAccessXmlFilter extends php_user_filter {
    /**
     * @param resource $in a resource pointing to a bucket brigade which contains one or more bucket objects containing data to be filtered
     * @param resource $out a resource pointing to a second bucket brigade into which your modified buckets should be placed
     * @param int &$consumed should be incremented by the length of the data which your filter reads in and alters
     * @param bool $closing whether the stream is in the process of closing (and therefore this is the last pass through the filterchain)
     * @return int PSFS_PASS_ON, because every bucket is handed on to $out
     */
    public function filter($in, $out, &$consumed, $closing) {
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;

            // replace VT characters with LF
            $bucket->data = strtr($bucket->data, "\v", "\n");

            // send the clean data out
            stream_bucket_append($out, $bucket);
        }

        return PSFS_PASS_ON;
    }
}

Once the filter is defined, it needs to be registered under a name.

stream_filter_register('msaccessxml', 'MSAccessXmlFilter');
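If you want to check the filter on its own before involving XMLReader, something like the following should do. It is only a sketch: the class is repeated in compact form so that the snippet runs by itself, the `: int` return type is my addition (it assumes PHP 7 or later and avoids a deprecation notice on PHP 8.1+), and the test string is made up.

```php
<?php
// Compact copy of the filter above so this snippet runs on its own.
class MSAccessXmlFilter extends php_user_filter {
    public function filter($in, $out, &$consumed, $closing): int {
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            $bucket->data = strtr($bucket->data, "\v", "\n");
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}

// stream_filter_register() returns false on failure,
// for example when the name is already registered.
if (!stream_filter_register('msaccessxml', 'MSAccessXmlFilter')) {
    throw new RuntimeException('could not register the msaccessxml filter');
}

// Put a made-up string containing a VT into an in-memory stream...
$stream = fopen('php://memory', 'w+');
fwrite($stream, "line one\vline two");
rewind($stream);

// ...and read it back through the filter.
stream_filter_append($stream, 'msaccessxml', STREAM_FILTER_READ);
$cleaned = stream_get_contents($stream);
fclose($stream);

var_dump($cleaned === "line one\nline two");
```

Checking the return value of stream_filter_register() is worth the extra lines, since a failed registration otherwise only shows up later when the stream is opened.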

To make XMLReader use this filter, build a path with the special php://filter wrapper, naming the filter(s) to apply and the real path to open. This path is then given to XMLReader’s open() method instead of the real path.

$path = 'php://filter/read=msaccessxml/resource=' . $real_path;
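Putting the pieces together, an end-to-end run might look like the sketch below. The temporary file, the `<row>`/`<note>` element names, and the sample text are all invented for illustration; the filter class is repeated in compact form (with an added `: int` return type, assuming PHP 7 or later) so that the snippet stands alone.

```php
<?php
// Compact copy of the filter from this post so the sketch is self-contained.
class MSAccessXmlFilter extends php_user_filter {
    public function filter($in, $out, &$consumed, $closing): int {
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            $bucket->data = strtr($bucket->data, "\v", "\n");
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}
stream_filter_register('msaccessxml', 'MSAccessXmlFilter');

// Invented sample data: a tiny export with a VT inside a text node.
$real_path = tempnam(sys_get_temp_dir(), 'acc');
file_put_contents(
    $real_path,
    "<?xml version=\"1.0\"?>\n<row><note>first\vsecond</note></row>"
);

// Wrap the real path so that reads pass through the filter.
$path = 'php://filter/read=msaccessxml/resource=' . $real_path;

$reader = new XMLReader();
if (!$reader->open($path)) {
    throw new RuntimeException('could not open ' . $path);
}

// Walk the document and pick up the text of the <note> element.
$note = null;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'note') {
        $note = $reader->readString();
    }
}
$reader->close();
unlink($real_path);

var_dump($note === "first\nsecond");
```

Opening `$real_path` directly instead of `$path` is a quick way to see the parse error that the filter is there to prevent.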

Job done, cleanly and on the fly.