Parsing large XML files with PHP

Parsing a 250 MB XML file with around 25 000 "items" using PHP's XML parsers isn't easy. I'll provide a solution, but I'll also discuss the core problem.

I started with the easiest way of parsing XML in PHP: simplexml_load_file().

<?php
ini_set('memory_limit', '-1');
$simpleXML = simplexml_load_file('file.xml');
echo memory_get_peak_usage();

This made my computer unresponsive for about 20 minutes. When it came back to life, I discovered that the script had also used about 8 GB of memory.

Loading a file into a variable, as below, copies the whole file from disk into memory.

<?php
// file.xml is 250 MB on disk
$contents = file_get_contents('file.xml');
// do something with this content

This will of course take at least 250 MB of memory. Space is space: one byte on disk is one byte in memory.
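To make that concrete, here is a minimal sketch that measures how much PHP's memory usage grows when a file is loaded with file_get_contents(). The 1 MB temporary file and the variable names are assumptions made for the illustration; the exact numbers will differ between PHP versions and platforms.

```php
<?php
// Create a 1 MB temporary file to stand in for the large XML file
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, str_repeat('x', 1024 * 1024));

// Measure memory usage before and after loading the file into a variable
$before = memory_get_usage();
$contents = file_get_contents($path);
$after = memory_get_usage();
$delta = $after - $before;

echo 'File size: ' . filesize($path) . " bytes\n";
echo 'Memory growth: ' . $delta . " bytes\n";

// Clean up the temporary file
unlink($path);
```

The growth should be at least as large as the file itself, since the whole content now lives in $contents.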

By then parsing this content with an XML parser, I made sure that I used all available resources. SimpleXML works by turning every node into an instance of SimpleXMLElement, so there are a lot of them to hold in memory at the same time.

The usual way of handling this in programs is to process smaller parts, one at a time. In PHP you can read a file piece by piece with the fread function:

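<?php
$fh = fopen('file.xml', 'r');
$chunkSize = 1024;
// Loop until the end of the file. Testing the fread() result directly would
// stop early on a chunk that PHP treats as falsy, such as the string "0".
while(!feof($fh)) {
    $data = fread($fh, $chunkSize);
    // Do something with data
}
fclose($fh);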
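One thing to watch for when reading in chunks is that a tag can be split across two reads, so searching each chunk on its own can miss it. The sketch below shows one way to handle that by carrying the tail of the previous chunk over to the next read. The function name findInChunks and the in-memory test stream are assumptions made for this illustration.

```php
<?php
// Hypothetical helper: returns the byte offset of $needle in the stream,
// or false if it is not found.
function findInChunks($handle, $needle, $chunkSize = 1024) {
    $offset = 0;                  // byte offset of the start of $buffer
    $buffer = '';
    $keep = strlen($needle) - 1;  // bytes to carry over between reads
    rewind($handle);
    while (!feof($handle)) {
        $data = fread($handle, $chunkSize);
        if ($data === false || $data === '') {
            break;
        }
        $buffer .= $data;
        $pos = strpos($buffer, $needle);
        if ($pos !== false) {
            return $offset + $pos;
        }
        // Keep only the tail that could still be the start of a match
        $tail = $keep > 0 ? substr($buffer, -$keep) : '';
        $offset += strlen($buffer) - strlen($tail);
        $buffer = $tail;
    }
    return false;
}

// A chunk size of 4 forces '<item>' to be split across two reads
$handle = fopen('php://memory', 'w+b');
fwrite($handle, 'ab<item>x</item>');
$found = findInChunks($handle, '<item>', 4);
fclose($handle);
var_dump($found); // int(2)
```

Note that the search uses strpos() rather than mb_strpos(), because file offsets are counted in bytes.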

If the data is formatted as a list of nodes, you need to read the items individually. So for XML you want to get the 'inner' data per node. Normally you would use either regular expressions or an XML parser to get that data. Regular expressions are very powerful but quickly become unmaintainable and buggy for more complicated documents. But as I have shown, loading the whole document with an XML parser is not a viable option.

I needed a solution that would read through the file piece by piece and return it node by node, convert each node into a SimpleXML node, and then return that as an array. I googled around, but couldn't find a solution that I was happy with.

I did what every programmer usually does: I wrote my own.

<?php

// Open the XML
$handle = fopen('file.xml', 'r');

// Get the node strings incrementally from the XML file by defining a callback,
// in this case an anonymous function.
nodeStringFromXMLFile($handle, '<item>', '</item>', function($nodeText){
    // Transform the XML string into an array and print it
    print_r(getArrayFromXMLString($nodeText));
});
fclose($handle);

/**
 * For every node that starts with $startNode and ends with $endNode call $callback
 * with the string as an argument
 *
 * Note: Sometimes it returns two nodes instead of a single one; this can easily be
 * handled by the callback though. This function's primary job is to split a large file
 * into manageable XML nodes.
 *
 * the callback will receive one parameter, the XML node(s) as a string
 *
 * @param resource $handle - a file handle
 * @param string $startNode - what is the start node name e.g <item>
 * @param string $endNode - what is the end node name e.g </item>
 * @param callable $callback - an anonymous function
 */
function nodeStringFromXMLFile($handle, $startNode, $endNode, $callback=null) {
    $cursorPos = 0;
    while(true) {
        // Find start position
        $startPos = getPos($handle, $startNode, $cursorPos);
        // We reached the end of the file or an error
        if($startPos === false) { 
            break;
        }
        // Find where the node ends
        $endPos = getPos($handle, $endNode, $startPos);
        // No closing tag found: the file is truncated or malformed, so stop
        if($endPos === false) {
            break;
        }
        // File offsets are in bytes, so use strlen() rather than mb_strlen()
        $endPos += strlen($endNode);
        // Jump back to the start position
        fseek($handle, $startPos);
        // Read the data
        $data = fread($handle, ($endPos-$startPos));
        // Pass the $data into the callback, if one was given
        if(is_callable($callback)) {
            $callback($data);
        }
        // next iteration starts reading from here
        $cursorPos = ftell($handle);
    }
}

/**
 * Returns the position of the first occurrence of $string in a resource.
 *
 * Starting at $startFrom, it reads $chunk bytes at a time to avoid reading the
 * whole file at once. Consecutive reads overlap slightly so that a match that
 * spans two chunks is still found.
 * 
 * @param resource $handle - typically a file handle
 * @param string $string - what string to search for
 * @param int $startFrom - byte offset to start searching from
 * @param int $chunk - number of bytes to read at a time
 * @return int|bool - the byte offset, or false on EOF or errors
 */
function getPos($handle, $string, $startFrom=0, $chunk=1024) {
    // How far to move forward between reads, leaving an overlap for split matches
    $step = max(1, $chunk - strlen($string) + 1);
    while(true) {
        // Set the file cursor on the startFrom position
        fseek($handle, $startFrom, SEEK_SET);
        // Read data
        $data = fread($handle, $chunk);
        // fread() failed or there is nothing left to read
        if($data === false || $data === '') {
            return false;
        }
        // Byte-based search, because fseek() and fread() work with byte offsets
        $stringPos = strpos($data, $string);
        // We found the string, return the position
        if($stringPos !== false) {
            return $stringPos+$startFrom;
        }
        // We reached the end of the file
        if(feof($handle)) {
            return false;
        }
        // Move forward and read more data
        $startFrom += $step;
    }
}

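/**
 * Turn an XML string into an array by using SimpleXML
 *
 * @param string $nodeAsString - a string representation of an XML node
 * @return array
 */
function getArrayFromXMLString($nodeAsString) {
    // Collect libxml errors so that they can be reported below
    $previous = libxml_use_internal_errors(true);
    $simpleXML = simplexml_load_string($nodeAsString);
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if($errors) {
        // libxml_get_errors() returns LibXMLError objects, so extract the messages
        $messages = array_map(function($e) { return trim($e->message); }, $errors);
        user_error('Libxml reported errors: ' . implode(', ', $messages), E_USER_WARNING);
    }
    return simplexml2array($simpleXML);
}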

/**
 * Turns a SimpleXMLElement into an array
 *
 * @param SimpleXMLElement $xml
 * @return array 
 */
function simplexml2array($xml) {
    if(is_object($xml) && get_class($xml) == 'SimpleXMLElement') {
        $attributes = $xml->attributes();
        foreach($attributes as $k=>$v) {
            $a[$k] = (string) $v;
        }
        $x = $xml;
        $xml = get_object_vars($xml);
    }

    if(is_array($xml)) {
        if(count($xml) == 0) { 
            return (string) $x; 
        }
        $r = array();
        foreach($xml as $key=>$value) {
            $r[$key] = simplexml2array($value);
        }
        // Add the attributes collected above, if any
        if (isset($a)) {
            $r['@attributes'] = $a;
        }
        return $r;
    }
    return (string) $xml;
}

Also see the gist 3045663 where any bugfixes will go.
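The docblock above notes that the callback sometimes receives two nodes instead of one. One possible way to handle that in the callback, sketched here under the assumption that the nodes are named item, is to wrap the text in a dummy root element so that one or several sibling nodes parse as a single well-formed document. The helper name itemsFromNodeText is made up for this example.

```php
<?php
// Hypothetical helper: returns every <item> found in $nodeText as a
// SimpleXMLElement, whether the text holds one node or several.
function itemsFromNodeText($nodeText) {
    // A dummy root makes sibling nodes a well-formed document
    $root = simplexml_load_string('<root>' . $nodeText . '</root>');
    if ($root === false) {
        return array();
    }
    $items = array();
    foreach ($root->item as $item) {
        $items[] = $item;
    }
    return $items;
}

// Two nodes delivered in one string, as the note describes
$text = '<item><title>A</title></item><item><title>B</title></item>';
$items = itemsFromNodeText($text);
foreach ($items as $item) {
    echo (string) $item->title . "\n";
}
```

Each returned element could then be passed to simplexml2array() in the same way as a single node.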