Unfortunately, XML exists. For whatever reason, somebody once thought it was a good idea and now we’re stuck with it. Since most modern applications and APIs use JSON instead, it’s often necessary to convert XML into JSON. Here is a PHP function to do that very thing.

function xmlToArray($xml, $options = array()) {
	$defaults = array(
		'namespaceSeparator' => ':',//you may want this to be something other than a colon
		'attributePrefix' => '@',   //to distinguish between attributes and nodes with the same name
		'alwaysArray' => array(),   //array of xml tag names which should always become arrays
		'autoArray' => true,        //only create arrays for tags which appear more than once
		'textContent' => '$',       //key used for the text content of elements
		'autoText' => true,         //skip textContent key if node has no attributes or child nodes
		'keySearch' => false,       //optional search and replace on tag and attribute names
		'keyReplace' => false       //replace values for above search values (as passed to str_replace())
	);
	$options = array_merge($defaults, $options);
	$namespaces = $xml->getDocNamespaces();
	$namespaces[''] = null; //add base (empty) namespace

	//get attributes from all namespaces
	$attributesArray = array();
	foreach ($namespaces as $prefix => $namespace) {
		foreach ($xml->attributes($namespace) as $attributeName => $attribute) {
			//replace characters in attribute name
			if ($options['keySearch']) $attributeName =
					str_replace($options['keySearch'], $options['keyReplace'], $attributeName);
			$attributeKey = $options['attributePrefix']
					. ($prefix ? $prefix . $options['namespaceSeparator'] : '')
					. $attributeName;
			$attributesArray[$attributeKey] = (string)$attribute;
		}
	}

	//get child nodes from all namespaces
	$tagsArray = array();
	foreach ($namespaces as $prefix => $namespace) {
		foreach ($xml->children($namespace) as $childXml) {
			//recurse into child nodes
			$childArray = xmlToArray($childXml, $options);
			//each() was removed in PHP 8, so read the single key/value pair directly
			$childTagName = key($childArray);
			$childProperties = current($childArray);

			//replace characters in tag name
			if ($options['keySearch']) $childTagName =
					str_replace($options['keySearch'], $options['keyReplace'], $childTagName);
			//add namespace prefix, if any
			if ($prefix) $childTagName = $prefix . $options['namespaceSeparator'] . $childTagName;

			if (!isset($tagsArray[$childTagName])) {
				//only entry with this key
				//test if tags of this type should always be arrays, no matter the element count
				$tagsArray[$childTagName] =
						in_array($childTagName, $options['alwaysArray']) || !$options['autoArray']
						? array($childProperties) : $childProperties;
			} elseif (
				is_array($tagsArray[$childTagName]) && array_keys($tagsArray[$childTagName])
				=== range(0, count($tagsArray[$childTagName]) - 1)
			) {
				//key already exists and is integer indexed array
				$tagsArray[$childTagName][] = $childProperties;
			} else {
				//key exists so convert to integer indexed array with previous value in position 0
				$tagsArray[$childTagName] = array($tagsArray[$childTagName], $childProperties);
			}
		}
	}

	//get text content of node
	$textContentArray = array();
	$plainText = trim((string)$xml);
	if ($plainText !== '') $textContentArray[$options['textContent']] = $plainText;

	//stick it all together
	$propertiesArray = !$options['autoText'] || $attributesArray || $tagsArray || ($plainText === '')
			? array_merge($attributesArray, $tagsArray, $textContentArray) : $plainText;

	//return node as array
	return array(
		$xml->getName() => $propertiesArray
	);
}

The function takes a SimpleXMLElement object and returns an associative array. Crucially, it handles namespaces, which none of the examples I could find online did.
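
To see what the namespace handling relies on, here is a minimal sketch of the same SimpleXML calls used on their own. The inline document, the dc prefix and its URI are made up for illustration; only getDocNamespaces(), attributes(), children() and getName() come from the function above.

```php
<?php
// Illustrative document: the "dc" prefix and its URI are assumptions for this sketch.
$xml = simplexml_load_string(
    '<item xmlns:dc="http://example.com/dc" dc:lang="en" id="7">'
    . '<title>Hello</title><dc:creator>Abi</dc:creator></item>'
);

// Namespaces declared on the root element, keyed by prefix.
$namespaces = $xml->getDocNamespaces();
$namespaces[''] = null; // same trick as xmlToArray(): also visit un-prefixed nodes

$found = array();
foreach ($namespaces as $prefix => $uri) {
    // attributes() and children() only return nodes from the namespace asked for
    foreach ($xml->attributes($uri) as $name => $value) {
        $found[] = '@' . ($prefix ? $prefix . ':' : '') . $name;
    }
    foreach ($xml->children($uri) as $child) {
        $found[] = ($prefix ? $prefix . ':' : '') . $child->getName();
    }
}
sort($found);
print_r($found);
```

Without the loop over namespaces, a plain foreach over $xml->children() would skip the dc:creator element entirely, which is the gap in most of the snippets found online.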

Example

A complete usage example looks like this:

$xmlNode = simplexml_load_file('example.xml');
$arrayData = xmlToArray($xmlNode);
echo json_encode($arrayData);

Given an input XML file such as this:

<?xml version='1.0' ?>
<company>
	<name>Outlandish Ideas</name>
	<link href="http://outlandishideas.co.uk">Website</link>
	<person>Abi</person>
	<person>Harry</person>
	<person>Rasmus</person>
	<person>Tamlyn</person>
	<address street="yes">
		<street>Longford Street</street>
		<city>London</city>
	</address>
</company>

It produces output such as this:

{
    "company": {
        "name": "Outlandish Ideas",
        "link": {
            "@href": "http://outlandishideas.co.uk",
            "$": "Website"
        },
        "person": ["Abi", "Harry", "Rasmus", "Tamlyn"],
        "address": {
            "@street": "yes",
            "street": "Longford Street",
            "city": "London"
        }
    }
}

Notes:

  • Attributes are given a prefix to distinguish them from child nodes e.g. “@street” and “street”
  • If a node has text content as well as child nodes or attributes, that text content is stored in a key e.g. “$”
  • If a node has multiple children with the same tag name, they are converted into a numerically indexed array e.g. “person”

The function also accepts an array of options as the second parameter. Two of these options bear a little explanation.

autoArray and autoText

In short: leave these as true for more readable output; set to false for more consistently parseable output.

XML makes life really hard for programmers because it allows attributes, text content and multiple child nodes with the same name all at the same time. That means that when it comes to representing an XML document as normal data structures (trees, hash tables, arrays, objects etc.) we’re forced to make some decisions.

In the above example, the text of the <name> element is assigned directly to the name key but the text of the <link> element is assigned to the $ key because of the href attribute. Now assume another XML file for Another Company contains this line: <name parentCompany="Big Corp">Another Company</name>. In that case, instead of a plain string as in {"name": "Outlandish Ideas"} you would get {"name": {"@parentCompany": "Big Corp", "$": "Another Company"}}. If this is a problem for you, set autoText to false and text content will always be assigned to a $ key, whether or not the element has attributes (see example below).
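
If you keep autoText on, the code reading the result has to cope with both shapes. One way to do that is a small normalising helper. This is only a sketch: nodeText() is a hypothetical name, not part of xmlToArray(), and it assumes the default textContent key of $.

```php
<?php
// Hypothetical helper, not part of xmlToArray(): returns an element's text
// whether the converter produced a bare string or an array with a text key.
function nodeText($node, $textKey = '$') {
    if (is_string($node)) {
        return $node;           // autoText collapsed the element to a string
    }
    if (is_array($node) && isset($node[$textKey])) {
        return $node[$textKey]; // text stored alongside attributes or children
    }
    return '';                  // element had no text content
}

$plain = 'Outlandish Ideas';
$withAttribute = array('@parentCompany' => 'Big Corp', '$' => 'Another Company');

echo nodeText($plain), "\n";
echo nodeText($withAttribute), "\n";
```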

In the above example, again, we turn the <person> elements into an array because there are several of them. But what about a company with only one person? How would the function know what to do? One option is to use the alwaysArray option to provide a list of tag names which should always be treated as arrays. Another is to turn autoArray off, which converts all tags into arrays, regardless of how many times they appear.
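
A third possibility, if you would rather leave the options alone, is to normalise on the reading side. The sketch below uses a hypothetical asList() helper (not part of xmlToArray()) that applies the same "integer indexed array" test as the function itself.

```php
<?php
// Hypothetical helper, not part of xmlToArray(): always returns a list of items,
// whether the tag appeared once (single value) or several times (indexed array).
function asList($value) {
    // Same check xmlToArray() uses: keys are exactly 0..n-1
    if (is_array($value) && array_keys($value) === range(0, count($value) - 1)) {
        return $value;
    }
    // A single string or a single associative node becomes a one-item list
    return array($value);
}

$onePerson   = 'Abi';
$manyPeople  = array('Abi', 'Harry', 'Rasmus', 'Tamlyn');
$singleNode  = array('@street' => 'yes', 'city' => 'London');

foreach (asList($onePerson) as $person) {
    echo $person, "\n";
}
echo count(asList($manyPeople)), "\n";
echo count(asList($singleNode)), "\n";
```

The trade-off is that every consumer has to remember to call the helper, whereas alwaysArray fixes the shape once at conversion time.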

Here’s the output for that XML file again, but this time with autoArray and autoText turned off:

{
    "company": {
        "name": [{
            "$": "Outlandish Ideas"
        }],
        "link": [{
            "@href": "http://outlandishideas.co.uk",
            "$": "Website"
        }],
        "person": [{
            "$": "Abi"
        }, {
            "$": "Harry"
        }, {
            "$": "Rasmus"
        }, {
            "$": "Tamlyn"
        }],
        "address": [{
            "@street": "yes",
            "street": [{
                "$": "Longford Street"
            }],
            "city": [{
                "$": "London"
            }]
        }]
    }
}

As you can see, it’s much more verbose and harder to read. However, the data structure is less dependent on the exact content of the document.

Finally, here’s a real world example of taking the RSS feed from this site and inserting it into a MongoDB collection:

$xmlNode = simplexml_load_file('http://outlandishideas.co.uk/feed/');
$arrayData = xmlToArray($xmlNode, array(
	'alwaysArray' => array('item', 'category'),
	'keySearch' => '.', //MongoDB doesn't allow dots in keys
	'keyReplace' => '_' //so replace with underscores
));
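//the legacy Mongo class is gone from current PHP; this assumes the mongodb/mongodb Composer library
require 'vendor/autoload.php';
$m = new MongoDB\Client();
$m->test->feeds->insertOne($arrayData);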
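
Because keySearch and keyReplace are handed straight to str_replace(), they can also be arrays. The sketch below shows that behaviour on its own; the tag names are invented for illustration.

```php
<?php
// Invented tag names, as they might appear in a feed.
$names = array('media.content', 'pub-date', 'title');

// An array of search strings with a single replacement string:
// every match of any search string is replaced by '_'.
$search  = array('.', '-');
$replace = '_';

// str_replace() accepts an array subject and returns an array.
$clean = str_replace($search, $replace, $names);
print_r($clean);
```

Note that in xmlToArray() the replacement is applied to tag and attribute names only, before the attribute prefix and namespace prefix are added, so the @ prefix and the $ text key are not touched by it.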

That’s all

This code is free for anyone to use. We hope you like it. Let us know if you found it useful.

Known issues

  • The order of elements is not preserved.
  • Namespace URIs are not preserved; only the prefixes appear in the keys.
  • The namespace prefix of the root element (if any) is not preserved.
  • autoArray=false doesn’t create an array for the root element.
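
For the last issue, one possible workaround is to wrap the root value yourself after conversion. This is a sketch only: wrapRoot() is a hypothetical helper, and the sample array stands in for the result of xmlToArray() with autoArray set to false.

```php
<?php
// Hypothetical helper, not part of xmlToArray(): wraps the root element's
// properties in a one-item list so the root matches the shape of its children.
function wrapRoot(array $converted) {
    $rootName = key($converted);       // xmlToArray() returns exactly one key
    $rootProperties = current($converted);
    return array($rootName => array($rootProperties));
}

// Stand-in for xmlToArray($xmlNode, array('autoArray' => false, 'autoText' => false))
$converted = array(
    'company' => array(
        'name' => array(array('$' => 'Outlandish Ideas')),
    ),
);

echo json_encode(wrapRoot($converted)), "\n";
```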