Dev Blog

One of the tasks I ran into was generating a set of Word documents from a template. The template itself was a Microsoft Word document containing special tokens enclosed in curly braces. The tokens were typed directly in the word processor as a word wrapped in braces, for example {address}, and were later replaced en masse with real values from CSV data. For each row of the CSV data, the tokens were replaced with the corresponding values and one document was created. This article focuses on the DOCX format, but a similar technique can be used to process ODT and similar formats.




Format of Document

Office documents are in fact a set of files and folders compressed into a zip archive. They merely use a different extension, so that the operating system recognizes them as documents and associates them with the right application. As far as templating is concerned, the contents of the archive differ slightly between ODT and DOCX. Our points of interest are the XML files contained in the archive. Although the XML structure of each format is different, we will use simple string replacement, and we will hit the same pitfalls in both formats.
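To see this for yourself, the archive can be inspected without extracting it. The following is a minimal sketch; listArchiveEntries is a hypothetical helper name, not part of any library, and it only relies on ZipArchive's numFiles property and getNameIndex method.

```php
<?php
// Hypothetical helper: list the entry names stored in a DOCX or ODT file.
function listArchiveEntries(string $documentPath): array
{
    $zip = new ZipArchive();
    // open() returns true on success, or an integer error code on failure
    if ($zip->open($documentPath) !== true)
    {
        throw new RuntimeException('Not a readable zip archive: ' . $documentPath);
    }
    $names = [];
    for ($i = 0; $i < $zip->numFiles; $i++)
    {
        $names[] = $zip->getNameIndex($i);
    }
    $zip->close();
    return $names;
}
```

For a DOCX file the list typically includes word/document.xml and [Content_Types].xml, while an ODT file typically includes content.xml and styles.xml.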

Workflow

At first sight, the workflow for replacing tokens in documents seems very simple and consists of only a few steps:

  1. Unzip archive
  2. Find all XML files
  3. Replace tokens in each XML file
  4. Zip back archive 
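These steps run once per CSV row, so each row first has to become a set of name and value pairs. A minimal sketch of that part follows, assuming the first CSV row holds column names that match the token names. The helper name readCsvRows is hypothetical.

```php
<?php
// Hypothetical helper: read a CSV file into one name => value array per row.
// Assumption: the first row holds column names matching the token names.
function readCsvRows(string $csvPath): array
{
    $handle = fopen($csvPath, 'r');
    if ($handle === false)
    {
        throw new RuntimeException('Cannot open ' . $csvPath);
    }
    $header = fgetcsv($handle, 0, ',', '"', '\\');
    $rows = [];
    if ($header === false)
    {
        // Empty file: nothing to generate
        fclose($handle);
        return $rows;
    }
    while (($fields = fgetcsv($handle, 0, ',', '"', '\\')) !== false)
    {
        // Skip blank lines and rows whose column count differs from the header
        if ($fields === [null] || count($fields) !== count($header))
        {
            continue;
        }
        $rows[] = array_combine($header, $fields);
    }
    fclose($handle);
    return $rows;
}
```

Each element of the returned array can then drive one pass through the four steps, producing one document per row.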

Why were not all placeholders (tokens) replaced?

Sometimes the word processor inserts XML markup inside a word. The markup is not visible in the editor, but it is stored in the file. When we try to replace such a token, it does not match, because there is extra markup between the opening and closing brace. To fix this, we need to unzip the document and locate the problematic file. The extra markup can be removed in an XML editor, or even in a plain text editor. Just make sure to remove matching pairs of tags from within the token, then zip the file back into the archive.
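Locating the problematic tokens by hand can be tedious, so a small check can help. The sketch below is one possible approach: it compares the raw markup with the text that remains once all tags are dropped. The function name findSplitTokens is hypothetical, and the code assumes token names contain no whitespace. A token that appears intact in one place and split in another is not reported.

```php
<?php
// Hypothetical helper: report token names that are visible in the text
// but do not appear as one uninterrupted string in the raw XML.
function findSplitTokens(string $xml, array $tokenNames): array
{
    // Text content only: drop every tag, then the whitespace left between them
    $text = preg_replace('/<[^>]*>/', '', $xml);
    $text = preg_replace('/\s+/', '', $text);
    $split = [];
    foreach ($tokenNames as $name)
    {
        $token = '{' . $name . '}';
        // Present in the text but absent from the markup: the token is split
        if (strpos($text, $token) !== false && strpos($xml, $token) === false)
        {
            $split[] = $name;
        }
    }
    return $split;
}
```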

Example of a token that was split by the word processor:
<text:span text:style-name="T2">{</text:span>
<text:span text:style-name="T4">parcel</text:span>
<text:span text:style-name="T2">}</text:span>

To fix this token we need to remove the inner tags while keeping the XML balanced, i.e. every opening tag still has its closing pair. In this case we just need to remove all tags between the opening and closing curly braces. While that works here, there is no guarantee that the tags are always inserted this simply, so it is better to remove them manually or programmatically with an XML processor, to keep proper document structure and styling.

Fixed token:
<text:span text:style-name="T2">{parcel}</text:span>
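For the simple case shown above, the clean-up can be sketched with a regular expression. This is only an illustration under strong assumptions: curly braces are used for tokens only, token names contain no whitespace, and the tags between the braces close and reopen in matching pairs. It is not a substitute for a real XML processor, and mergeSplitTokens is a hypothetical name.

```php
<?php
// Hypothetical helper: remove tags and whitespace found between "{" and "}".
// Assumption: braces appear only around tokens, and the removed tags are
// balanced (each closing tag is followed by a matching opening tag).
function mergeSplitTokens(string $xml): string
{
    // "{", then any mix of plain characters and whole tags, then "}"
    $pattern = '/\{(?:[^{}<>]|<[^<>]*>)*\}/';
    $result = preg_replace_callback($pattern, function (array $match): string
    {
        $text = preg_replace('/<[^<>]*>/', '', $match[0]);
        return preg_replace('/\s+/', '', $text);
    }, $xml);
    // preg_replace_callback() returns null on error; keep the input in that case
    return $result ?? $xml;
}
```

Applied to the split {parcel} example, this returns the single span shown under "Fixed token". Note that the styling of the inner span (T4 in the example) is lost, which is one reason to prefer a manual fix when formatting matters.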

Unzipping archive

Unzipping an archive with PHP is pretty straightforward thanks to its object-oriented interface. We need to create a ZipArchive object, then open, extract and close:

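$zip = new ZipArchive();
// open() returns true on success, or an integer error code on failure
if($zip->open($filename) !== true)
{
    throw new RuntimeException('Cannot open ' . $filename);
}
$zip->extractTo($outputDir);
$zip->close();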

The extracted archive will contain the XML files in the root folder for the OpenOffice format, or in the word directory for the Word format.

Finding XML files

Depending on the format we extracted, we need to iterate over the XML files either in the root of the archive or in a subfolder, or simply walk recursively through all XML files. Because these are XML documents, curly braces other than our tokens should be rare, so plain string replacement is unlikely to hit anything else. We need to iterate over all the files because the document might be split into parts. For example, the document footer might be in a separate file. If we have tokens in the footer, we need to process the footer file too.

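Iterating over XML files:
// DirectoryIterator is not recursive: it lists only the entries directly in $workDir
$it = new DirectoryIterator($workDir);
foreach($it as $entry)
{
    // Skip everything that is not an XML file
    if($entry->getExtension() !== 'xml')
    {
        continue;
    }
    processFile($entry->getPathname(), $params);
}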

The processFile function is described in the next section. It can be wrapped in a class or simply inlined. For the sake of readability, a function name is used in this example.
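The recursive variant mentioned earlier can be sketched with PHP's SPL iterators. The function name findXmlFiles is hypothetical; the iterator classes and the SKIP_DOTS flag are standard.

```php
<?php
// Hypothetical helper: collect all *.xml files below $workDir, at any depth.
function findXmlFiles(string $workDir): array
{
    $files = [];
    $it = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($workDir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($it as $entry)
    {
        if ($entry->isFile() && $entry->getExtension() === 'xml')
        {
            $files[] = $entry->getPathname();
        }
    }
    // Sort for a stable processing order
    sort($files);
    return $files;
}
```

This avoids having to know in advance whether the XML files live in the root folder or in the word directory, at the cost of also visiting XML files that contain no tokens.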

Processing XML file

For each file in the extracted archive we need to open it, replace the tokens and write it back. This can be done with basic PHP functions.

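Example processing logic:
$result = file_get_contents($filename);

foreach($params as $name => $value)
{
    $token = sprintf('{%s}', $name);
    // Escape &, < and > so that a value cannot break the XML
    $value = htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    $result = str_replace($token, $value, $result);
}
$bytes = file_put_contents($filename, $result);
// file_put_contents() returns false on failure
assert($bytes > 0);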

This code snippet is the body of processFile. It might also contain the tag-clearing logic, but that is beyond the scope of this article.

Zipping back archive

PHP's built-in ZipArchive makes extracting very easy, but compressing back is somewhat confusing. I had trouble creating an archive that the word processor would open properly. I ended up calling the zip command from PHP on the processed folder.

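Example zipping logic:
// Zip from inside the processed folder, so that entry names are
// relative to the document root.
// $outZipPath should be absolute, because the working directory changes.
$cwd = getcwd();
chdir($sourcePath);
$output = escapeshellarg($outZipPath);
$command = sprintf('zip -rq %s .', $output);
shell_exec($command);
// Restore the original working directory
chdir($cwd);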

The zipped file needs to have the same extension as the input document, for example DOCX.
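For environments where the zip command is not available, a pure-PHP alternative might look like the sketch below. It is not the method used for this article and has not been checked against any particular word processor. It rests on the assumption that entry names must be relative to the document root and use forward slashes. The function name zipDirectory is hypothetical.

```php
<?php
// Hypothetical helper: pack the contents of $sourcePath into $outZipPath.
// Assumption: entry names relative to the document root, with forward
// slashes, are what the word processor expects.
function zipDirectory(string $sourcePath, string $outZipPath): void
{
    $root = realpath($sourcePath);
    if ($root === false)
    {
        throw new RuntimeException('No such directory: ' . $sourcePath);
    }
    $root = rtrim($root, DIRECTORY_SEPARATOR);
    $zip = new ZipArchive();
    if ($zip->open($outZipPath, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true)
    {
        throw new RuntimeException('Cannot create ' . $outZipPath);
    }
    $it = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($it as $entry)
    {
        if (!$entry->isFile())
        {
            continue;
        }
        // Strip the root prefix and normalize separators for the entry name
        $relative = substr($entry->getPathname(), strlen($root) + 1);
        $zip->addFile($entry->getPathname(), str_replace('\\', '/', $relative));
    }
    $zip->close();
}
```

If the resulting file does not open, falling back to the zip command shown above remains the safer choice.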

This article covers the essentials of replacing tokens in Word or OpenOffice documents. Real-world code should add extra checks, such as file existence and possibly document type detection. Depending on your needs, this logic can be extended further.