Dev Blog
One of the tasks I've stumbled upon was to generate a set of Word documents based on a template. The template itself was a Microsoft Word document with special tokens enclosed in curly braces. These were added directly in the word processor by typing a word wrapped in braces, for example {address}, which was later replaced en masse with the proper address from CSV data. For each row of CSV data, the tokens were replaced with the corresponding values and one document was created. This article is mostly based on the DOCX format; however, a similar technique can be used to process ODT and similar formats.
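The CSV side of this can be sketched as follows. This is only a minimal sketch, assuming the first CSV row holds the token names (address, parcel and so on) and that the file is comma-separated; readCsvRows is a hypothetical helper name, not part of any library.

```php
<?php
// Hypothetical helper: read a CSV file whose first row holds the token names
// and return one associative array (token name => value) per data row.
function readCsvRows(string $csvPath): array
{
    $handle = fopen($csvPath, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $csvPath");
    }
    // Delimiter, enclosure and escape are passed explicitly instead of relying on defaults
    $header = fgetcsv($handle, 0, ',', '"', '\\');
    if ($header === false) {
        fclose($handle);
        return [];
    }
    $rows = [];
    while (($fields = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        // Skip blank lines and rows whose column count does not match the header
        if ($fields === [null] || count($fields) !== count($header)) {
            continue;
        }
        $rows[] = array_combine($header, $fields);
    }
    fclose($handle);
    return $rows;
}
```

Each element of the returned array can then serve as the set of replacement values for one generated document.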
Format of Document
Office documents are in fact a bunch of files and folders compressed into a ZIP archive that simply has a different extension, so that the operating system can recognize it as a document and associate it with an application. The contents of the archive differ slightly between the ODT and DOCX formats as far as templating is concerned. Our points of interest are the XML files contained in the archive. Although the XML structure of each format is different, we will use simple string replacement, so we will hit the same pitfalls with each format.
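To see that a document really is a ZIP archive, its entries can be listed without extracting anything. The sketch below assumes the PHP zip extension is available; listXmlEntries is a hypothetical helper name.

```php
<?php
// Hypothetical helper: list the XML entries stored inside a DOCX or ODT file
// without extracting it, to show that the document is an ordinary ZIP archive.
function listXmlEntries(string $documentPath): array
{
    $zip = new ZipArchive();
    if ($zip->open($documentPath) !== true) {
        throw new RuntimeException("Cannot open $documentPath as a ZIP archive");
    }
    $entries = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        // Keep only entries whose name ends with the .xml extension
        if ($name !== false && strtolower(pathinfo($name, PATHINFO_EXTENSION)) === 'xml') {
            $entries[] = $name;
        }
    }
    $zip->close();
    return $entries;
}
```

Running it on a template before processing is a quick way to check which XML parts the document actually contains.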
Workflow
At first sight, the workflow for replacing tokens in documents seems very simple and consists of only a few steps:
- Unzip archive
- Find all XML files
- Replace tokens in each XML file
- Zip back archive
Why were not all placeholders (tokens) replaced?
Sometimes word processors insert XML markup into a word; it is not visible in the editor, but it is stored in the file. When we try to replace such a token, it will not match, because it contains extra markup between the opening and closing braces. To apply a fix, we need to unzip the document and locate the problematic file. The extra markup can be removed in an XML editor, or even in a plain text editor. Just make sure to remove matching tags from within the token, then zip the file back into the archive.
Example token which was split by word processor:
<text:span text:style-name="T2">{</text:span>
<text:span text:style-name="T4">parcel</text:span>
<text:span text:style-name="T2">}</text:span>
To fix this token we need to remove the inner tags while ensuring that the XML stays balanced, i.e. every opening tag still has its matching closing tag. In this case we just need to remove all tags between the opening and closing curly braces. While that works here, there is no guarantee that the tags are always added this simply, so it is better to remove them manually or programmatically with an XML processor to keep the document structure and styling intact.
Fixed token:
<text:span text:style-name="T2">{parcel}</text:span>
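For the simple case shown above, the clean-up can be sketched with a regular expression. This is the naive variant, not an XML-aware one: it assumes that token names consist only of letters, digits and underscores, and that the tags removed from inside a token are balanced. mergeSplitTokens is a hypothetical helper name.

```php
<?php
// Hypothetical helper: remove tags that the word processor inserted between
// an opening "{" and its closing "}". Naive by design: it assumes the removed
// tags are balanced and that token names use only letters, digits and "_".
function mergeSplitTokens(string $xml): string
{
    // Match "{", then any mix of name characters, whitespace and complete tags, then "}"
    $pattern = '/\{(?:[A-Za-z0-9_\s]|<[^<>]+>)*\}/';
    $result = preg_replace_callback($pattern, function (array $match): string {
        // Drop the tags, then any whitespace left between the fragments
        $text = preg_replace('/<[^<>]+>/', '', $match[0]);
        return preg_replace('/\s+/', '', $text);
    }, $xml);
    // preg_replace_callback() returns null on a regex error; keep the input then
    return $result ?? $xml;
}
```

Applied to the split example, this yields the fixed token shown above. Any styling carried by the removed inner tags is lost, so an XML processor remains the safer choice for documents with more complex markup.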
Unzipping archive
Unzipping an archive with PHP is pretty straightforward thanks to its object-oriented interface. We need to create a ZipArchive object, then open, extract and close:
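$zip = new ZipArchive();
// open() returns true on success and an error code otherwise
if($zip->open($filename) !== true)
{
    throw new RuntimeException("Cannot open $filename");
}
$zip->extractTo($outputDir);
$zip->close();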
The extracted archive will contain the XML files in the root folder if it is in OpenOffice format, or in the word directory if it is in Word format.
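Choosing the folder to scan can be sketched as a small lookup on the file extension. The function name xmlDirectoryFor and the decision to reject other extensions are assumptions made for illustration.

```php
<?php
// Hypothetical helper: given the original document name and the folder it was
// extracted to, return the folder that holds the XML parts to process.
function xmlDirectoryFor(string $documentPath, string $extractedDir): string
{
    $extension = strtolower(pathinfo($documentPath, PATHINFO_EXTENSION));
    $root = rtrim($extractedDir, '/\\');
    switch ($extension) {
        case 'docx':
            // Word format: XML parts live in the "word" sub-folder
            return $root . DIRECTORY_SEPARATOR . 'word';
        case 'odt':
            // OpenOffice format: XML parts live in the archive root
            return $root;
        default:
            throw new InvalidArgumentException("Unsupported document type: $extension");
    }
}
```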
Finding XML files
Depending on the format we extracted, we just need to iterate over the XML files in either the root of our archive or another folder, or simply recursively through all XML files. These are XML documents, so they should not contain curly braces other than our tokens. We need to iterate over the files because the document might be split into parts. For example, the document footer might be in a separate file. If we have tokens in the footer, we need to process the footer file too.
Iterating over XML files:
$it = new DirectoryIterator($workDir);
foreach($it as $entry)
{
    if($entry->getExtension() !== 'xml')
    {
        continue;
    }
    processFile($entry->getPathname(), $params);
}
The processFile function is described in the next section; it can be wrapped in a class or even inlined. For the sake of readability, a function name is used in this example.
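The recursive variant mentioned above could look roughly like this. It is a sketch using PHP's standard SPL iterators; findXmlFiles is a hypothetical helper name, and it returns the paths instead of processing them so that it can be combined with any processing function.

```php
<?php
// Hypothetical helper: collect every *.xml file below $workDir, including
// sub-folders, so that headers, footers and other parts are not missed.
function findXmlFiles(string $workDir): array
{
    $files = [];
    $it = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($workDir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($it as $entry) {
        if ($entry->isFile() && strtolower($entry->getExtension()) === 'xml') {
            $files[] = $entry->getPathname();
        }
    }
    // Sort for a stable, predictable processing order
    sort($files);
    return $files;
}
```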
Processing XML file
For each file in the extracted archive we need to open it, replace the tokens and write it back. This can be done with basic PHP functions.
Example processing logic:
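$result = file_get_contents($filename);
foreach($params as $name => $value)
{
    $token = sprintf('{%s}', $name);
    // Escape &, < and > so that a value cannot break the XML markup
    $safe = htmlspecialchars((string)$value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    $result = str_replace($token, $safe, $result);
}
// file_put_contents() returns false on failure
if(file_put_contents($filename, $result) === false)
{
    throw new RuntimeException("Cannot write $filename");
}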
This code snippet is the body of processFile. It might as well contain tag-clearing logic, but that is beyond the scope of this article.
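After processing, it can be useful to check whether any tokens were left unreplaced, for example because a CSV column was missing. The sketch below assumes that token names consist of letters, digits and underscores; findLeftoverTokens is a hypothetical helper name.

```php
<?php
// Hypothetical helper: return the names of tokens still present in the XML
// after replacement. Assumes token names use letters, digits and underscores.
function findLeftoverTokens(string $xml): array
{
    preg_match_all('/\{([A-Za-z0-9_]+)\}/', $xml, $matches);
    // $matches[1] holds the captured names; remove duplicates and reindex
    return array_values(array_unique($matches[1]));
}
```

Running it on a copy of the XML with all tags removed can also reveal tokens that were split by markup, provided no whitespace sits between the fragments.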
Zipping back archive
PHP's built-in ZipArchive makes extracting very easy; compressing back, however, is somewhat confusing. I had trouble creating an archive that the word processor would open properly. I ended up calling the zip command from PHP on the processed folder.
Example zipping logic:
$cwd = getcwd();
chdir($sourcePath);
$output = escapeshellarg($outZipPath);
$command = sprintf('zip -rq %s .', $output);
shell_exec($command);
chdir($cwd);
The zipped file name needs to have the same extension as the input document, for example DOCX.
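Since one document is produced per CSV row, the output name can be derived from the template name. The naming scheme below (template name, dash, zero-padded row number) is purely an assumption for illustration, as is the helper name outputPathFor.

```php
<?php
// Hypothetical helper: build the output file name for one CSV row, reusing the
// extension of the template so the operating system still recognises the file.
function outputPathFor(string $templatePath, string $outputDir, int $rowNumber): string
{
    $info = pathinfo($templatePath);
    // Keep the template's extension (for example "docx") if it has one
    $extension = isset($info['extension']) ? '.' . $info['extension'] : '';
    return rtrim($outputDir, '/\\') . DIRECTORY_SEPARATOR
        . sprintf('%s-%03d%s', $info['filename'], $rowNumber, $extension);
}
```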
This article covers the essentials of replacing tokens in Word or OpenOffice documents. In real usage, extra checks should be added for file existence, and possibly detection of the document type. Depending on your needs, this logic might be more or less extensive.