diffmk

Name

diffmk — Calculate the differences between two XML documents

Synopsis

diffmk [--debug] [--verbose] [--output filename ]
[--showdelete | --noshowdelete] [--includedeletedelements | --noincludedeletedelements]
[--ignorewhitespace | --noignorewhitespace] [--usechanged | --nousechanged]
[--diff {text | element | both} ] [--doctype {docbook | xmlspec} ]
[--wrapper {element-name}] [--attribute {attribute-name}]
[--changed {attribute-value}] [--added {attribute-value}] [--deleted {attribute-value}]
file1.xml file2.xml [output.xml]

Description

The diffmk command loads two [XML] documents, compares them, and produces a third document. The third document is the second document (including the document type declaration and internal subset) augmented with additional markup to identify differences between the first and second documents.

The most common application of this command is to identify the changes made between two different revisions of the same document. In this case, the document produced by diffmk identifies the differences (additions, deletions, and changes) that have occurred between the two revisions provided.

The core of this utility is provided by the Perl [Algorithm::Diff] module, which does the actual work of calculating the differences.

Identifying Differences in XML

After diffmk has identified the differences between two XML documents, it must annotate the result to indicate where these differences occur.

This version of diffmk does so by adding an attribute to elements that have changed. In addition, it may insert new elements, if necessary.

The Difference Attribute

If your schema does not already have a common “diff” attribute, you will have to add one. The DTD declaration for such an attribute generally looks like this:

        diff    (changed
                |added
                |deleted
                |off)           #IMPLIED

In [XML Schema], the declaration might be:

  <attribute name='diff'>
    <simpleType>
      <restriction base='string'>
        <enumeration value='changed'/>
        <enumeration value='added'/>
        <enumeration value='deleted'/>
        <enumeration value='off'/>
      </restriction>
    </simpleType>
  </attribute>

You can control the name of the attribute that diffmk adds with the --doctype and --attribute arguments.

The Difference Elements

A slightly more complicated case arises when diffmk must insert a new element. This occurs when a text node must be identified as somehow different.

If you are working with well-formed XML documents and do not care about validity, you can simply identify the name of the element that you would like to use for this purpose. More likely, you want the resulting documents to be valid. For that purpose, you will need to identify the document type that you are using and modify the diffmk.xml control file to use the correct elements in context.

The diffmk.xml Control File

In order to facilitate the use of diffmk with different schemas, the necessary element and attribute information is loaded from an external control file.

This control file is an XML document conforming to the diffmk DTD or XML Schema.

The control file consists essentially of a set of doctype elements:

<doctype name="docbook"
         attribute="revisionflag"
         changed="changed"
         added="added"
         deleted="deleted">
  <wrapper element="phrase"/>
  <wrapper parent="article" element="para"/>
  <!--...-->
</doctype>

From the command line, the doctype is identified by its name. The attribute, changed, added, and deleted attributes identify the name of the difference attribute and the values of that attribute for changed, added, and deleted content, respectively.

Within the doctype, each wrapper identifies the name of the element that should be used to wrap content in a given context. If no parent is provided, then the element specified will be used by default. If a parent is provided, then the element specified will only be used in that context.
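
The lookup rule described above can be sketched in Python. This is an illustration only, not diffmk's own code (diffmk is written in Perl). The function names load_doctype and wrapper_for are hypothetical, and the namespace on the root element is assumed from the diffmk DTD.

```python
import xml.etree.ElementTree as ET

# A small control file; the root element and namespace follow the diffmk DTD.
CONTROL = """\
<diffmk xmlns="http://www.sun.com/xml/diffmk">
  <doctype name="docbook" attribute="revisionflag"
           changed="changed" added="added" deleted="deleted">
    <wrapper element="phrase"/>
    <wrapper parent="article" element="para"/>
  </doctype>
</diffmk>
"""

NS = "{http://www.sun.com/xml/diffmk}"


def load_doctype(control_xml, name):
    """Return (settings, wrappers) for the named doctype, or None if absent."""
    root = ET.fromstring(control_xml)
    for doctype in root.iter(NS + "doctype"):
        if doctype.get("name") == name:
            settings = {key: doctype.get(key)
                        for key in ("attribute", "changed", "added", "deleted")}
            # The key None holds the default wrapper (a wrapper with no parent).
            wrappers = {w.get("parent"): w.get("element")
                        for w in doctype.iter(NS + "wrapper")}
            return settings, wrappers
    return None


def wrapper_for(wrappers, parent):
    """Pick the wrapper for a parent element, falling back to the default."""
    return wrappers.get(parent, wrappers.get(None))
```

With this control file, text that is a direct child of article would be wrapped in para, and text anywhere else would be wrapped in phrase.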

Output Validity

By providing reasonable doctype parameters for the schema in use, the output of diffmk can usually be made valid. For the purposes of validity, the doctype declaration and internal subset of the second document are used for the result document. Note, however, that entity references are expanded before doing the comparison, so the content of the internal subset is often irrelevant.

Some combinations of options may produce documents that cannot be valid. For example, if you've changed a section title and you specify --showdelete, --includedeletedelements, and --nousechanged, the result will include a section with two title elements (one marked “deleted” and another marked “added”). In many document schemas, that's invalid.
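
A quick check for the duplicated-title situation can be sketched in Python. This is only an illustration and no substitute for a validating parser. The function name duplicated_children is hypothetical, and the sample document is invented for the example.

```python
import xml.etree.ElementTree as ET


def duplicated_children(root, child_tag="title"):
    """Return the tags of elements with more than one direct child_tag child."""
    return [elem.tag for elem in root.iter()
            if len(elem.findall(child_tag)) > 1]


# Invented sample resembling the result described in the text.
SAMPLE = """\
<article>
  <section>
    <title revisionflag="deleted">Old Title</title>
    <title revisionflag="added">New Title</title>
    <para>Some text.</para>
  </section>
</article>
"""

problems = duplicated_children(ET.fromstring(SAMPLE))
```

Here problems lists the section element, since it carries two title children.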

Options

Each of the following options may be abbreviated to the shortest name that is unambiguous; for example, --deb is sufficient for the --debug option (a shorter prefix would also match --deleted).
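
The abbreviation rule can be illustrated with a short Python sketch. It is not diffmk's actual option parser. The function name expand is hypothetical, and the option list is taken from the synopsis.

```python
OPTIONS = [
    "debug", "verbose", "output",
    "showdelete", "noshowdelete",
    "includedeletedelements", "noincludedeletedelements",
    "ignorewhitespace", "noignorewhitespace",
    "usechanged", "nousechanged",
    "diff", "doctype", "wrapper", "attribute",
    "changed", "added", "deleted",
]


def expand(prefix, options=OPTIONS):
    """Expand an abbreviated option name, or raise ValueError."""
    if prefix in options:
        return prefix  # an exact name always wins
    matches = [name for name in options if name.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError("unknown option: --" + prefix)
    raise ValueError("ambiguous option: --%s (%s)" % (prefix, ", ".join(matches)))
```

For example, expand("deb") yields "debug" and expand("v") yields "verbose", while expand("d") raises an error because several option names begin with that letter.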

--debug

Enables debugging output.

--verbose

Provides more verbose, informative messages.

--diff difftype

Selects the type of difference to calculate. Possible values for difftype are element, text, or both.

If an element-only diff is performed, then the structure of the document is compared without concern for text content. If a text-only diff is performed, then only text nodes are compared. Specifying both, remarkably enough, considers both element and text nodes for the purposes of comparison.
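
One way to picture the three modes is as different flattenings of the document into a sequence of items to compare. The Python sketch below is an assumption about the general idea, not a description of diffmk's internal representation, and the function name flatten is hypothetical.

```python
import xml.etree.ElementTree as ET


def flatten(elem, difftype="both"):
    """Flatten an element tree into comparable tokens in document order."""
    want_elements = difftype in ("element", "both")
    want_text = difftype in ("text", "both")
    tokens = []

    def add_text(text):
        # Skip whitespace-only text; normalize the rest.
        if want_text and text and text.strip():
            tokens.append(("text", " ".join(text.split())))

    def visit(node):
        if want_elements:
            tokens.append(("start", node.tag))
        add_text(node.text)
        for child in node:
            visit(child)
            add_text(child.tail)  # text following the child's end tag
        if want_elements:
            tokens.append(("end", node.tag))

    visit(elem)
    return tokens


para = ET.fromstring(
    "<para>This is para 2 <emphasis>with emphasis</emphasis> in it.</para>")
```

With difftype set to element, only the start and end items for para and emphasis remain; with text, only the three text items remain; with both, all seven items appear in document order.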

--output filename

Write the resulting XML document to filename.

--showdelete, --noshowdelete

If --showdelete is specified, deleted content is identified in the result document. If --noshowdelete is specified, the result document does not contain any reference to deleted text.

--showdelete is the default.

--includedeletedelements, --noincludedeletedelements

If --includedeletedelements is specified, deleted elements are preserved in the result document. If --noincludedeletedelements is specified, the result document will not contain the deleted elements. In particular, note that this may cause the text content of deleted elements to “migrate” slightly.

Specifying --noincludedeletedelements avoids the potential validity problems associated with multiple title elements and other elements that cannot be duplicated.

--noincludedeletedelements is the default.

--ignorewhitespace, --noignorewhitespace

If --ignorewhitespace is specified, all text node content is normalized with respect to whitespace before comparison. Leading and trailing whitespace is removed and all internal sequences of whitespace are replaced by a single blank. This prevents changes in line breaks from appearing as significant changes.

--ignorewhitespace is the default.
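
The normalization described for --ignorewhitespace can be sketched in Python as follows. The function name normalize_space is hypothetical, and treating only the four XML whitespace characters as whitespace is an assumption.

```python
import re

# Runs of XML whitespace characters: space, tab, carriage return, line feed.
_XML_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(text):
    """Collapse whitespace runs to one blank and trim both ends."""
    return _XML_WS.sub(" ", text).strip(" ")
```

Under this rule, "This is para 3." compares equal whether or not a line break falls after "is".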

--usechanged, --nousechanged

Internally, diffmk sees the differences between the two documents as the smallest number of deletes that would have to be made to each document in order to get the same result document. Effectively, deletes from the first document are really deletes, and deletes from the second document are additions.

If --usechanged is specified, parallel deletes from both documents (in other words, deletions followed by additions) are translated into “changes”. If --nousechanged is specified, they are reported as separate deletions and additions.

--usechanged is the default.
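
The effect of --usechanged can be illustrated with Python's standard difflib module. Note that difflib is a stand-in here: diffmk itself relies on the Perl Algorithm::Diff module, whose matching may differ in detail. The function name classify is hypothetical.

```python
from difflib import SequenceMatcher


def classify(old, new, usechanged=True):
    """Label each item as same, deleted, added, or changed, in result order."""
    result = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result += [("same", item) for item in new[j1:j2]]
        elif tag == "replace" and usechanged:
            # A deletion paired with an addition is reported as one change.
            result += [("changed", item) for item in new[j1:j2]]
        else:
            # Covers delete, insert, and replace when usechanged is off;
            # for a pure delete or insert one of the two slices is empty.
            result += [("deleted", item) for item in old[i1:i2]]
            result += [("added", item) for item in new[j1:j2]]
    return result


old = ["para 1", "para 4", "para 5"]
new = ["para 1", "different para 4", "para 5"]
```

With usechanged enabled, the middle item is reported once as changed. With it disabled, the old item is reported as deleted and the new one as added.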

--doctype doctypename

The elements and attributes used to identify the changes in the result document are loaded from the doctype named doctypename in the diffmk.xml Control File.

--attribute name

The attribute named name is used to identify differences in the result document.

--changed value

Changes in the result document are identified with the difference attribute value value.

--added value

Additions in the result document are identified with the difference attribute value value.

--deleted value

Deletions in the result document are identified with the difference attribute value value.

--wrapper element

The element named element is used as the wrapper element to identify differences in the result document.

Examples

Given the following two documents:

Example 1. The document test1.xml

<!DOCTYPE article 
  PUBLIC "-//Norman Walsh//DTD Simplified DocBook XML V4.1.2.3//EN"
  "http://nwalsh.com/docbook/simple/4.1.2.3/sdocbook.dtd">
<article>
<title>A Test Document</title>
<para>This is para 1.</para>
<para>This is para 2 <emphasis>with emphasis</emphasis> in it.</para>
<para>This is para 3.</para>
<para id="p4">This is para 4.</para>
<para id="p5">This is para 5.</para>
<para>This is para 6.</para>
<para>This is para 7.</para>
<para>This is para 8.</para>
<para>This is para 9.</para>
</article>


Example 2. The document test2.xml

<!DOCTYPE article 
  PUBLIC "-//Norman Walsh//DTD Simplified DocBook XML V4.1.2.3//EN"
  "http://nwalsh.com/docbook/simple/4.1.2.3/sdocbook.dtd">
<article><title>A Contrived Test Document</title>
<para>This is para 1.</para>
<para>This is para 2 <emphasis role="bold">with emphasis</emphasis> changed in it.</para>
<para>This is a new para 2b.</para>
<para>This is
para 3.</para>
<para id="p4">This is a different para 4.</para>
<para>This is a new para 4b.</para>
<para id="p5">This is para 5.</para>
<para>This is para 8.</para>
<para>This is para 9.</para>
</article>


The command:

diffmk --doctype docbook test1.xml test2.xml out.xml

will produce the following result:

Example 3. The document out.xml

<!DOCTYPE article PUBLIC "-//Norman Walsh//DTD Simplified DocBook XML V4.1.2.3//EN" "http://nwalsh.com/docbook/simple/4.1.2.3/sdocbook.dtd">
<?diffmk version='0.1'
    oldfile='test1.xml'
    newfile='test2.xml'
    attribute='revisionflag'
    changed='changed'
    added='added'
    deleted='deleted'
    diff='both'
    showdelete='1'
    includedeletedelements='0'
    ignorewhitespace='1'
    usechanged='1'
?>
<article><title revisionflag="changed">A Contrived Test Document</title>
<para>This is para 1.</para>
<para>This is para 2 <emphasis role="bold" revisionflag="changed">with emphasis</emphasis><phrase revisionflag="changed"> changed in it.</phrase></para>
<para revisionflag="added">This is a new para 2b.</para>
<para revisionflag="added">This is
para 3.</para>
<para id="p4" revisionflag="changed">This is a different para 4.</para>
<para revisionflag="deleted">This is para 5.</para><para revisionflag="changed">This is a new para 4b.</para>
<para id="p5" revisionflag="changed">This is para 5.</para>
<para>This is para 8.</para>
<para>This is para 9.</para>
</article>
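
A result document can be summarized by tallying the values of the difference attribute. The Python sketch below is not part of diffmk. The function name count_flags is hypothetical, and the sample is a shortened, invented document rather than the full out.xml above.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_flags(xml_text, attribute="revisionflag"):
    """Count how often each difference attribute value occurs."""
    root = ET.fromstring(xml_text)
    return Counter(elem.get(attribute)
                   for elem in root.iter()
                   if elem.get(attribute) is not None)


# Shortened sample in the style of a diffmk result document.
SAMPLE = """\
<article><title revisionflag="changed">A Contrived Test Document</title>
<para>This is para 1.</para>
<para revisionflag="added">This is a new para 2b.</para>
<para revisionflag="added">This is a new para 4b.</para>
</article>
"""

flags = count_flags(SAMPLE)
```

For this sample, flags records two "added" elements and one "changed" element. A different attribute name can be passed when the doctype uses something other than revisionflag.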

Environment Variables

HOME

The first place that diffmk looks for the diffmk.xml Control File is .diffmk.xml in the user's home directory.
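
The lookup can be sketched in Python as follows. Only the first location is stated here, so the sketch covers that one location; the function name home_control_file is hypothetical.

```python
import os


def home_control_file(environ=os.environ):
    """Return the path of .diffmk.xml under HOME, or None if HOME is unset."""
    home = environ.get("HOME")
    if not home:
        return None
    return os.path.join(home, ".diffmk.xml")
```

Passing the environment as a parameter keeps the function easy to exercise without changing the real HOME variable.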

Bugs

If --ignorewhitespace is specified, whitespace changes are ignored even in elements that have xml:space set to preserve or are otherwise known to be line specific.

There's no way to use diffmk to maintain a running history of differences. Comparing the differences between documents that already contain “diff” markup is bound to lead to confusing, perhaps misleading, results.

The diffmk DTD

The format of the diffmk.xml Control File is constrained by the following DTD:

<!ELEMENT diffmk (doctype*)>
<!ATTLIST diffmk
    xmlns       CDATA   #FIXED
                "http://www.sun.com/xml/diffmk"
    xmlns:xsi   CDATA   #FIXED
                "http://www.w3.org/2000/10/XMLSchema-instance"
    xsi:schemaLocation  CDATA   #IMPLIED
>

<!ELEMENT doctype (wrapper*)>
<!ATTLIST doctype
    name        CDATA   #REQUIRED
    attribute   CDATA   #IMPLIED
    changed     CDATA   #IMPLIED
    added       CDATA   #IMPLIED
    deleted     CDATA   #IMPLIED
>

<!ELEMENT wrapper EMPTY>
<!ATTLIST wrapper
    parent      CDATA   #IMPLIED
    element     CDATA   #REQUIRED
>

The diffmk XML Schema

The format of the diffmk.xml Control File is constrained by the following XML Schema:

<!DOCTYPE schema SYSTEM "/share/doctypes/xmlschema/XMLSchema.dtd" [
<!ENTITY % schemaAttrs "
    xmlns:xsd   CDATA   #IMPLIED
    xmlns:diffmk    CDATA   #IMPLIED
">
]>

<schema xmlns='http://www.w3.org/2000/10/XMLSchema'
        targetNamespace='http://www.sun.com/xml/diffmk'
        xmlns:xsd='http://www.w3.org/2000/10/XMLSchema'
        xmlns:diffmk='http://www.sun.com/xml/diffmk'
        elementFormDefault='qualified'>

<element name='diffmk'>
  <complexType>
    <choice minOccurs='0' maxOccurs='unbounded'>
      <element ref='diffmk:doctype'/>
    </choice>
  </complexType>
</element>

<element name='doctype'>
  <complexType>
    <choice minOccurs='0' maxOccurs='unbounded'>
      <element ref='diffmk:wrapper'/>
    </choice>
    <attribute name='name' type='string' use='required'/>
    <attribute name='attribute' type='string'/>
    <attribute name='changed' type='string'/>
    <attribute name='added' type='string'/>
    <attribute name='deleted' type='string'/>
  </complexType>
</element>

<element name='wrapper'>
  <complexType>
    <complexContent>
      <restriction base='xsd:anyType'>
        <attribute name='parent' type='string'/>
        <attribute name='element' type='string' use='required'/>
      </restriction>
    </complexContent>
  </complexType>
</element>

</schema>

Author

Norman Walsh, <Norman.Walsh@East.Sun.Com>
Sun Microsystems, Inc.
One Network Drive
MS BUR02-201
Burlington, MA 01803-0902

Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, without any conditions or restrictions. This software is provided “as is” without express or implied warranty.

References

[XML] Bray, Tim, et al., editors. Extensible Markup Language (XML) 1.0. World Wide Web Consortium, 1998.

[HM76] Hunt, J. W. and M. D. McIlroy. “An algorithm for differential file comparison”. Computing Science Technical Report 41. AT&T Bell Laboratories, Murray Hill, N.J. 1976.

[HS77] Hunt, J. W. and T. G. Szymanski. “A fast algorithm for computing longest common subsequences”. Communications of the ACM. vol. 20, no. 5, pp. 350-353, 1977.

[Algorithm::Diff] Konz, Ned. Algorithm::Diff. Comprehensive Perl Archive Network.

[XML Schema] Thompson, Henry S., et al., editors. XML Schema Part 1: Structures. World Wide Web Consortium, 2000.