Architectures in an XML World

Published in Markup Languages: Theory and Practice (MIT Press), volume 3, number 4, Fall 2001. Original version published in the proceedings of Extreme Markup Languages 2001, Montreal, Canada, August 14-17, 2001.

Last revised 2002/07/23 13:40:40

Abstract

XML (Extensible Markup Language) developers have at their disposal a variety of tools for achieving schema reuse. An often-overlooked reuse method is the specification of architectures for creating and processing data. Experience with APEX, an architecture processing tool implemented using XSLT (Extensible Stylesheet Language Transformations), demonstrates that architectures can fulfill a role not well served by alternative approaches to reuse.


Table of Contents

1. Introduction
2. About Architectures
2.1. Architectural Forms and Architecture Support Attributes
2.2. Architectural Processing
2.3. Example of Basic Architecture Processing
3. Architectures and APEX
3.1. Using APEX
3.2. Customizing APEX
4. Taxonomy of XML Reuse Methods
5. Architectures Versus Other Reuse Methods
Bibliography

1. Introduction

Developers of markup languages have long recognized the importance of reuse. Since the early days of SGML (Standard Generalized Markup Language) [SGML], authors of DTDs (Document Type Definitions) have used parameter entities to help make markup declarations more reusable. Newer approaches to reuse run the gamut from the relatively simple concept of namespaces [Names] to more sophisticated methods such as the facilities available in the W3C's (World Wide Web Consortium's) XML Schema [XSchema] specification. As a result, XML developers today have at their disposal a variety of tools for achieving reuse.

An often-overlooked reuse method is the specification of architectures [AFDR] [Kimber] [Megginson] for creating and processing data. Architectures, alternatively referred to as “architectural forms” or “inheritable information architectures,” have been around since the mid-1990s. Although the architecture mechanism's invention predates the standardization of XML, architectures are still being used today — most notably in the ISO Topic Maps standard [TM] and in the W3C's XML Linking specification [XLink].

In this paper, I briefly present the architecture mechanism as it applies to XML processing. Next, I discuss APEX (Architectural Processor Employing XSLT), a tool implemented using XSLT (Extensible Stylesheet Language Transformations) [XSLT] for processing architectures. I conclude by discussing how architectures compare with some alternative reuse techniques.

2. About Architectures

Within the context of markup languages, an architecture is a collection of rules for creating and processing a class of documents. Architectures allow applications to:

2.3. Example of Basic Architecture Processing

Consider a simple architecture called inv for inventory processing. Suppose that software exists for processing data structured according to inv's syntax. Assume that data conforming to inv consists of an <item> element with a required ID attribute “id” and that <item> contains the elements <name>, <price>, and <quantity>. The following simple UML™ (Unified Modeling Language) [1] [UML] [Carlson] class diagram describes inv's XML syntax. The diagram uses <<element>> and <<attribute>> stereotypes to indicate whether a UML attribute denotes an XML element or an XML attribute.

inv class diagram

The following markup declarations define inv using DTD syntax. Alternatively, I could have used a non-DTD XML schema language such as W3C XML Schema or RELAX NG [RELAXNG] [RELAXNGT].

<!ELEMENT  item         (name, price, quantity)            >
<!ATTLIST  item
             id         ID                       #REQUIRED >
<!ELEMENT  name         (#PCDATA)                          >
<!ELEMENT  price        (#PCDATA)                          >
<!ELEMENT  quantity     (#PCDATA)                          >

Now suppose I want to create some XML data consisting of reproductions of works of art that I have, with each work of art having a unique identifier, title, artist, price, and quantity of reproductions on hand. The following UML class diagram specifies the artwork data's XML syntax:

Artwork class diagram.

Assume I have three copies of a painting, “Leapin' Lizards,” painted by “El Gecko,” with each copy selling for $15. This data can be represented as:

<art id="a1">
  <title>Leapin' Lizards</title>
  <artist>El Gecko</artist>
  <price>15</price>
  <quantity>3</quantity>
</art>

The following table shows the correspondence between the elements in my data and inv's architectural elements:

element     corresponding architectural element
art         item
title       name
artist      [no corresponding element]
price       price
quantity    quantity

In order to process my data using the software that already exists for inventory processing, I add a form attribute to my data. The form attribute is an architecture support attribute whose purpose is to provide the architecture engine with the information in the table above. My form attribute has the same name, inv, as the architecture name. With the form attribute added, the data for “Leapin' Lizards” looks like this:

<art id="a1" inv="item">
  <title inv="name">Leapin' Lizards</title>
  <artist>El Gecko</artist>
  <price inv="price">15</price>
  <quantity inv="quantity">3</quantity>
</art>

Although architecture support attributes add complexity to the data, hiding the complexity is easy. Because the form attribute values for the <art>, <title>, <price>, and <quantity> elements are the same for all works of art, these attribute values can be specified as defaults. Thus, the form attributes can be hidden from any architecture-unaware software tool. For example, suppose I have the following DTD with system identifier “art.dtd”:

<!ELEMENT  art          (title, artist, price, quantity)   >
<!ATTLIST  art
             inv        NMTOKEN              #FIXED "item"
             id         ID                       #REQUIRED >
<!ELEMENT  title        (#PCDATA)                          >
<!ATTLIST  title
             inv        NMTOKEN              #FIXED "name" >
<!ELEMENT  artist       (#PCDATA)                          >
<!ELEMENT  price        (#PCDATA)                          >
<!ATTLIST  price
             inv        NMTOKEN             #FIXED "price" >
<!ELEMENT  quantity     (#PCDATA)                          >
<!ATTLIST  quantity
             inv        NMTOKEN          #FIXED "quantity" >

Then I could specify the “Leapin' Lizards” data as:

<!DOCTYPE art SYSTEM "art.dtd">
<art id="a1">
  <title>Leapin' Lizards</title>
  <artist>El Gecko</artist>
  <price>15</price>
  <quantity>3</quantity>
</art>
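The effect of the defaulted form attributes can be checked with a short Python sketch. It assumes the declarations from art.dtd are copied into the document's internal subset, because the standard-library expat parser used here does not fetch external DTDs. The function name form_attributes is hypothetical.

```python
from xml.parsers import expat

# Assumption: art.dtd's declarations are inlined as an internal subset so
# that a non-validating, standard-library parser can see the #FIXED defaults.
DOC = """<!DOCTYPE art [
<!ELEMENT art (title, artist, price, quantity)>
<!ATTLIST art
  inv NMTOKEN #FIXED "item"
  id  ID      #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ATTLIST title inv NMTOKEN #FIXED "name">
<!ELEMENT artist (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ATTLIST price inv NMTOKEN #FIXED "price">
<!ELEMENT quantity (#PCDATA)>
<!ATTLIST quantity inv NMTOKEN #FIXED "quantity">
]>
<art id="a1">
  <title>Leapin' Lizards</title>
  <artist>El Gecko</artist>
  <price>15</price>
  <quantity>3</quantity>
</art>"""


def form_attributes(xml_text, form_att="inv"):
    """Map each element name to the form attribute value the parser reports."""
    found = {}

    def start(name, attrs):
        # attrs includes attributes defaulted from the DTD, not only those
        # written in the instance; elements without the attribute get None.
        found[name] = attrs.get(form_att)

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.Parse(xml_text, True)
    return found


forms = form_attributes(DOC)
print(forms)
```

Although the instance never mentions inv, the parser reports inv values for every element except artist, which is the information an architecture engine needs.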

Now suppose I tell an architecture engine to process my data using the inv architecture. The architecture engine should produce as output the following architectural document containing only the markup and data defined by inv:

<item id="a1">
  <name>Leapin' Lizards</name>
  <price>15</price>
  <quantity>3</quantity>
</item>

The architecture engine replaces each element from my data with its corresponding architectural element. The <artist> element is not processed because nothing in the architecture corresponds to it. If the architecture engine were a validating architecture engine, then it could also determine whether my data is valid with respect to inv's DTD or schema.
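The element replacement just described can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not APEX and not a full implementation of the AFDR: it handles a single form attribute, copies all other attributes unchanged, and drops any element (with its content) that lacks the form attribute. The function name architectural_view is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<art id="a1" inv="item">
  <title inv="name">Leapin' Lizards</title>
  <artist>El Gecko</artist>
  <price inv="price">15</price>
  <quantity inv="quantity">3</quantity>
</art>"""


def architectural_view(elem, form_att):
    """Return elem's architectural counterpart, or None if it has no form."""
    form = elem.get(form_att)
    if form is None:
        return None  # nothing in the architecture corresponds to this element
    out = ET.Element(form)
    for name, value in elem.attrib.items():
        if name != form_att:  # the support attribute itself is not copied
            out.set(name, value)
    if elem.text and elem.text.strip():
        out.text = elem.text  # keep character data, skip pure indentation
    for child in elem:
        mapped = architectural_view(child, form_att)
        if mapped is not None:
            out.append(mapped)
    return out


item = architectural_view(ET.fromstring(DOC), "inv")
print(ET.tostring(item, encoding="unicode"))
```

Apart from indentation, the printed result matches the architectural document shown above: an item element with id "a1" containing name, price, and quantity.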

The preceding example shows only the most rudimentary capabilities of architectures. Other possibilities include, but are not limited to:

  • Renaming attributes;

  • Selectively ignoring markup and/or content during architecture processing;

  • Specifying and processing a document using multiple architectures.
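The last possibility, deriving one document from more than one architecture, can be illustrated with a Python sketch. The second architecture, cat (a catalog with entry, heading, and creator forms), is purely hypothetical and introduced only for this example, as are the function and variable names. The sketch is a simplification, not an implementation of the AFDR.

```python
import xml.etree.ElementTree as ET

# "inv" comes from the example above; "cat" is a hypothetical architecture.
SUPPORT_ATTS = {"inv", "cat"}

DOC = """<art id="a1" inv="item" cat="entry">
  <title inv="name" cat="heading">Leapin' Lizards</title>
  <artist cat="creator">El Gecko</artist>
  <price inv="price">15</price>
  <quantity inv="quantity">3</quantity>
</art>"""


def derive(elem, form_att):
    """Build the view of elem for the architecture named by form_att."""
    form = elem.get(form_att)
    if form is None:
        return None  # element plays no role in this architecture
    out = ET.Element(form)
    for name, value in elem.attrib.items():
        if name not in SUPPORT_ATTS:  # hide every architecture's form attribute
            out.set(name, value)
    if elem.text and elem.text.strip():
        out.text = elem.text
    for child in elem:
        mapped = derive(child, form_att)
        if mapped is not None:
            out.append(mapped)
    return out


root = ET.fromstring(DOC)
views = {arch: derive(root, arch) for arch in sorted(SUPPORT_ATTS)}
for arch, view in views.items():
    print(arch, ET.tostring(view, encoding="unicode"))
```

Each architecture sees only the elements that carry its own form attribute, so the artist element appears in the cat view but not in the inv view.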

3. Architectures and APEX

APEX is a non-validating generic architecture engine written in XSLT. The APEX XSLT stylesheet is available as part of the XSLToolbox [XSLToolbox], a collection of XSLT stylesheets available from NIST. APEX implements a simple but useful subset of the AFDR (Architectural Form Definition Requirements) specified in Annex A.3 of ISO/IEC 10744:1997. APEX behaves similarly to David Megginson's XAF package [XAF] and differs from the AFDR in the same ways as XAF. Unlike other architecture engines, which use XML processing instruction syntax to specify architecture usage and control information, APEX obtains this information through XSLT stylesheet parameters [2]. Thus input to APEX consists of an XML document plus stylesheet parameters for identifying the document's architecture support attributes and for controlling architectural processing. APEX produces as output an architectural document conforming to the architecture specified by the stylesheet parameters and the input document's architecture support attributes.

The following UML deployment diagram shows how APEX can be used to enable my artwork data from the example in Section 2.3 to be processed using software supporting the inv inventory architecture.

Deploying APEX to use inv architecture.

APEX's input is:

  • My artwork data augmented with architecture support attributes. The architecture support attributes may either be explicitly specified in the data, or they may be specified as defaults in a DTD or schema.

  • Stylesheet parameters directing APEX to process the data using inv.

APEX's output is data that can be fed to an inventory processing application. The inventory processing application need not be capable of processing artwork data. All it needs to know about are inv's syntax rules. The inv inventory architecture is the “glue” that holds everything together. It describes the inventory processing application's information requirements. It also influences the artwork data in that the data has to be derivable from inv using the data's architecture support attributes.

As mentioned in Section 2.2, it is possible to perform some architectural processing using architecture support attributes alone, without any a priori syntactic knowledge of the architecture itself. In these cases, a generic architecture engine can process a document with respect to an architecture in the absence of any formal specification of the architecture's syntax rules. APEX, which does not consult such syntax rules, is a generic engine of this kind. Although this limits APEX's capabilities somewhat, APEX's ease of customization helps to mitigate this limitation. In fact, APEX's lack of dependence on any particular representation method for syntax rules (such as DTDs) can be viewed as an advantage because architectures processed by APEX are free to specify their syntax rules using any schema language they want.

Because APEX is written in XSLT instead of in a general-purpose programming language, APEX's functionality is easy to extend using XSLT's <xsl:import> or <xsl:include> elements. Thus, adding transformation capabilities to APEX that are supported by XSLT but not by the Architectural Form Definition Requirements is simple. For example, suppose I want to create an inv architectural view of my artwork data that retains the content of the <artist> element such that “Leapin' Lizards” appears as follows:

<item id="a1">
  <name>Leapin' Lizards, artist: El Gecko</name>
  <price>15</price>
  <quantity>3</quantity>
</item>

This can be done by augmenting APEX with a template rule that appends the <artist> element's content to the <title> element's content. The following XSLT stylesheet accomplishes the customization:

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Pull in APEX's template rules. -->
  <xsl:include href="apex.xsl"/>
  <!-- Custom rule: map <title> to inv's <name>, appending the artist. -->
  <xsl:template match="art/title">
    <xsl:element name="name">
      <xsl:value-of select="."/>
      <xsl:text>, artist: </xsl:text>
      <xsl:value-of select="../artist"/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>

4. Taxonomy of XML Reuse Methods

To see more clearly where architectures fit on the XML landscape, let us now look at the various reuse methods available to XML application developers. These methods tend to fall into three categories.

The first category, which I call syntactic, consists of techniques that merely serve as shorthand for expressing XML vocabularies more succinctly. These techniques are analogous to macros in a programming language. They make coding more convenient, but offer little with regard to object-oriented design. Examples of syntactic methods include the use of parameter entities in SGML and XML DTDs and the use of <group> and <attributeGroup> elements in the W3C XML Schema definition language. Architectures do not fit into this category.

The second category of reuse techniques consists of those enabling information hiding. These methods allow for XML data to be selectively hidden from applications, making it possible for heterogeneous vocabularies to coexist within the same data set. XML namespaces are perhaps the most commonly used method for achieving information hiding. A namespace-aware XML application can ignore any markup belonging to a foreign namespace. For example, XSLT processors ignore top-level stylesheet elements belonging to namespaces other than the XSLT namespace, making it possible for stylesheets to combine XSLT transforms with arbitrary non-XSLT data such as user documentation.
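Namespace-based information hiding can be illustrated with a short Python sketch. The namespace URIs, the doc:note element, and the function name are hypothetical and chosen only for this example. The "application" here understands a single namespace and skips everything else.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for an inventory vocabulary.
INV_NS = "http://example.org/inv"

DOC = """<inv:item xmlns:inv="http://example.org/inv"
          xmlns:doc="http://example.org/doc" id="a1">
  <inv:name>Leapin' Lizards</inv:name>
  <doc:note>Reproduction of a painting by El Gecko</doc:note>
  <inv:price>15</inv:price>
</inv:item>"""


def local_names_in(root, namespace):
    """List local names of elements in `namespace`, in document order."""
    prefix = "{" + namespace + "}"  # ElementTree writes tags as {uri}local
    return [
        el.tag[len(prefix):]
        for el in root.iter()
        if el.tag.startswith(prefix)
    ]


seen = local_names_in(ET.fromstring(DOC), INV_NS)
print(seen)
```

The doc:note element never reaches the inventory logic, yet it remains in the data for any tool that does understand the documentation namespace.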

Another example of information hiding is “lax” validation of XML data. W3C XML Schema allows for lax validation by means of the <any> element, which permits a type's content model to contain any well-formed XML. RELAX NG schemas use the <anyName> and <except> name class elements to specify lax validation.

Architectures fit into this information-hiding category. As the example in Section 2.3 illustrates, an architecture engine ignores any elements that are not defined as part of the architecture. Thus, my artwork data can have an <artist> element, even though this element has no counterpart in inv. The architecture engine effectively hides <artist> from my inventory processing software.

The third category, which I call semantic, includes those reuse methods that enable developers to represent relationships between concepts defined in XML vocabularies. “Semantic Web” technologies such as RDF (Resource Description Framework) [RDF], RDF Schema [RDFS], and Topic Maps clearly fit into this category. W3C XML Schema also has some semantic capabilities. For example, type derivation by extension or restriction enables reuse through inheritance. Also, substitution groups permit elements to be substituted for other elements, allowing elements to be used interchangeably.

Architectures fit into the semantic category of reuse methods as well. They support renaming of elements and attributes and also support selective suppression of markup and/or content during architectural processing. Using these capabilities, architectures can mimic W3C XML Schema's type derivation and element substitution behavior. For example, the <art> element in my artwork vocabulary in Section 2.3 can be thought of as having a type that is an extension of the type of architecture inv's <item> element.

5. Architectures Versus Other Reuse Methods

Architectures are among the oldest of several DTD/schema reuse methods available to XML developers. This raises the question of whether newer methods for achieving reusability make architectures obsolete. To answer this question, consider three contemporary XML technologies: XML namespaces, W3C XML Schema, and XSLT.

Namespaces are a simple yet useful mechanism for information hiding. However, they fail to address the more semantic aspects of reuse. They cannot express how different names relate to one another. They cannot even convey any information about the namespace itself. As far as XML processing tools are concerned, the URI (Uniform Resource Identifier) associated with a namespace prefix is nothing more than a unique identifier for the namespace. Although it is common practice for the namespace URI to point to documentation, a schema, or some other Web resource, namespace processors cannot assume this behavior.

Another issue with XML namespaces is that the syntax can make XML documents hard for humans to read. As David Megginson points out in his XAF documentation, architectures can actually be helpful here. By making elements with qualified names into architectural forms, it is possible to enjoy the advantages of XML namespaces without subjecting humans reading the data to endless prefixes and colons.

W3C XML Schema has several features designed to promote reusability. Examples mentioned in Section 4 include the <any> element, type derivation by extension or restriction, and substitution groups. As W3C XML Schema practitioners gain more experience [Costello], they might discover that these features can duplicate the benefits of architectures. However, even if using architectures were to add no value to W3C XML Schemas, architectures would still be worthwhile for applications not using W3C XML Schema. Because architecture processing is attribute-driven rather than schema-driven, architectures are compatible with any XML application, regardless of the schema language used.

XSLT, a language for transforming one XML document into another XML document, lets developers specify conversions between different XML vocabularies. XSLT is also handy for solving the common systems integration problem where XML documents almost but not quite conform to a given vocabulary. Although XSLT has considerable power, stylesheets that perform non-trivial transformations can be quite complex, and writing XSLT is often time-consuming. Thus, having to write an XSLT transform every time two systems need to talk to one another is a less than satisfying way to achieve interoperability.

Although architectures, like XSLT, can be used for transformation, the architecture mechanism also allows for validation and architecture-specific processing (capabilities that APEX does not support). Further, XSLT transforms are specified differently than architectural mappings. In XSLT, mappings are specified algorithmically. With architectures, however, a developer need only formally state the conformance requirement. Also, an XSLT stylesheet (unlike an architecture's support attributes) is completely separate from the schema and data, making it potentially difficult to keep an XSLT stylesheet in sync with the vocabulary it is supposed to transform.

As the implementation of APEX in XSLT demonstrates, architectures and XSLT are complementary. Although the transformations architectures allow are more limited than those possible with XSLT, there is no guarantee in the general case that an XSLT transformation result is valid with respect to an intended vocabulary. Also, the verbosity and complexity of XSLT syntax make it impractical to write an XSLT transform that could have been specified more succinctly using architecture support attributes. When used together, though, architectures and XSLT allow developers to have the best of both worlds.

Acknowledgements

I wish to thank Simon Frechette, Don Libes, Sandy Ressler, and the Extreme Markup Languages 2001 peer reviewers for their helpful feedback and suggestions for improving earlier drafts of this paper. I am also grateful to NIST's Systems Integration for Manufacturing Applications program and Advanced Technology Program for funding this work.



[1] Commercial equipment and materials are identified in order to describe certain procedures. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that the materials or equipment identified are necessarily the best available for the purpose. Unified Modeling Language, UML, Object Management Group, and other marks are trademarks or registered trademarks of Object Management Group, Inc. in the U.S. and other countries.

[2] If a processing instruction were used to supply this information, then APEX would need to parse the processing instruction's string value. Since XSLT processors do not parse this string value, APEX would have to be augmented with (non-XSLT) programming language code. Passing architecture usage and control information through stylesheet parameters is therefore a more sensible approach.