Last revised 2002/07/23 13:40:40
Abstract
XML (Extensible Markup Language) developers have at their disposal a variety of tools for achieving schema reuse. An often-overlooked reuse method is the specification of architectures for creating and processing data. Experience with APEX, an architecture processing tool implemented using XSLT (Extensible Stylesheet Language Transformations), demonstrates that architectures can fulfill a role not well served by alternative approaches to reuse.
Developers of markup languages have long recognized the importance of reuse. Since the early days of SGML (Standard Generalized Markup Language) [SGML], authors of DTDs (Document Type Definitions) have used parameter entities to help make markup declarations more reusable. Newer approaches to reuse run the gamut from the relatively simple concept of namespaces [Names] to more sophisticated methods such as the facilities available in the W3C's (World Wide Web Consortium's) XML Schema [XSchema] specification. As a result, XML developers today have at their disposal a variety of tools for achieving reuse.
An often-overlooked reuse method is the specification of architectures [AFDR] [Kimber] [Megginson] for creating and processing data. Architectures, alternatively referred to as “architectural forms” or “inheritable information architectures,” have been around since the mid-1990s. Although the architecture mechanism's invention predates the standardization of XML, architectures are still being used today — most notably in the ISO Topic Maps standard [TM] and in the W3C's XML Linking specification [XLink].
In this paper, I briefly present the architecture mechanism as it applies to XML processing. Next, I discuss APEX (Architectural Processor Employing XSLT), a tool implemented using XSLT (Extensible Stylesheet Language Transformations) [XSLT] for processing architectures. I conclude by discussing how architectures compare with some alternative reuse techniques.
Within the context of markup languages, an architecture is a collection of rules for creating and processing a class of documents. Architectures allow applications to:
Extend XML vocabularies without breaking existing applications.
Create architecture-specific document views, retaining only relevant markup and character data while hiding all other content.
Promote data sharing between user communities with inconsistent terminologies by enabling the substitution of identifier names and by allowing simple document transformations.
Unlike a grammar, which defines every aspect of representation and processing for a class of documents, an architecture need not specify a complete document type. Instead, an architecture defines rules known as architectural forms that application designers can apply in defining their XML vocabularies.
An XML document using an architecture contains special attributes, called architecture support attributes, describing how its elements, attributes, and data correspond to their counterparts in the architecture (which are governed by the architecture's architectural forms). Because the values of architecture support attributes are usually invariant for all documents throughout an XML vocabulary, these attributes can be given default values. Hence, it is easy to hide a document's use of architectures from architecture-unaware software tools as well as from humans viewing or editing the data.
Software tools for processing architectures are called architecture engines. An architecture engine may be specific to a particular architecture, or it may be generic (able to process any architecture).
An architecture engine draws upon the following sources of information to process an XML document:
Instructions specifying the architecture being processed and which attributes in the document are the architecture support attributes.
The document's architecture support attribute values.
Syntax rules for the architecture itself. For a generic architecture engine, these rules might be in the form of a DTD, XML schema, or some other formalism for specifying syntax rules. For an architecture-specific engine, these rules could be hard-coded into the engine itself. Some syntax rules can even be implied using support attributes alone, making it possible in some cases for a generic architecture engine to process an architecture without reading the architecture's DTD or schema.
Consider a simple architecture called inv for inventory processing. Suppose that software exists for processing data structured according to inv's syntax. Assume that data conforming to inv consists of an <item> element with a required ID attribute “id” and that <item> contains the elements <name>, <price>, and <quantity>. The following simple UML™ (Unified Modeling Language) [1] [UML] [Carlson] class diagram describes inv's XML syntax. The diagram uses <<element>> and <<attribute>> stereotypes to indicate whether a UML attribute denotes an XML element or an XML attribute.

The following markup declarations define inv using DTD syntax. Alternatively, I could have used a non-DTD XML schema language such as W3C XML Schema or RELAX NG [RELAXNG] [RELAXNGT].
<!ELEMENT item (name, price, quantity) >
<!ATTLIST item
id ID #REQUIRED >
<!ELEMENT name (#PCDATA) >
<!ELEMENT price (#PCDATA) >
<!ELEMENT quantity (#PCDATA) >
Now suppose I want to create some XML data consisting of reproductions of works of art that I have, with each work of art having a unique identifier, title, artist, price, and quantity of reproductions on hand. The following UML class diagram specifies the artwork data's XML syntax:

Assume I have three copies of a painting, “Leapin' Lizards,” painted by “El Gecko,” with each copy selling for $15. This data can be represented as:
<art id="a1">
  <title>Leapin' Lizards</title>
  <artist>El Gecko</artist>
  <price>15</price>
  <quantity>3</quantity>
</art>
The following table shows the correspondence between the elements in my data and inv's architectural elements:
| element | corresponding architectural element |
|---|---|
| art | item |
| title | name |
| artist | [no corresponding element] |
| price | price |
| quantity | quantity |
In order to process my data using the software that already exists for inventory processing, I add a form attribute to my data. The form attribute is an architecture support attribute whose purpose is to provide the architecture engine with the information in the table above. My form attribute has the same name, inv, as the architecture name. With the form attribute added, the data for “Leapin' Lizards” looks like this:
<art id="a1" inv="item">
  <title inv="name">Leapin' Lizards</title>
  <artist>El Gecko</artist>
  <price inv="price">15</price>
  <quantity inv="quantity">3</quantity>
</art>
Although architecture support attributes add complexity to the data, hiding the complexity is easy. Because the form attribute values for the <art>, <title>, <price>, and <quantity> elements are the same for all works of art, these attribute values can be specified as defaults. Thus, the form attributes can be hidden from any architecture-unaware software tool. For example, suppose I have the following DTD with system identifier “art.dtd”:
<!ELEMENT art (title, artist, price, quantity) >
<!ATTLIST art
inv NMTOKEN #FIXED "item"
id ID #REQUIRED >
<!ELEMENT title (#PCDATA) >
<!ATTLIST title
inv NMTOKEN #FIXED "name" >
<!ELEMENT artist (#PCDATA) >
<!ELEMENT price (#PCDATA) >
<!ATTLIST price
inv NMTOKEN #FIXED "price" >
<!ELEMENT quantity (#PCDATA) >
<!ATTLIST quantity
inv NMTOKEN #FIXED "quantity" >
Then I could specify the “Leapin' Lizards” data as:
<!DOCTYPE art SYSTEM "art.dtd">
<art id="a1">
  <title>Leapin' Lizards</title>
  <artist>El Gecko</artist>
  <price>15</price>
  <quantity>3</quantity>
</art>
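The effect of the defaulted form attributes can also be imitated by a tool that does not read the DTD. The following is a minimal sketch in Python, assuming the #FIXED values from art.dtd have been copied by hand into a dictionary. The names `FORM_DEFAULTS` and `apply_defaults` are hypothetical and belong to no tool discussed in this paper.

```python
import xml.etree.ElementTree as ET

# Hand-copied from the #FIXED declarations in art.dtd. This is an assumption
# of the sketch: a DTD-aware parser would supply these values itself.
FORM_DEFAULTS = {"art": "item", "title": "name",
                 "price": "price", "quantity": "quantity"}

def apply_defaults(root, form_attr="inv", defaults=FORM_DEFAULTS):
    """Set the form attribute on each element that has a default and lacks it."""
    for elem in root.iter():
        if elem.tag in defaults and form_attr not in elem.attrib:
            elem.set(form_attr, defaults[elem.tag])
    return root

# The "Leapin' Lizards" data without any explicit form attributes.
art = ET.fromstring(
    "<art id='a1'><title>Leapin' Lizards</title><artist>El Gecko</artist>"
    "<price>15</price><quantity>3</quantity></art>")
apply_defaults(art)
print(ET.tostring(art, encoding="unicode"))
```

After the call, every element except <artist> carries an inv attribute, so the document is equivalent to the version with explicit form attributes.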
Now suppose I tell an architecture engine to process my data using the inv architecture. The architecture engine should produce as output the following architectural document containing only the markup and data defined by inv:
<item id="a1">
  <name>Leapin' Lizards</name>
  <price>15</price>
  <quantity>3</quantity>
</item>
The architecture engine replaces each element from my data with its corresponding architectural element. The <artist> element is not processed because nothing in the architecture corresponds to it. If the architecture engine were a validating architecture engine, then it could also determine whether my data is valid with respect to inv's DTD or schema.
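The renaming and hiding behavior just described can be sketched in a few lines of Python. This is an illustration of the idea only, not APEX or any other existing engine. It assumes non-automatic association (an element without a form attribute is dropped together with its content), performs no validation, and uses the hypothetical function name `architectural_view`.

```python
import xml.etree.ElementTree as ET

def architectural_view(elem, form_attr):
    """Return the architectural counterpart of elem, or None when elem
    carries no form attribute and is therefore hidden from the view."""
    form = elem.get(form_attr)
    if form is None:
        return None                       # e.g. <artist> has no counterpart
    out = ET.Element(form)                # rename to the architectural element
    for name, value in elem.attrib.items():
        if name != form_attr:             # the support attribute is not copied
            out.set(name, value)
    out.text = elem.text
    for child in elem:
        mapped = architectural_view(child, form_attr)
        if mapped is not None:
            out.append(mapped)
    return out

art = ET.fromstring(
    "<art id='a1' inv='item'><title inv='name'>Leapin' Lizards</title>"
    "<artist>El Gecko</artist><price inv='price'>15</price>"
    "<quantity inv='quantity'>3</quantity></art>")
item = architectural_view(art, "inv")
print(ET.tostring(item, encoding="unicode"))
```

The printed result is the inv architectural document for "Leapin' Lizards" in compact, single-line form.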
The preceding example shows only the most rudimentary capabilities of architectures. Other possibilities include, but are not limited to:
Renaming attributes;
Selectively ignoring markup and/or content during architecture processing;
Specifying and processing a document using multiple architectures.
APEX is a non-validating generic architecture engine written in XSLT. The APEX XSLT stylesheet is available as part of the XSLToolbox [XSLToolbox], a collection of XSLT stylesheets available from NIST. APEX implements a simple but useful subset of the AFDR (Architectural Form Definition Requirements) specified in Annex A.3 of ISO/IEC 10744:1997. APEX behaves similarly to David Megginson's XAF package [XAF] and differs from the AFDR in the same ways as XAF. Unlike other architecture engines, which use XML processing instruction syntax to specify architecture usage and control information, APEX obtains this information through XSLT stylesheet parameters [2]. Thus input to APEX consists of an XML document plus stylesheet parameters for identifying the document's architecture support attributes and for controlling architectural processing. APEX produces as output an architectural document conforming to the architecture specified by the stylesheet parameters and the input document's architecture support attributes.
The following UML deployment diagram shows how APEX can be used to enable my artwork data from the example in Section 2.3 to be processed using software supporting the inv inventory architecture.

APEX's input is:
My artwork data augmented with architecture support attributes. The architecture support attributes may either be explicitly specified in the data, or they may be specified as defaults in a DTD or schema.
Stylesheet parameters directing APEX to process the data using inv.
APEX's output is data that can be fed to an inventory processing application. The inventory processing application need not be capable of processing artwork data. All it needs to know about are inv's syntax rules. The inv inventory architecture is the “glue” that holds everything together. It describes the inventory processing application's information requirements. It also influences the artwork data in that the data has to be derivable from inv using the data's architecture support attributes.
Since XSLT provides no standard syntax for specifying stylesheet parameters, APEX's architecture usage syntax is XSLT processor-dependent. To understand how this affects the use of APEX, suppose that APEX uses a fictitious XSLT processor called xslt whose command line syntax is as follows:
xslt xml-document style-sheet [parameter = value...]
To process my artwork data using the inv architecture, I use two parameters: name for the name of the architecture being processed (inv), and auto to specify how elements in the data are associated with elements in the architecture. I specify “nArcAuto” as the value for auto. This tells APEX not to automatically associate elements, i.e. not to process an element unless the element has a form attribute. The resulting XSLT processor invocation is:
xslt art.xml apex.xsl name=inv auto=nArcAuto
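To illustrate the processor dependence with a real tool: xsltproc (part of libxslt) passes string parameters with `--stringparam name value` and expects the stylesheet before the document. The Python sketch below only builds such a command line and does not run it. Whether APEX behaves identically under xsltproc is an assumption, and `xsltproc_command` is a hypothetical helper name.

```python
def xsltproc_command(document, stylesheet, **params):
    """Build an argv list for xsltproc; each parameter is passed as a string."""
    argv = ["xsltproc"]
    for name, value in params.items():
        argv += ["--stringparam", name, value]
    # xsltproc takes the stylesheet first, then the input document.
    return argv + [stylesheet, document]

cmd = xsltproc_command("art.xml", "apex.xsl", name="inv", auto="nArcAuto")
print(" ".join(cmd))
```

The same two parameters appear as in the fictitious xslt invocation; only the option syntax and the argument order differ.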
A more complete discussion of APEX's usage and behavior is available as part of the documentation in the XSLToolbox distribution.
I mentioned in Section 2.2 that some architectural processing can be performed using architecture support attributes alone, without any a priori syntactic knowledge of the architecture itself. In these cases, a generic architecture engine can process a document with respect to an architecture in the absence of any formal specification of the architecture's syntax rules. APEX, which does not consult syntax rules at all, is one such engine. Although this limits APEX's capabilities somewhat, APEX's ease of customization helps to mitigate the limitation. In fact, APEX's lack of dependence on any particular representation method for syntax rules (such as DTDs) can be viewed as an advantage, because architectures processed by APEX are free to specify their syntax rules using any schema language they want.
Because APEX is written in XSLT instead of in a programming language, APEX's functionality is easy to extend using XSLT's <xsl:import> or <xsl:include> elements. Thus adding transformation capabilities to APEX that are supported by XSLT but not by the Architectural Form Definition Requirements is simple. For example, suppose I want to create an inv architectural view of my artwork data that retains the content of the <artist> element such that “Leapin' Lizards” appears as follows:
<item id="a1">
  <name>Leapin' Lizards, artist: El Gecko</name>
  <price>15</price>
  <quantity>3</quantity>
</item>
This can be done by augmenting APEX with a template rule that appends the <artist> element's content to the <title> element's content. The following XSLT stylesheet accomplishes the customization:
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Pull in the generic APEX template rules. -->
  <xsl:include href="apex.xsl"/>
  <!-- Custom rule: emit <name> holding the title followed by the artist. -->
  <xsl:template match="art/title">
    <xsl:element name="name">
      <xsl:value-of select="."/>
      <xsl:text>, artist: </xsl:text>
      <xsl:value-of select="../artist"/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
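The idea of layering a specific rule over generic mapping rules is not tied to XSLT. As a rough analogy only, the Python sketch below lets a per-element override take precedence over the generic form-attribute mapping. It says nothing about how APEX is implemented, and `architectural_view`, `overrides`, and `title_with_artist` are hypothetical names.

```python
import xml.etree.ElementTree as ET

def architectural_view(elem, form_attr, overrides=None, parent=None):
    """Map elem to its architectural form; an override keyed by element
    name takes precedence over the generic form-attribute rule."""
    overrides = overrides or {}
    if elem.tag in overrides:
        return overrides[elem.tag](elem, parent)
    form = elem.get(form_attr)
    if form is None:
        return None                       # no form attribute: element is hidden
    attrs = {k: v for k, v in elem.attrib.items() if k != form_attr}
    out = ET.Element(form, attrs)
    out.text = elem.text
    for child in elem:
        mapped = architectural_view(child, form_attr, overrides, elem)
        if mapped is not None:
            out.append(mapped)
    return out

def title_with_artist(title, art):
    """Custom rule: <name> holds the title followed by the artist."""
    name = ET.Element("name")
    name.text = f"{title.text}, artist: {art.findtext('artist')}"
    return name

art = ET.fromstring(
    "<art id='a1' inv='item'><title inv='name'>Leapin' Lizards</title>"
    "<artist>El Gecko</artist><price inv='price'>15</price>"
    "<quantity inv='quantity'>3</quantity></art>")
item = architectural_view(art, "inv", {"title": title_with_artist})
print(ET.tostring(item, encoding="unicode"))
```

As with the stylesheet, the generic rules stay untouched and only the one element that needs special treatment gets its own rule.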
To see more clearly where architectures fit on the XML landscape, let us now look at the various reuse methods available to XML application developers. These methods tend to fall into three categories.
The first category, which I call syntactic, consists of techniques that merely serve as shorthand for expressing XML vocabularies more succinctly. These techniques are analogous to macros in a programming language. They make coding more convenient, but offer little with regard to object-oriented design. Examples of syntactic methods include the use of parameter entities in SGML and XML DTDs and the use of <group> and <attributeGroup> elements in the W3C XML Schema definition language. Architectures do not fit into this category.
The second category of reuse techniques consists of those enabling information hiding. These methods allow for XML data to be selectively hidden from applications, making it possible for heterogeneous vocabularies to coexist within the same data set. XML namespaces are perhaps the most commonly used method for achieving information hiding. A namespace-aware XML application ignores any markup belonging to a foreign namespace. For example, XSLT processors ignore top-level stylesheet elements belonging to namespaces other than the XSLT namespace, making it possible for stylesheets to combine XSLT transforms with arbitrary non-XSLT data such as user documentation.
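Namespace-based information hiding can be illustrated with a short Python sketch. The namespace URIs `urn:example:inv` and `urn:example:doc`, the <d:note> element, and the function `strip_foreign` are all made up for the illustration. The function keeps only elements in the namespace an application understands and discards everything else.

```python
import xml.etree.ElementTree as ET

def strip_foreign(elem, namespace):
    """Remove, in place, every descendant whose name is not in namespace."""
    prefix = "{" + namespace + "}"        # ElementTree writes names as {uri}local
    for child in list(elem):
        if child.tag.startswith(prefix):
            strip_foreign(child, namespace)
        else:
            elem.remove(child)            # foreign markup is hidden
    return elem

doc = ET.fromstring(
    "<i:item xmlns:i='urn:example:inv' xmlns:d='urn:example:doc' id='a1'>"
    "<i:name>Leapin' Lizards</i:name><d:note>Reorder soon</d:note>"
    "<i:price>15</i:price></i:item>")
strip_foreign(doc, "urn:example:inv")
print([child.tag for child in doc])
```

Only the <i:name> and <i:price> children survive. The application never sees <d:note>, yet nothing in the namespace mechanism says how the two vocabularies relate to each other.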
Another example of information hiding is “lax” validation of XML data. W3C XML Schema allows for lax validation by means of the <any> element, which permits a type's content model to contain any well-formed XML. RELAX NG schemas use the <anyName> and <except> name class elements to specify lax validation.
Architectures fit into this information-hiding category. As the example in Section 2.3 illustrates, an architecture engine ignores any elements that are not defined as part of the architecture. Thus, my artwork data can have an <artist> element, even though this element has no counterpart in inv. The architecture engine effectively hides <artist> from my inventory processing software.
The third category, which I call semantic, includes those reuse methods that enable developers to represent relationships between concepts defined in XML vocabularies. “Semantic Web” technologies such as RDF (Resource Description Framework) [RDF], RDF Schema [RDFS], and Topic Maps clearly fit into this category. W3C XML Schema also has some semantic capabilities. For example, type derivation by extension or restriction enables reuse through inheritance. Also, substitution groups permit elements to be substituted for other elements, allowing elements to be used interchangeably.
Architectures fit into the semantic category of reuse methods as well. They support renaming of elements and attributes and also support selective suppression of markup and/or content during architectural processing. Using these capabilities, architectures can mimic W3C XML Schema's type derivation and element substitution behavior. For example, the <art> element in my artwork vocabulary in Section 2.3 can be thought of as having a type that is an extension of the type of architecture inv's <item> element.
Architectures are among the oldest of several DTD/schema reuse methods available to XML developers. This raises the question of whether newer methods for achieving reusability make architectures obsolete. To answer this question, consider three contemporary XML technologies: XML namespaces, W3C XML Schema, and XSLT.
Namespaces are a simple yet useful mechanism for information hiding. However, they fail to address the more semantic aspects of reuse. They cannot express how different names relate to one another. They cannot even convey any information about the namespace itself. As far as XML processing tools are concerned, the URI (Uniform Resource Identifier) associated with a namespace prefix is nothing more than a unique identifier for the namespace. Although it is common practice for the namespace URI to point to documentation, a schema, or some other Web resource, namespace processors cannot assume this behavior.
Another issue with XML namespaces is that the syntax can make XML documents hard for humans to read. As David Megginson points out in his XAF documentation, architectures can actually be helpful here. By making elements with qualified names into architectural forms, it is possible to enjoy the advantages of XML namespaces without subjecting humans reading the data to endless prefixes and colons.
W3C XML Schema has several features designed to promote reusability. Examples mentioned in Section 4 include the <any> element, type derivation by extension or restriction, and substitution groups. As W3C XML Schema practitioners gain more experience [Costello], they might discover that these features can duplicate the benefits of architectures. However, even if using architectures were to add no value to W3C XML Schemas, architectures would still be worthwhile for applications not using W3C XML Schema. Because architecture processing is attribute-driven rather than schema-driven, architectures are compatible with any XML application, regardless of the schema language used.
XSLT, a language for transforming one XML document into another XML document, lets developers specify conversions between different XML vocabularies. XSLT is also handy for solving the common systems integration problem where XML documents almost but not quite conform to a given vocabulary. Although XSLT has considerable power, stylesheets that perform non-trivial transformations can be quite complex, and writing XSLT is often time-consuming. Thus, having to write an XSLT transform every time two systems need to talk to one another is a less than satisfying way to achieve interoperability.
Although architectures, like XSLT, can be used for transformation, the architecture mechanism also allows for validation and architecture-specific processing (although APEX does not support these capabilities). Further, XSLT transforms are specified differently than architectural mappings. In XSLT, mappings are specified algorithmically. With architectures, however, a developer need only formally state the conformance requirement. Also, an XSLT stylesheet (unlike an architecture's support attributes) is completely separate from the schema and data, making it potentially difficult to keep an XSLT stylesheet in sync with the vocabulary it is supposed to transform.
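To make the validation point concrete: a validating engine would test the derived architectural document against the architecture's syntax rules. The Python sketch below hand-codes a partial check of the inv DTD from Section 2.3 (root element name, required id attribute, child sequence, text-only leaves). It does not check ID syntax or uniqueness, and `conforms_to_inv` is a hypothetical name, not a feature of APEX.

```python
import xml.etree.ElementTree as ET

def conforms_to_inv(item):
    """Partial check of inv's DTD: root name, required id, child sequence."""
    return (item.tag == "item"
            and "id" in item.attrib
            and [child.tag for child in item] == ["name", "price", "quantity"]
            and all(len(child) == 0 for child in item))  # leaves hold text only

good = ET.fromstring("<item id='a1'><name>Leapin' Lizards</name>"
                     "<price>15</price><quantity>3</quantity></item>")
# An <artist> element that leaked into the output violates the content model.
bad = ET.fromstring("<item id='a1'><name>Leapin' Lizards</name>"
                    "<artist>El Gecko</artist><price>15</price>"
                    "<quantity>3</quantity></item>")
print(conforms_to_inv(good), conforms_to_inv(bad))
```

A generic validating engine would of course read these rules from the architecture's DTD or schema rather than hard-coding them.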
As the implementation of APEX in XSLT demonstrates, architectures and XSLT are complementary. Although the transformations architectures allow are more limited than those possible with XSLT, there is no guarantee in the general case that an XSLT transformation result is valid with respect to an intended vocabulary. Also, the verbosity and complexity of XSLT syntax make it impractical to write an XSLT transform that could have been specified more succinctly using architecture support attributes. When used together, though, architectures and XSLT allow developers to have the best of both worlds.
I wish to thank Simon Frechette, Don Libes, Sandy Ressler, and the Extreme Markup Languages 2001 peer reviewers for their helpful feedback and suggestions for improving earlier drafts of this paper. I am also grateful to NIST's Systems Integration for Manufacturing Applications program and Advanced Technology Program for funding this work.
[SGML] ISO 8879:1986. Information processing — Text and office systems — Standard Generalized Markup Language (SGML).
[Names] World Wide Web Consortium. Namespaces in XML. W3C Recommendation 14 January 1999. See http://www.w3.org/TR/REC-xml-names.
[XSchema] World Wide Web Consortium. XML Schema Part 1: Structures. W3C Recommendation 2 May 2001. See http://www.w3.org/TR/xmlschema-1.
[AFDR] ISO/IEC 10744:1997. Information processing — Time-based Structuring Language (HyTime)- 2d edition. Annex A.3 Architectural Form Definition Requirements (AFDR). See http://www.ornl.gov/sgml/wg8/docs/n1920/.
[Kimber] W. Eliot Kimber. A Tutorial Introduction to SGML Architectures. ISOGEN International Corp. See http://www.isogen.com/papers/archintro.html.
[TM] ISO/IEC 13250:2000. Topic Maps: Information Technology — Description and Markup Languages. See http://www.y12.doe.gov/sgml/sc34/document/0129.pdf.
[XLink] World Wide Web Consortium. XML Linking Language (XLink) Version 1.0. W3C Recommendation 27 June 2001. See http://www.w3.org/TR/xlink/.
[XSLT] World Wide Web Consortium. XSL Transformations (XSLT) Version 1.0. W3C Recommendation 16 November 1999. See http://www.w3.org/TR/xslt.
[RDF] World Wide Web Consortium. Resource Description Framework (RDF) Model and Syntax Specification. W3C Recommendation 22 February 1999. See http://www.w3.org/TR/REC-rdf-syntax/.
[RDFS] World Wide Web Consortium. Resource Description Framework (RDF) Schema Specification. W3C Candidate Recommendation 27 March 2000. See http://www.w3.org/TR/rdf-schema/.
[UML] Unified Modeling Language v1.4 specification, Object Management Group, 2001. See http://www.omg.org/technology/documents/formal/uml.htm.
[Carlson] David Carlson, Modeling XML Applications with UML, Addison-Wesley (ISBN: 0-201-70915-5), June 2001.
[RELAXNG] Organization for the Advancement of Structured Information Standards. RELAX NG Specification. Committee Specification 3 December 2001. See http://www.oasis-open.org/committees/relax-ng/spec.html.
[1] Commercial equipment and materials are identified in order to describe certain procedures. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that the materials or equipment identified are necessarily the best available for the purpose. Unified Modeling Language, UML, Object Management Group, and other marks are trademarks or registered trademarks of Object Management Group, Inc. in the U.S. and other countries.
[2] If a processing instruction were used to supply this information, then APEX would need to parse the processing instruction's string value. Since XSLT processors do not parse this string value, APEX would have to be augmented with (non-XSLT) programming language code. Passing architecture usage and control information through stylesheet parameters is therefore a more sensible approach.