To appear in Markup Languages: Theory and Practice, Volume 1 (1999), MIT Press
Converted to HTML from the original SGML 1999-07-30
Businesses and organizations are increasingly finding that HTML (Hyper-Text Markup Language) offers no help whatsoever in managing the information on their web sites. SGML (Standard Generalized Markup Language) provides the flexibility and reuse lacking in HTML. However, SGML alone does not address the problems involved in maintaining on-line document repositories. Although traditional database management systems are clumsy at managing hyperlinked documents, a system combining SGML, database technology, and the protocols of the Web can provide a reasonably robust environment for developing and maintaining a web site. Two possible site designs employing SGML are discussed and evaluated with respect to a set of design objectives and choices. The likely impact of the emerging XML (eXtensible Markup Language) standard on web site design is also discussed.
Businesses and organizations are increasingly turning to the World Wide Web as a means of sharing information. They are finding that, while the Web provides a convenient platform-independent and geography-independent way for individuals to communicate with one another and to interact with software applications, the language of the Web — HTML (Hyper-Text Markup Language) [HTML] — offers no help whatsoever in managing information. Although HTML is a good electronic delivery medium for many texts, it is often a poor choice for representing source documents. HTML provides no standard means for reusing the same text in multiple documents. Also, HTML limits content providers to a fixed set of tags.
SGML (Standard Generalized Markup Language)[SGML2][SGML1] provides the flexibility and reuse lacking in HTML. By using an application-specific DTD (Document Type Definition) to represent a web site's contents, site developers can take advantage of the structure defined in the DTD when building data access interfaces and document management tools. Even if the site's contents are represented using one of the standard HTML DTDs, developers can still take advantage of SGML features such as entities and marked sections to promote data reuse. However, SGML alone does not address the problems involved in maintaining on-line document repositories. Although traditional database management systems are clumsy at managing hyperlinked documents, a system combining SGML, database technology, and the protocols of the Web can provide a reasonably robust environment for developing web sites. In this paper, I discuss and evaluate two possible site designs employing SGML, with respect to a set of design objectives and choices. I then discuss how the emerging XML (eXtensible Markup Language)[XML] standard is likely to influence web site design.
This discussion addresses several web site design objectives. Each objective's relative importance depends on a site's application-specific characteristics: the kinds of documents being accessed, the volume of information being managed, and who is providing the content. These objectives are not meant to be an exhaustive list. In particular, they do not address HTML's shortcomings with respect to hyperlink navigation, nor do they consider formatting issues.
It should be easy for content providers to create and modify documents. HTML does a fairly good job of meeting this goal. HTML has a simple tag set and, with the help of a high-quality HTML editor, it is not hard for good writers to learn enough HTML to create attractive, easy-to-read documents. Editing SGML documents, on the other hand, can be cumbersome when the DTD is complex.
Numerous search engines with publicly accessible HTML user interfaces are available for locating resources on the Web. These search engines rely on “web spider” programs [ROBOT] that traverse the Web to build their search indexes. Therefore, in order for a web site's documents to be searchable using these web search engines, the documents need to be in the form of static HTML (or plain text) files. Document authors have the option of providing cues to a search engine through HTML's “META” tag, which can be used to specify name/value pairs describing arbitrary characteristics of the document. However, different search engines may use different “META” syntax, since HTML does not dictate any particular set of properties.
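To make the name/value mechanism concrete, the following is a minimal sketch of how an indexing program might collect the pairs from a page's META tags. It uses only Python's standard html.parser module. The sample page and the names MetaCollector, extract_meta, and SAMPLE are hypothetical, and real search engines may read these tags quite differently.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from META tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.pairs = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name is not None and content is not None:
            self.pairs[name.lower()] = content


def extract_meta(html_text):
    """Return a dict mapping META names to their content values."""
    collector = MetaCollector()
    collector.feed(html_text)
    collector.close()
    return collector.pairs


# Hypothetical page used only for illustration.
SAMPLE = """<html><head>
<META NAME="keywords" CONTENT="SGML, STEP, product data">
<META NAME="description" CONTENT="Hypothetical sample page">
<title>Sample</title></head><body></body></html>"""

if __name__ == "__main__":
    print(extract_meta(SAMPLE))
```

Because HTML does not fix the set of property names, a collector like this can only report what it finds. Deciding which names matter is left to each search engine.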
If context-sensitive searching is required, then the data should be structured in an application-specific manner, i.e. according to an SGML DTD. HTML alone provides only very limited structuring (titles, headings, paragraphs, etc.) and fails to represent application-specific relationships. Therefore, this objective conflicts with the accessibility objective. The RDF (Resource Description Framework) [RDF] standard specifies an infrastructure using XML syntax that enables content providers as well as digital librarians to specify metadata about a web document. If RDF-based metadata formats are implemented in web browsers, then web site designers will no longer be forced to choose between the accessibility objective and the sensitivity objective.
Once the total size of all static HTML documents in a web site grows beyond a few hundred kilobytes, it becomes necessary to spend a lot of effort keeping the site's contents from becoming stale. This requires checking for dead hyperlinks, making sure information is up to date, and maintaining consistency between the documents. HTML provides no support for any of these tasks. In particular, HTML lacks a mechanism for reuse, requiring document authors to specify the same content over and over again, inviting inconsistencies.
One of the reasons for the Web's growing popularity is the ability of a web user to quickly surf from one web site to another with a simple click of the mouse on a hot spot. The degree to which a web site provides hyperlinks to documents on other web sites is referred to as the site's luminosity [BRAY]. Although luminosity is generally a good thing, off-site hyperlinks should be used with discretion since too many of them can be distracting to the web surfer.
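As a rough illustration, the off-site links on a page can be identified by resolving each link against the page's own URL and comparing host names. The sketch below assumes that "off-site" simply means "a different host", which is a simplification. The function name off_site_links and the example URLs in the usage note are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def off_site_links(page_url, hrefs):
    """Return the hrefs that resolve to a host other than page_url's host."""
    home_host = urlparse(page_url).netloc.lower()
    external = []
    for href in hrefs:
        # Resolve relative references and fragments against the page URL.
        absolute = urljoin(page_url, href)
        host = urlparse(absolute).netloc.lower()
        # Links without a host (e.g. mailto:) are not counted.
        if host and host != home_host:
            external.append(absolute)
    return external
```

For instance, calling off_site_links with the page http://www.example.org/docs/index.html and the links "#mission", "staff.html", and "http://www.w3.org/XML/" returns only the last of the three.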
A web site designer using SGML has several decisions to make. Which choices are best depends on the relative importance of five objectives given the application. In particular, the designer must address the following issues:
The designer must decide whether to represent the source documents as HTML conforming to an HTML DTD or to use a DTD specifically tailored to the application's requirements. Each approach has its advantages and disadvantages. It is usually easier for content providers to author documents in HTML than to use an application-specific DTD. On the other hand, documents conforming to an application-specific DTD can be searched using an SGML search engine according to the structures in that DTD. Thus, using SGML-conforming HTML to represent source documents favors the creation objective, and using an application-specific DTD favors the sensitivity objective.
Until most web browsers support the display of XML documents, the majority of SGML-based web sites will need to display their SGML source documents as “browser-ready” HTML. Even if the source document is SGML-conforming HTML, all entity references and marked sections must be normalized. Therefore, an SGML-based site design needs to include an SGML-to-HTML translator to render the SGML source as browser-ready HTML. While the implementation of a translator is straightforward, the designer needs to decide whether the translation should be done on demand whenever a web browser requests access to a document or whether the translation should be done a single time after a document is created or modified. Figures 1 and 2 illustrate high-level SGML-based site designs using dynamically generated HTML and static HTML respectively.
Each alternative has its advantages and disadvantages. Generating the browser-ready HTML dynamically simplifies the web site's design because it eliminates the need to keep statically generated HTML documents in sync with the SGML source. On the other hand, dynamically-generated documents are not accessible to external search engines. Also, dynamic generation of HTML requires the use of CGI (Common Gateway Interface)[HTML], a standard for interfacing applications with web servers. CGI adds overhead to the server, in addition to the processing required to generate an HTML document. Thus, frequent requests to generate HTML from SGML on demand can degrade web server performance. To summarize, dynamically generated HTML works best when the maintainability objective is more important than the accessibility objective and when the generated HTML is in response to a request not likely to be repeated frequently (such as a database query).
Serving static HTML documents enables the use of external search engines and eliminates the need for the web server to translate SGML to HTML and to incur CGI overhead every time a document is requested. Also, because translation from SGML to HTML is done independently of HTTP (Hypertext Transfer Protocol) [HTTP] requests from web browsers, the SGML document repository can exist on a server other than the web server. However, keeping the SGML source and the HTML consistent requires an infrastructure for configuration management. The larger the number of documents in the web site, the more sophisticated the data management needs to be. A very large site requires a high-powered relational or object database management system to keep track of the associations between the SGML source and the browser-ready HTML. Thus, serving static HTML works best in situations where either of the following conditions holds:
The more hyperlinks to off-site documents a web site has, the more luminous it is. On the other hand, pointers to external resources need to be checked regularly for staleness since off-site documents can disappear or undergo modification at any given time. Thus, a site design promoting references to external documents favors the luminosity objective over the maintainability objective while restricting external references favors the maintainability objective over the luminosity objective.
I am currently implementing two web site designs. The sites being developed use very different approaches with respect to the issues previously discussed and emphasize different design objectives. The first site [ASME97][SGML96], being used as part of an environment for developing and deploying standards, uses application-specific DTDs and an SGML search engine installed on the web server, and makes extensive use of CGI. The second site, containing a mix of project information, home pages, and documentation, relies on SGML-conforming HTML to represent the data and serves static browser-ready HTML in response to requests for URLs (Uniform Resource Locators). Although these two web sites are interesting to compare with one another, they by no means represent the only design choices available. For example, a web site designer can combine an application-specific DTD with generation of static browser-ready HTML documents, as Norman Walsh has done for his web site at http://nwalsh.com. Walsh represents the source data for his entire site as a single SGML document instance conforming to the DocBook DTD [DOCBOOK] and has built an SGML-to-HTML translator that creates multiple hyperlinked HTML documents from the source. Walsh's approach enables him to rely on an SGML-validating parser to ensure that all cross references internal to the web site are consistent, although another means must be used to check for stale links to external documents.
This web site design is centered around a repository of SGML documents that are indexed for fast structure-based retrieval using an SGML search engine. The DTD used was created specifically to represent documents comprising STEP (STandard for the Exchange of Product model data) (ISO 10303)[STEP], a family of standards that attempt to define an ontology for the exchange of product data throughout a product's life cycle. Therefore, the DTD supports queries that are highly application-specific. A set of CGI scripts generates HTML in response to queries composed from information entered into HTML forms. The CGI scripts use a library of access functions that provides an interface to the repository's search engine. If the result of a query is a block of tagged text, then a translation module converts the raw SGML data to HTML. Because all HTML is created on the fly, this web site design closely resembles the one shown in Figure 1.
Figure 3 shows an example of the CGI application's output. The HTML shown is a response to a query for a UoF (Unit of Functionality) from a particular STEP standard. UoFs are collections of concepts (known among developers of STEP standards as application objects) that are complete and unambiguous. UoFs serve as a mechanism for modularizing STEP. STEP developers normally study UoFs, application objects, and their interrelationships by reading either paper copies of the standards or on-line versions in word processor formats. The interface provided by Site 1 is a potentially far more useful way to disseminate this information, not only because the interface provides fast access, but also because the use of SGML makes it possible to view data associations not explicitly stated in the standards documents themselves.
For example, the SGML representing the actual definition from the standard for the UoF shown in Figure 3 is as follows:
    <uof name="faceted_csg_representation">
      <uof.def>
        The faceted_csg_representation UoF consists of CSG primitive elements that are bounded by planar surfaces. The primitives are used to represent the complete shapes and component shapes of building elements; they can be combined using boolean operations.
      </uof.def>
      <appobj.ref.list>
        <appobj.ref.list.item appobj.name.linkend="Block">
        <appobj.ref.list.item appobj.name.linkend="Truncated_pyramid">
      </appobj.ref.list>
    </uof>

This SGML data specifies only the name of the UoF, its definition, and the application objects it uses. It does not explicitly specify the other UoFs that use this UoF's application objects, although this information is provided in the right hand column of the table in Figure 3. The contents of the right hand column are obtained by searching all the other UoFs in Site 1's repository and, for each of these UoFs, comparing application objects used to those of the faceted_csg_representation UoF.
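The comparison just described can be sketched in a few lines. This is an illustration only, not Site 1's actual access library: it assumes the repository has already been reduced to a mapping from each UoF name to the set of application objects it references. The two UoFs named hypothetical_uof_a and hypothetical_uof_b, and the objects Sphere and Cone, are invented for the example.

```python
def related_uofs(repository, target):
    """Map each application object of `target` to the other UoFs that use it."""
    target_objects = repository[target]
    related = {}
    for app_object in sorted(target_objects):
        related[app_object] = sorted(
            name
            for name, objects in repository.items()
            if name != target and app_object in objects
        )
    return related


# Hypothetical repository contents: UoF name -> application objects referenced.
# Only faceted_csg_representation and its two objects come from the example above.
REPOSITORY = {
    "faceted_csg_representation": {"Block", "Truncated_pyramid"},
    "hypothetical_uof_a": {"Block", "Sphere"},
    "hypothetical_uof_b": {"Cone"},
}

if __name__ == "__main__":
    print(related_uofs(REPOSITORY, "faceted_csg_representation"))
```

The point of the sketch is that the association is derived from the markup rather than stated in it. No UoF element lists the UoFs that share its application objects.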
Site 1 succeeds well at the sensitivity objective because it permits queries with search criteria tailored to the semantics of the documents in the repository. It does so, however, at the expense of the creation objective, in that the complexity of the DTD makes it difficult to author new standards in SGML as well as to convert existing standards created in word processor formats. Although Site 1 does not meet the accessibility objective, the accessibility objective's importance is diminished because the ability to perform context-sensitive searches lessens the need to use the Web's general purpose search engines. Site 1 meets the maintainability objective to the extent that a validating SGML parser can detect inconsistent ID references and entity references. Because Site 1's DTD does not support HyTime (Hyper-media Time-based Structuring Language) [HYTIME], XML linking [XLINK][XPOINTER], or some other means for representing hyperlinks from one document to another, the luminosity objective is not met. However, this is more a limitation of the DTD than of the framework of the design itself. If support for hyperlinks between documents and to off-site resources were added to the DTD, then Site 1's library of access functions and CGI scripts could be augmented to achieve the luminosity objective.
Site 1 is best suited for applications where data is highly structured, modification of documents is tightly controlled, and (unless the DTD used supports hyperlinks to other documents) data is fairly self-contained. Site 1 is also a good choice when the volume of data is large, and it would be impractical to maintain it as static HTML. The area of application in which Site 1 is being deployed meets the above criteria. STEP standards are highly structured, and a lengthy approval process is required to change their contents. They also tend to be large - some of them are thousands of pages long.
This web site design, which resembles the one shown in Figure 2, specifies a collection of SGML documents conforming to the HTML 4.0 DTD. Each SGML-conforming HTML document is stored in a file. These documents may contain SGML constructs not understood by today's web browsers such as references to general entities defined specifically for the web site and marked sections. In particular, all HTML links (i.e. anchors containing HREF attributes) pointing to external documents are defined as general entities. Whenever a document is created or its source modified, browser-ready HTML is created by running a script that uses the spam SGML normalizer, resolving entity references and instantiating any marked sections (see http://www.jclark.com/sp) [KIMBER].
As an example, consider the following SGML-conforming HTML:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" [
    <!ENTITY % myents PUBLIC "-//NIST//ENTITIES General Entities for Web Pages//EN">
    %myents;
    ]>
    <html>
    <head>&basicstylesheet;
    <title>Manufacturing Standards Methodology Group</title>
    </head>
    <body bgcolor="#FFFFFF">
    <h2>Manufacturing Standards Methodology Group</h2>
    <a href="#mission">What We Do</a> |
    <a href="#staff">Staff</a> |
    <a href="#links">Related Activities</a> |
    &msid; home page

The normalizer converts this to the following browser-ready HTML, expanding the entity references to basicstylesheet and msid:
    <html>
    <head>
    <link title="Basic Style Sheet" href="http://www.mel.nist.gov/div826/subject/apde/basic.css" rel="stylesheet" type="text/css">
    <title>Manufacturing Standards Methodology Group</title>
    </head>
    <body bgcolor="#FFFFFF">
    <h2>Manufacturing Standards Methodology Group</h2>
    <a href="#mission">What We Do</a> |
    <a href="#staff">Staff</a> |
    <a href="#links">Related Activities</a> |
    <a href="http://www.mel.nist.gov/div826/msid/">Manufacturing Systems Integration Division</a> home page
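The effect of the expansion step can be imitated with a short sketch. This is not how the SP normalizer works internally: it is a simplified stand-in that performs plain textual substitution of general entity references and ignores marked sections, nested entities, and the document type declaration. The entity values are taken from the example output above. The names ENTITIES, ENTITY_REF, and expand_entities are hypothetical.

```python
import re

# Matches a general entity reference such as &msid;
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")

# Entity values copied from the example; a real site would read them
# from its entity catalog rather than hard-coding them.
ENTITIES = {
    "basicstylesheet": (
        '<link title="Basic Style Sheet" '
        'href="http://www.mel.nist.gov/div826/subject/apde/basic.css" '
        'rel="stylesheet" type="text/css">'
    ),
    "msid": (
        '<a href="http://www.mel.nist.gov/div826/msid/">'
        "Manufacturing Systems Integration Division</a>"
    ),
}


def expand_entities(text, entities):
    """Replace each &name; reference whose name appears in `entities`."""

    def substitute(match):
        name = match.group(1)
        # Unknown references (for example &amp;) are left untouched.
        return entities.get(name, match.group(0))

    return ENTITY_REF.sub(substitute, text)


if __name__ == "__main__":
    print(expand_entities("<head>&basicstylesheet;", ENTITIES))
    print(expand_entities("&msid; home page", ENTITIES))
```

Leaving unknown references in place means that standard HTML character entities pass through to the browser unchanged, which is the behavior a site would want from this step.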
In order to prevent the hyperlinks from getting stale and to manage the generation of web-browser-ready HTML, a database is needed to keep track of entity use. The database's data model, shown in Figure 4, contains objects representing authors, web documents, and entities. Each author object contains identification information for the author as well as pointers to all the documents for which the author is responsible. Document objects contain a pointer to the author responsible for maintaining the document, the location of the SGML source file, the URL for the corresponding browser-ready HTML file, and a list of all entity references specified in the document. Each entity object contains the entity's name, the entity's value, and pointers to all the documents referencing it.
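The three kinds of objects and the pointers between them might be sketched as follows. This is an in-memory illustration under stated assumptions, not the database design of Figure 4. The email field stands in for unspecified "identification information", and the helper functions register_document and documents_to_regenerate are hypothetical.

```python
from dataclasses import dataclass, field


# eq=False keeps identity comparison, which avoids recursive equality
# checks through the circular author/document/entity pointers.
@dataclass(eq=False)
class Author:
    name: str
    email: str  # assumed form of "identification information"
    documents: list = field(default_factory=list)


@dataclass(eq=False)
class Entity:
    name: str
    value: str
    referenced_by: list = field(default_factory=list)


@dataclass(eq=False)
class Document:
    author: Author
    sgml_source: str  # location of the SGML source file
    html_url: str     # URL of the corresponding browser-ready HTML
    entity_refs: list = field(default_factory=list)


def register_document(author, sgml_source, html_url, entities):
    """Create a document and record the pointers in both directions."""
    doc = Document(author, sgml_source, html_url, list(entities))
    author.documents.append(doc)
    for entity in entities:
        entity.referenced_by.append(doc)
    return doc


def documents_to_regenerate(entity):
    """URLs of browser-ready files affected when `entity` changes."""
    return [doc.html_url for doc in entity.referenced_by]
```

Keeping the pointers in both directions is what makes the entity-to-document lookup cheap: when an entity's value changes, the affected browser-ready files can be listed without scanning every source document.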
Whenever a content provider creates a new document or modifies an existing one, a “check-in” procedure performs the following steps:
Whenever an entity is added or modified, the following steps are performed:
A link-checking program is needed to ensure that there are no dead or stale hyperlinks in the entities in the database. Any problems uncovered by the link-checking program should be brought to the attention of the web master by email. The server should be configured to run this program automatically on a regular schedule, perhaps daily, so that the web master does not have to remember to do it.
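A minimal sketch of such a link checker follows, assuming the entities that hold hyperlinks have already been reduced to a mapping from entity name to URL. It detects only dead links (failed requests or HTTP error codes), not stale content, and it leaves out the email notification and the scheduling. The function names http_status and find_dead_links are hypothetical. Some servers reject HEAD requests, so a production checker might need to fall back to GET.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for `url`, or None if the request fails."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error code such as 404.
        return error.code
    except (urllib.error.URLError, OSError, ValueError):
        # No usable answer: unknown host, refused connection, bad URL.
        return None


def find_dead_links(entity_urls, fetch_status=http_status):
    """Return {entity_name: url} for links that failed or returned >= 400."""
    dead = {}
    for name, url in entity_urls.items():
        status = fetch_status(url)
        if status is None or status >= 400:
            dead[name] = url
    return dead
```

The fetch_status parameter lets the checking logic be exercised without network access by passing in a function that returns canned status codes.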
Site 2 meets all design objectives except the sensitivity objective. Document creation and modification are simple provided that the authors have access to software that creates SGML-conforming HTML and supports referencing of non-HTML-defined entities. Because the web-browser-ready HTML is generated only once after document modification rather than dynamically in response to queries, the web-browser-ready HTML documents can be accessed using the Web's general purpose search engines. Context-sensitive searching is not supported because HTML's tag set is not application-specific. Ease of maintenance is achieved, although it requires a complex and possibly expensive infrastructure. The only limitations on luminosity are those self-imposed by the content providers.
Site 2 is best suited for loosely structured web sites with multiple content providers where the content providers are free to provide hyperlinks to wherever they please. Many of today's web sites fit this description. Site 2 also makes sense for web masters whose Internet service provider does not provide them with support for CGI scripting as part of their site hosting package. Depending on the size of the web site, implementing Site 2 may be possible using a low-powered database implementation, or it may require a full featured SQL (Structured Query Language) or object-oriented database engine [ULLMAN].
XML is likely to increasingly impact web site design as the standards continue to mature and as support from software vendors increases. As metadata formats based on RDF gain acceptance, we can expect to see more and more web sites use application-specific metadata tags and an increasing number of search engines on the Web supporting queries based on those tags. Metadata support is likely to be among the earliest benefits of XML realized because only minimal modifications to existing HTML documents are needed. Also, metadata support does not depend on the newer, less-stable XML standards such as the linking and pointer languages[XLINK][XPOINTER] (XLink and XPointer), which are in draft status at the date of this writing.
XML is likely to eventually eliminate the need for SGML applications to down-translate their data to HTML for display on the Web. Many existing SGML DTDs already conform to XML or can easily be made to do so. Even if a DTD is not XML compliant, documents conforming to the DTD can in most cases easily be converted into equivalent XML documents using a tool such as sx. As standards for XML style sheets emerge, and as web browser vendors implement support for these style sheet standards, it will become possible to display XML without having to translate it to another format first. This will simplify the design of web sites like Site 1 by eliminating the need for an SGML-to-HTML translation step. Documents in web sites like Site 2 will be able to be both written and displayed as XML, eliminating the need to manage the normalization of entity references and marked sections. Finally, support for XML style sheets in web browsers will remove the formatting limitations imposed by HTML's fixed tag set, making it easier to display documents requiring specialized formatting instructions.
XLink and XPointer, if the major Internet browsers eventually support them, will expand the possibilities for hyperlinking beyond what HTML currently allows. In particular, they will make transclusion, the “dynamic inclusion of data from one document in another”[TRANSCLUSION], feasible on the Web. Thus it will become possible for document authors to reference specific portions of other documents on the Web (using XPointers) and specify how the referenced data should be presented to the web surfer (using XLink's actuate and show attributes), enabling authors to tailor the referenced data's presentation to the referencing context. For example, data could be referenced in one context using a hot button to be clicked on, while in another context the same data could be seamlessly embedded into the web page currently being displayed.
Figure 5 shows the design for a web site of the future patterned after Site 2. Hyperlinked documents are written using XML with XLink and XPointer and displayed using XML style sheets. The hyperlinks may point either to objects in other XML documents or to objects in arbitrary non-XML documents. Depending on the size of the site, it might be desirable to store the XML documents and/or the non-XML documents in databases in order to speed access to the objects referenced by the hyperlinks.
Yet another potential benefit of XML to web site developers is improved integration between web server and client. Support for XML in web browsers will allow sites to include applets that manipulate XML data structures in the web client[BOSAK]. This will reduce the burden on the server and, more importantly, will open a new world of possibilities for interaction between SGML/XML repositories and other databases and applications. If future versions of mainstream desktop applications are able to import XML, new possibilities for mining the content of a site will become available to web surfers. For example, a surfer might query an on-line database, import the results into a spreadsheet application, and create a bar chart or pie chart from the results.
Sites 1 and 2 are both under development, and the portions of them that have been implemented are already in use. Site 1's underlying framework - the data access interface to its SGML search engine - is fully implemented. Although the UoF example discussed in the sub-section "Site 1: HTML Dynamically Generated From SGML Database Queries" covers only a tiny subset of STEP, a much wider variety of queries is possible given the application-specificity of the STEP DTD. Current efforts are focusing on making improvements to the DTD in order to increase its modularity and make it easier for authors to use. Future efforts may involve developing a document architecture for STEP and related standards using architectural forms[HYTIME] and creating additional DTDs using this architecture.
Development of Site 2 is in its early stages. The translator for converting SGML-conforming HTML documents to browser-ready HTML and a catalog of entities currently exist. I maintain a variety of web pages with source written in SGML and using the entity catalog. Since the document management database discussed in the previous section has not yet been implemented, I check my documents manually using a third party HTML link validator. Once implementation of the document management database is complete, this web site will be expanded to include more documents and to support content providers other than myself.
Sites 1 and 2 illustrate a dilemma faced by today's web site developers who wish to take advantage of the benefits of SGML. On the one hand, they can rely heavily on SGML's ability to represent data in an application-specific, structured manner and on CGI to dynamically generate browser-ready web output in response to SGML database queries. While such a site design enables users to quickly find information through application-specific queries and is easier to maintain than a collection of HTML documents, it requires extra effort on the part of content providers, additional server overhead, and the implementation of hyperlinking if links to off-site web pages are desired. On the other hand, web site developers may choose to minimize the burden on content providers and to maximize server performance, interoperability with web search engines, and linkage with other web sites. In this case, they must sacrifice application-specific structured query capability and implement tools for managing entities and maintaining hyperlinks.
The emerging XML standards promise to provide web site developers with the best of both worlds, allowing them to enjoy most of the benefits of SGML while not sacrificing the convenience of HTML and interoperability with the rest of the Web. If XML is ultimately successful, not only will it be easier for web site developers to use SGML, but also they will be able to take advantage of newly available capabilities to make their content easier for users to read and easier for web clients and other desktop applications to interpret.
More information about the work discussed in this paper is available on the Internet at http://www.nist.gov/apde.
 The CGI specification is on-line at http://hoohoo.ncsa.uiuc.edu/cgi/.
 Third party software tools are identified in this paper to foster understanding. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the products identified are necessarily the best available for the purpose.
 This software tool is part of James Clark's sp package, available at http://www.jclark.com/sp.