Basic HTML templates have numerous problems. They stem from the fact that the information on a page is inseparable from its layout. It is possible to break out the styling with CSS, but that is only half the job. A more complex but more flexible publishing methodology, built on XML and related technologies, alleviates this issue. Specifically, the procedure involves DTD, XML and XSLT (Extensible Stylesheet Language Transformations). An introduction to XSLT will be presented in the next article in this series, which will follow shortly.
XML, the Extensible Markup Language, has become more popular recently. In no small part this is due to the adoption of the format by financial institutions for data exchange (for example FIXML, the XML encoding of the FIX protocol). XML can be used for information storage and for short data exchange, but it is not well suited for data storage. By data storage, as opposed to information storage, I mean large sets that can be easily and intuitively organized in tabular form and stored in databases. Information storage, on the other hand, is a way of representing published works that are long enough, beyond some arbitrary number of characters, that storing them in a database in any usable manner would be impractical. XML is very useful in conjunction with databases: the database can store the metadata related to an article, while the article itself is stored in some XML format. XML is also very verbose, which makes it uncomfortable to use for large data exchange; for that purpose CSV (Comma-Separated Values) is more practical in most situations. This article concentrates on the benefits of using XML for information storage.
XML allows the author to define his own vocabulary for the document at hand, one that best fits the information model. Tag names are arbitrary, and it is up to the creator to pick them, along with their order and the attributes they carry. A particular specification can be solidified using DTDs, as discussed later. With intuitively named markup, documents become eminently readable by people. Because they are structured, they are also easy to manipulate in software.
XML is not so much a language as it is a set of grammar rules that allow the user to define his own structured specifications for data markup. XML, for example, specifies that all tags must be either self-closing or have an end tag. Additionally, all attributes must have quoted values associated with them. (Earlier versions of HTML allowed the user to shorten attributes to just the name, with no proper, quoted value. This is no longer true in XHTML 1.0, which essentially amounts to a reformulation of HTML 4.01 in XML.) All tags must be properly nested inside a single root element. A document should also begin with an XML declaration; XML 1.0 does not strictly require it, but including it is recommended.
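The rules above can be seen together in a very small document. This is only an illustrative sketch: the element names note, line and break and the priority attribute are hypothetical and are not part of the template developed in this article.

```xml
<?xml version="1.0"?>
<!-- A single root element encloses everything else -->
<note priority="high">
  <!-- A start tag paired with an end tag -->
  <line>Every start tag has a matching end tag.</line>
  <!-- A self-closing (empty-element) tag -->
  <break/>
  <line>Attribute values are always quoted.</line>
</note>
```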
In the Simple website templates article there is a basic example of a template page that has four main components. There is the page itself, which contains three parts: a header, a footer and the main body. Before this is translated into XML, let's look at the common elements all XML files contain. The only relevant part for now is the xml version directive.
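As a minimal sketch, the version directive is the first line of the file and looks like the following. The encoding attribute is optional and is an assumption added here for illustration; on its own this line is not yet a complete document, since a root element must follow it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
```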
Now let's consider our page components. A first attempt at a template follows.
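A sketch consistent with the description so far, assuming the four components are named page, header, content and footer as they are later in this article, might look like this:

```xml
<?xml version="1.0"?>
<!-- The page is the root element and holds the three parts -->
<page>
  <header></header>
  <content></content>
  <footer></footer>
</page>
```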
The basic code above is well-formed XML. A document is called well-formed when it satisfies the syntax rules of the XML specification. A few properties make it so. It begins with the xml declaration. It has a single root element. All tags are properly closed. All tags are properly nested, that is, no tag is closed while a tag opened inside it is still open. An incorrect sample is provided below. As you can see, the page tag is closed before the content tag is.
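The nesting error described above can be sketched as follows, again assuming the page, header, content and footer names. A parser would reject this file at the closing page tag.

```xml
<?xml version="1.0"?>
<!-- NOT well-formed: shown only to illustrate the error -->
<page>
  <header></header>
  <content>
  <footer></footer>
</page>      <!-- page is closed while content is still open -->
</content>
```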
The code in example two works fine, but it does not provide any detailed information about the document. The page header may contain information such as the title of the page, perhaps its author, and the publication date. Rather than being a stream of characters, the date should be composed of easily addressable parts, such as the year the document was created. All of this could be included inline in the header as plain text, but the document would be much more useful if the information could be easily extracted from it. The addition of the aforementioned elements is shown in example four.
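One way the extended template could look is sketched below. The title, author and date elements and the year, month and day attributes follow the names used later in this article; every value shown is a placeholder chosen for illustration.

```xml
<?xml version="1.0"?>
<page>
  <header>
    <title>Page title</title>
    <author>Author name</author>
    <!-- The date is split into addressable parts instead of one string -->
    <date year="2005" month="03" day="14"/>
  </header>
  <content>Main body of the page.</content>
  <footer>Footer text.</footer>
</page>
```

With the date stored this way, a later transformation can pick out the year alone without parsing a free-form string.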
It is important to keep in mind that information contained in the XML file need not all appear in the final published version of that file in XHTML or another output format. Only a subset of the data may be presented and in a completely different order than it is composed in the XML file. This type of conversion or publishing will be done with the help of XSLT. The XML source simply exists as a means of storage, and as such should contain a good amount of detail about the information in it.
The document developed in the previous section works as a template, but more can be done to further the task of component separation. Once we have a good idea of what our data will look like, we need to create a document type definition (DTD) file. The DTD describes the order, number and hierarchy of the tags used in our XML template. DTDs exist so that it is easy to validate a particular document against the specification it is supposed to follow. HTML files contain a DOCTYPE declaration that names the DTD they claim to follow; browsers use it to decide how to render the document, and the W3C HTML validator service uses the DTD for the specified HTML version to check compliance. DTDs can also be used as templates. It is possible to define the minimum required set of tags in the document. Once this is done, it is trivial to generate a template from the DTD rather than making one by hand. In this article the XML template came first. However, as the reader becomes more proficient at defining XML structure, writing the DTD will become the first step in the process.
The syntax for specifying DTDs is simple. It provides for elements, attributes of elements, their respective order and number. The following snippet of code defines the elements in the above XML template.
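A possible set of declarations for the template is sketched below. It assumes the element and attribute names used throughout this article, and it assumes that content and footer hold plain text, which the article does not state.

```xml
<!-- template.dtd: a sketch of the declarations for the page template -->
<!ELEMENT page    (header, content, footer)>
<!ELEMENT header  (title | author | date)*>
<!ELEMENT title   (#PCDATA)>
<!ELEMENT author  (#PCDATA)>
<!ELEMENT date    EMPTY>
<!ATTLIST date
  year  CDATA #REQUIRED
  month CDATA #IMPLIED
  day   CDATA #IMPLIED>
<!-- Assumption: content and footer contain character data only -->
<!ELEMENT content (#PCDATA)>
<!ELEMENT footer  (#PCDATA)>
```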
The code above can be placed in template.dtd and will constitute a proper DTD document. Let us now look at the syntax. The ELEMENT declaration defines an element. There is one such declaration for every tag allowed in the XML document. Tags that have children list those children in the ELEMENT declaration of the parent, and each child tag then has its own declaration. It is possible to specify the order and the number of each of the child elements. The number is set by the character directly following the name of the child in the declaration of the parent. There are three such characters: '?' (none or one), '+' (one or more) and '*' (any number, including none). A name followed by none of these characters must appear exactly once. In the next code sample, element document has four possible children: content, source, author and date. This element can have one and only one child named content, any number of children named source, one or more elements named author, and none or one element named date.
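The document element described above can be sketched as follows. The commas also fix the order of the children, which is an assumption because the description only gives their counts; the declarations of the four children are likewise assumed, to keep the fragment complete.

```xml
<!-- content: exactly one; source: any number; author: one or more; date: none or one -->
<!ELEMENT document (content, source*, author+, date?)>
<!ELEMENT content (#PCDATA)>
<!ELEMENT source  (#PCDATA)>
<!ELEMENT author  (#PCDATA)>
<!ELEMENT date    EMPTY>
```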
If a tag has no children, its declaration must state whether it contains character data or is empty. The date tag in the template is empty: it has no children and cannot contain text. The title tag, on the other hand, may contain only a string of characters.
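The two cases correspond to the EMPTY keyword and the (#PCDATA) content model. A short sketch for the date and title elements of the template:

```xml
<!-- date has no text and no children; in a document it is written <date/> -->
<!ELEMENT date EMPTY>
<!-- title holds parsed character data, e.g. <title>Page title</title> -->
<!ELEMENT title (#PCDATA)>
```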
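The order in which tags may appear in a valid document can be defined. In the template above, the header element may contain its child tags in any order, whereas the page element must contain the header, content and footer tags in that order. The distinction lies in the separator used inside the parentheses of the content model: children separated by commas must appear in exactly the listed order, while children separated by a vertical bar are alternatives, and a repeated group of alternatives accepts them in any order.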
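The difference between a fixed and a free order can be sketched with the two declarations below. The form chosen for header is one common way to allow any order in a DTD and is an assumption about how the template was declared.

```xml
<!-- Sequence: header, then content, then footer, each exactly once -->
<!ELEMENT page (header, content, footer)>
<!-- Repeated choice: title, author and date in any order -->
<!ELEMENT header (title | author | date)*>
```

The repeated choice is permissive: it also lets a child be omitted or repeated. That is the usual trade-off for allowing an arbitrary order in a DTD.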
The ATTLIST declaration defines the attributes of a tag where they are needed. Not all tags in our template have attributes; only the date tag does, so it is the only one with an ATTLIST specified for it. The ATTLIST declaration gives the name of the tag it describes, followed, for each attribute, by the name of the attribute, its data type and its default declaration. The date tag must have the year attribute defined, and that attribute must contain character data. The day attribute of the same element, on the other hand, is not required, and neither is the month attribute. A required attribute is marked #REQUIRED and an optional one #IMPLIED; the default declaration cannot be left out.
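For the date element, the attribute list described above could be written as in this sketch. CDATA is the data type for plain character data.

```xml
<!ATTLIST date
  year  CDATA #REQUIRED
  month CDATA #IMPLIED
  day   CDATA #IMPLIED>
<!-- Valid:   <date year="2005"/>  and  <date year="2005" month="03" day="14"/> -->
<!-- Invalid: <date month="03"/>   because the required year is missing -->
```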
Linking an XML document to a DTD is done the same way as in HTML documents, with one more consideration. HTML specifications are public, so the doctype line is marked PUBLIC and links to the URL of the DTD. Developers may not wish to publish the document specifications they have developed. In that case the doctype line is marked SYSTEM and lists a local system path to the DTD. Examples of both are in the next code snippet.
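The two forms can be sketched as follows. The first line is the standard XHTML 1.0 Strict doctype; the second assumes the private DTD is saved as template.dtd next to the document. Only one doctype line appears in a real file, placed after the XML declaration and before the root element. The name after DOCTYPE is page here because it has to match the root element of the template.

```xml
<!-- Public DTD: a public identifier followed by the URL of the DTD -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<!-- Private DTD: a local system path only -->
<!DOCTYPE page SYSTEM "template.dtd">
```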
Once the DOCTYPE directive is attached to the XML document, the document can be validated against the DTD it is linked to. If the document complies with the DTD, it is called a valid XML document of the type named in the DOCTYPE directive. In the example above, the XML document would be a valid sitedoc document, were it to comply with the DTD. Note that the name given in the DOCTYPE directive must match the root element of the document.
XML templates provide a robust and scalable way of managing data. This will become more evident after the concepts of XSLT are introduced. Unlike their HTML+CSS counterparts, XML templates allow for a high level of flexibility while maintaining a specification defined by the linked DTD. Storing information as XML sources allows content creators to easily publish their articles in a multitude of formats, and to change the style and presentation by modifying only the source style sheets instead of every content file, as HTML templates would have required.
Once the DTD is written and XML documents are generated it is time to publish them. This is the topic of the next article in this series. For a more in-depth look at DTDs, check out More about DTDs.
The XML 1.0 specification is available from the W3C.