Week 2: XML and TEI

What is XML

Last week we learned HTML, or HyperText Markup Language. Today we will be discussing XML, or eXtensible Markup Language. While HTML is primarily concerned with describing the display of content, XML is designed for classifying content into categories.

Ok, ok, I know last week we said that in the current versions of HTML, it's actually CSS that handles display and HTML is used for classifying content into categories...that's true, but the categories that HTML uses are still more or less associated with display. XML can be used to classify things in categories that likely have nothing to do with display. For instance, XML will allow you to define the character who is speaking in a play. It is unlikely that you would want to display any one character's words in a format that is different from another's (Hamlet's verse speeches, for instance, will likely be represented in the same way as those of Horatio), and so you would probably not create a "div" tag with a class "HamletSpeech." You might, however, create a similar tag in XML.
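As a rough sketch of what such tagging might look like, the short file below marks who is speaking in two lines from Hamlet. The tag names (scene, speech) and the attribute name (speaker) are made up for this illustration; they are not taken from any standard.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Hypothetical tag and attribute names, chosen only for illustration. -->
<scene>
  <speech speaker="Horatio">Hail to your lordship!</speech>
  <speech speaker="Hamlet">I am glad to see you well.</speech>
</scene>
```

Nothing here says how either speech should look on screen; the markup only records who is talking, which is the kind of category XML is meant for.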

Why XML

XML developed out of SGML (Standard Generalized Markup Language), which was created to store data in a form that was machine independent. The storage of data in an organized and searchable form has long been a primary goal of computing. The problem is that computers and software are updated often and quickly become obsolete. There is a real danger that data that depends upon proprietary software will become useless when the software can no longer be run. XML helps to solve this problem by encoding data in a text file using a regularized format that can easily be processed by any computer (at least any computer that resembles what we have, for the last 25 years or so, called a computer).

The X in XML stands for eXtensible, which refers to the fact that, unlike HTML, there is no theoretical limit to the number of tags that can be used in XML. For instance, <div id="HamletSpeech"> and <HamletSpeech> are both perfectly good XML tags. Each project defines its own set of tags that are considered appropriate for its particular goals. These tags are defined and organized into a hierarchy in a set of rules. Originally, these rules were known as a Document Type Definition, or DTD. For the last ten years or so, XML users have been developing other ways of representing the rules for a particular set of XML tags (the most common being RELAX NG schemas), but DTDs probably remain the most common way of creating these rules.

Designing a DTD

Although this may soon be as obsolete as programming on punch cards, let's spend a few minutes learning how to construct a rule list for DTDs. W3Schools, which we discussed last week, maintains an excellent tutorial on DTD building at http://www.w3schools.com/dtd/default.asp, but I have included a shorter lesson below.

DTDs consist of three parts: Entities, Elements, and Attributes. You are already familiar with the last two. Elements are tag names. In HTML (which can more or less be thought of as XML), DIV is an element. The attributes in XML are the same as the attributes in HTML. A DIV element can include (among others) the attributes "id" and "class". An entity is a shortcut for a character or series of characters that may be used in the text. For example, in HTML, the less than sign (<) cannot be represented simply by typing < because the web browser will interpret this as the beginning of a tag. Instead you have to type "&lt;". The ampersand (&) signifies (in both HTML and XML) the beginning of an entity and a semi-colon (;) represents the end. Therefore "lt" is the entity for a less than sign in HTML. In a DTD, you define an entity as follows:

<!ENTITY signature "Doug Reside, Ph.D.">

In the above example, any time the XML parser encountered the text &signature;, it would replace it with "Doug Reside, Ph.D."
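To see the entity at work, here is a minimal sketch of a complete file that declares it and then uses it. The letter element is a made-up name for this illustration, and the DTD is written directly inside the file (between the square brackets) so that the example is self-contained.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE letter [
  <!-- "letter" is a hypothetical element that may contain only text. -->
  <!ELEMENT letter (#PCDATA)>
  <!-- The entity from the lesson above. -->
  <!ENTITY signature "Doug Reside, Ph.D.">
]>
<letter>Sincerely, &signature;</letter>
```

A parser reading this file should hand on the text of the letter with the full name in place of &signature;.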

An element declaration defines the tag name and the tags that can come immediately under that tag in an XML document (its "children"). For example, the BODY tag in HTML can enclose a DIV tag, so DIV is a child of BODY. If the element can contain text that is not tagged further, its declaration should include #PCDATA, which stands for "Parsed Character Data": the parser still reads through this text looking for entities (which it will often contain). (The related keyword CDATA, for plain "Character Data," is used for attribute values rather than in element declarations, as we will see below.) Elements are defined in the DTD as follows:

<!ELEMENT BODY (H1, DIV, SPAN) >

In the above example, an element named BODY is defined as having three children: H1, DIV and SPAN. Now, this example does not really represent the HTML BODY element for several reasons. First, the actual BODY element has many more children (in fact virtually every HTML tag is a child of BODY), and it can also contain plain text. Also, this definition says that the BODY element has one, and only one, H1 element, one, and only one, DIV element and one, and only one, SPAN element (no more, no less), and that the H1 element must precede the DIV element, which must come before the SPAN element in the code.
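The same comma rule can be sketched in a complete file. The tag names here (speech, speaker, line) are hypothetical and chosen to match our play example; the point is only that a comma-separated list fixes both the number and the order of the children.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE speech [
  <!-- A speech holds exactly one speaker, then exactly one line, in that order. -->
  <!ELEMENT speech (speaker, line)>
  <!ELEMENT speaker (#PCDATA)>
  <!ELEMENT line (#PCDATA)>
]>
<speech>
  <speaker>Horatio</speaker>
  <line>Hail to your lordship!</line>
</speech>
```

If you swapped the speaker and line elements, or added a second line, a validating parser should report that the file no longer follows its DTD.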

To allow the greater variety of arrangements of text, DIV and SPAN elements that the BODY tag actually permits, we need to make a more complicated rule.

Rules for children

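To specify the order and number of occurrences of the children of an element, we need to use the special characters reserved for use by DTDs. They are *, + and ?, and each is placed after the child's name (for example (DIV*, SPAN?) ). An asterisk (*) means the child may occur any number of times, including not at all; a plus sign (+) means it must occur at least once; and a question mark (?) means it may occur once or not at all.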

If a child is not followed by any of these special characters, it must occur once, and only once.

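If you want to specify that the code may include one of several options, connect these elements with a vertical bar | rather than a comma (in programming a vertical bar conventionally represents the word "or", as in "either/or"). You can also enclose a set of children in parentheses and then apply the special character to the whole group. So, for example, (H1,(DIV|SPAN)*) means there must be one, and only one, H1 element, which can be followed by any number of DIV and SPAN elements (which can appear in any order). Text is a special case: XML only lets #PCDATA share a rule with child elements in the form (#PCDATA|DIV|SPAN)*, with #PCDATA listed first and the whole group starred, which means any amount of text mixed with any number of DIV and SPAN elements in any order.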
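The sketch below puts the special characters and the vertical bar together in one small file. All of the tag names (play, title, subtitle, act, speech, stage, emph) are hypothetical, and the stage direction is shortened for illustration.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE play [
  <!-- One title, an optional subtitle, then one or more acts. -->
  <!ELEMENT play (title, subtitle?, act+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT subtitle (#PCDATA)>
  <!-- An act holds any number of speeches and stage directions, in any order. -->
  <!ELEMENT act (speech | stage)*>
  <!-- A speech is text that may have emph elements mixed into it. -->
  <!ELEMENT speech (#PCDATA | emph)*>
  <!ELEMENT stage (#PCDATA)>
  <!ELEMENT emph (#PCDATA)>
]>
<play>
  <title>Hamlet</title>
  <act>
    <stage>Enter Horatio</stage>
    <speech>Hail to your <emph>lordship</emph>!</speech>
  </act>
</play>
```

This file leaves out the subtitle, which the question mark allows, and has a single act, which satisfies the plus sign.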

Attributes

Attributes are defined in DTDs with the following syntax:

<!ATTLIST DIV
id CDATA #IMPLIED
class CDATA #IMPLIED>

The word "!ATTLIST" specifies that you are beginning the list of attributes for the element which is named in the next word, in this case DIV. By convention, this is followed by a new line (but need not be). You then name the attribute (in the first example, "id") and give its type. The most common types of attributes are CDATA (that is, plain text), ID (a reserved type whose value uniquely identifies an element), IDREF (a reference to another element), IDREFS (references to a list of other elements), or a list of values enclosed by parentheses and separated by vertical bars (for example color (red|blue|green)).

The attribute type is then followed by the default value. You can specify this value as follows:

<!ATTLIST bookCover color (red|blue|green) "red">

However, you can also specify that the attribute is optional (as are id and class in the HTML DIV tag) by setting the default to "#IMPLIED" like this:

<!ATTLIST bookCover color (red|blue|green) #IMPLIED>

You can also require that the user define a value, by setting the default to "#REQUIRED" using the same syntax as above.
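Here is a minimal sketch that combines the three kinds of default in one file. The shelf element and the id, seeAlso and note attributes are hypothetical additions to the bookCover example above.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE shelf [
  <!ELEMENT shelf (bookCover+)>
  <!ELEMENT bookCover (#PCDATA)>
  <!ATTLIST bookCover
    id      ID               #REQUIRED
    seeAlso IDREF            #IMPLIED
    color   (red|blue|green) "red"
    note    CDATA            #IMPLIED>
]>
<shelf>
  <!-- No color is given here, so the parser supplies the default, "red". -->
  <bookCover id="b1">Hamlet</bookCover>
  <!-- seeAlso must name an id that exists somewhere in this file. -->
  <bookCover id="b2" seeAlso="b1" color="blue" note="second copy">Hamlet</bookCover>
</shelf>
```

Note that ID values must begin with a letter (or underscore), which is why the ids here are b1 and b2 rather than 1 and 2.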

The Text Encoding Initiative (TEI)

In 1987, a group of humanities scholars decided to develop a standardized set of tags for encoding texts of interest to them. This group called themselves the Text Encoding Initiative, or TEI, and their rules have been generally accepted by most of those working with electronic editions of texts. The group continues to update and perfect its rules, and the current version, called P5, has just been released. TEI maintains a website at http://www.tei-c.org/, runs an email discussion list (which can be subscribed to via a web form at http://listserv.brown.edu/archives/cgi-bin/wa?SUBED1=tei-l&A=1), and holds an annual conference (which, this year, will be held at the University of Maryland).

Because TEI defines the rules for many types of texts, it also has many different DTDs. Although now slightly outdated by the fifth generation of TEI, which attempts to move from DTDs to schemas, Lou Burnard's TEI Pizza Chef website is an easy way to construct the DTD you need for your specific text. It is located at http://www.tei-c.org/pizza.html.

After you've created the DTD, you can tag a document in XML using the tags in your DTD in much the same way you would tag an HTML document. Remember, though, to begin every XML document with:

<?xml version="1.0" encoding="ISO-8859-1"?>
and to save the file with the extension ".xml".
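To tell the parser which DTD your document follows, you can add a DOCTYPE line just after the opening line. The sketch below assumes you saved your DTD in the same folder under the name "mytei.dtd"; that file name is a placeholder, not something the Pizza Chef necessarily produces.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- "mytei.dtd" is a placeholder file name; use the name of the DTD you saved. -->
<!DOCTYPE TEI.2 SYSTEM "mytei.dtd">
<TEI.2>
  <!-- teiHeader and text go here -->
</TEI.2>
```

The word after DOCTYPE must match the outermost tag of the document, here TEI.2.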

XSLT

After you encode your XML document, you'll want to display it. The most common way to do this is through XSLT, or eXtensible Stylesheet Language Transformations. XSL (eXtensible Stylesheet Language) is complicated, and again our best friend is probably the w3schools page at http://www.w3schools.com/xsl/. However, as with HTML, you can do a lot with just one or two commands. Here, the most useful ones are probably the template commands.

You always begin an XSL Stylesheet with the following tags:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
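Like any other tag, the opening stylesheet tag must eventually be closed. As a minimal sketch, here is about the smallest complete stylesheet one could write; the single rule and the "Hello from XSLT" text are placeholders for illustration.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- One rule: when the processor reaches the top of the document ("/"),
       write out this small HTML page. -->
  <xsl:template match="/">
    <html>
      <body>
        <p>Hello from XSLT</p>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

Everything you add to a stylesheet goes between the opening and closing xsl:stylesheet tags.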

You then create a series of rules that crawl down through the hierarchy of your XML document and define what the parser should do when it encounters each element. Consider a TEI-XML version of the poem we tagged last week:

<?xml version="1.0" encoding="ISO-8859-1"?>
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>The Red Wheelbarrow</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <l n="1" id="1">so much depends</l>
      <l n="2" id="2">upon a red wheel</l>
      <l n="3" id="3">barrow glazed with rain</l>
      <l n="4" id="4">water</l>
      <l n="5" id="5">beside the white</l>
      <l n="6" id="6">chickens.</l>
    </body>
  </text>
</TEI.2>

These rules define how to display the elements of the XML document. Consider the following example:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Start at the root TEI.2 element and build the HTML page around it. -->
  <xsl:template match="/TEI.2">
    <html>
      <head>
        <style type="text/css">
          .poem {
            background-color: grey;
            color: white;
          }
        </style>
        <title>
          <xsl:value-of select="./teiHeader/fileDesc/titleStmt/title"/>
        </title>
      </head>
      <body>
        <xsl:value-of select="./teiHeader/fileDesc/titleStmt/title"/>
        <!-- Hand the text element over to the rule below that matches it. -->
        <xsl:apply-templates select="text"/>
      </body>
    </html>
  </xsl:template>

  <!-- Wrap the whole text in a div styled by the .poem CSS class. -->
  <xsl:template match="text">
    <div class="poem">
      <xsl:apply-templates select="./body"/>
    </div>
  </xsl:template>

  <!-- Print each verse line followed by an HTML line break. -->
  <xsl:template match="l">
    <xsl:value-of select="."/><br />
  </xsl:template>

</xsl:stylesheet>

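Think of each tag as a new folder in a directory on your computer. The command

<xsl:template match="XXXX">
says "create the following code when the parser encounters folder XXXX," while the command

<xsl:apply-templates select="XXXX"/>
says "now step into folder XXXX and follow whichever rule matches it."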
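To make the folder picture concrete, here is a minimal sketch of a stylesheet with just two rules for the body and l elements of the poem. It is a cut-down illustration, not a replacement for the fuller stylesheet above.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- On reaching a body folder: write a div, then step into
       each l folder found directly inside it. -->
  <xsl:template match="body">
    <div>
      <xsl:apply-templates select="l"/>
    </div>
  </xsl:template>

  <!-- On reaching an l folder: write its text and a line break. -->
  <xsl:template match="l">
    <xsl:value-of select="."/><br/>
  </xsl:template>

</xsl:stylesheet>
```

For folders that have no rule of their own (here TEI.2, teiHeader and text), the processor falls back on built-in behavior: it keeps stepping into the sub-folders and copies out any plain text it finds along the way.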