Document Markup
Or: "The Good, the Bad, and the Ugly"
I have consulted for numerous clients with existing web sites to which they wanted to add new functionality. Occasionally, the client knows their site has problems and asks me to start over, keeping just their raw information. However, most clients aren't aware that their current site is a fairly out-of-date mess, either because the person they paid to create it did so long ago or because that person built it before they really understood markup.
I've also seen the problem come from sites that were originally created using WYSIWYG editors like Dreamweaver or FrontPage. For personal pages and very small web sites, I don't think that using a WYSIWYG editor is a bad thing. It helps beginners get rolling, since memorizing markup is a fairly time-consuming effort if you're only going to do it once or twice in your life. However, if your site has any commercial or industrial applications, you simply cannot afford the bloat and non-conformance of the markup that is generated by even the best WYSIWYG editor.
Fortunately, though, converting older, non-conforming documents into nice, modern documents is very simple. I've personally converted a site with around 150 separate documents to modern markup in about three hours. Additionally, there are now software utilities available to help automatically convert old markup to new markup.
Intended Audience
Before we start wading through the details, it's important that you have at least a little experience creating a basic web page. Some of the terms with which you should be familiar include "tag," "attribute," document structure, the document tree, and the document object. If you don't already know that all web documents are internally-structured documents that are arranged by the placement of SGML-like tags in a plain text file, well, be advised.
SGML is, simply, the idea that a document specifies structure and information by separating regions of a text file with "tags": keywords (and, possibly, additional information) enclosed between angle brackets (< and >). Many modern applications store their data files internally as XML (which borrows SGML's basic tag concept). Even for what are traditionally called "binary" files (images, movies, music, etc.), some software is starting to use tagged, XML-style formats since they give programmers more flexibility in reading and writing such files. If you've ever viewed the source of an HTML page, you've seen an SGML-style file. If you've poked around the files that current web browsers (like Mozilla Firefox) use to store information about settings, bookmarks, histories, etc., you've probably seen XML files.
Why Are Most Sites Using Old Markup?
It's important to understand that most web sites on the web right now (early 2005) do not use modern markup. A lot of the time the reason for this is pure ignorance. I worked in the web development department of a large company only months ago and found that only one of the three other developers was even interested in using current markup standards. The others seemed to understand that there had been a few changes in markup, but since their old techniques still worked in the web browsers they used, they didn't feel the need to update their markup.
A quick indication as to whether the developers of a web site even know or understand the current markup practices is to check whether they have included a document type reference at the top of their documents. A document type reference points the browser to what's called a document type definition, or DTD. It must come before everything else in the document (in XHTML, only an optional XML declaration may precede it) and it's written as a simple SGML declaration:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
If you see a tag similar to this one before the very first <html> tag, you know that the developer at least understands the basics of how to use modern markup techniques.
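As a rough illustration of where the declaration sits, here is a minimal sketch of a complete XHTML 1.0 Strict page. The title and paragraph text are placeholders invented for this example; only the declaration itself comes from the line quoted above.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
  <!-- Placeholder title; every document needs one -->
  <title>Example Page</title>
</head>
<body>
  <!-- In a Strict document, text sits inside a block-level tag such as p -->
  <p>Hello, world.</p>
</body>
</html>
```

Pasting a page like this into the W3C validator is a quick way to confirm that the declaration and the markup agree with each other.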
In a nutshell, a DTD tells the browser a lot of information about how the document should be constructed. What it doesn't say is anything about how the page should be graphically rendered. All it says is how information should be organized throughout the various sections of the document based on the types of tags that are used.
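In principle, the DTD allows us to create many different types of documents with different lexicons and still allow software to understand the information in the document without forcing its developers to predict future changes in document types. In theory, this means that if someone is using an up-to-date web browser from 2005 in the year 2010 and I create a web document in 2010 using a newer document type, the old browser could use the new DTD to make more appropriate guesses at how the information in the document should be arranged. In practice, today's browsers do not download the DTD. They mostly use the declaration to choose between a standards-compliant rendering mode and a backward-compatible "quirks" mode, while validators and other SGML and XML tools are the ones that actually read the DTD.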
Another reason why web sites are still using the older techniques is that the site was originally constructed when the code it uses was a part of the standard. The current standards were formalized in 1999. Web sites created before that time might conform to current standards since the current standards are really just a stricter interpretation of the old rules. But since no one said "you can't use tables to lay out a multiple-column web page" back then, and it worked so well, most designers went ahead and used a series of nested "table" tags to create multiple columns in their layout. It occurred to a few of us back then that a "table" really didn't seem appropriate, since tables should be used to display sets of related information and not to tell a browser where to position pictures and text.
To be fair, using tables for columnar layouts is not technically outside of the current standard. However, the tricks that most people use to get the table cells to line up and look good involve using attributes that are not a part of the current document type definition (HTML 4.01 Strict).
Finally, combine ignorance with the proliferation of cheap WYSIWYG editors, and most teenagers with enough time to play with the software end up with nearly marketable skills as web designers. A person with an intuitive sense of graphic design and visual perception can produce amazing-looking web sites using absolutely terrible markup (these designers rarely even look at the markup, so it's not at all their fault).
When the web was still a new concept, I think a lot of companies just told one of their employees to, "Get us one of those web sites." With very little knowledge themselves, the person in charge of arranging a new web site probably hired their computer-savvy nephews or daughters or other young relatives to build the web site. And, since it was an ancillary aspect of most businesses back then, a FrontPage site with mind-bogglingly bloated code soup that rendered reasonably well in Microsoft browsers was good enough (never mind the fact that some of those sites contain about 1000 times more markup than actual information).
So what are a few document types?
A document type defines the tag names and basic markup constructs that are allowed to occur in a document of that type. This could include semantics and syntax such as forcing XML-style tag attributes as opposed to allowing the looser HTML-style set of attribute indicators. The following is a list of some document types and the kind of code that you might see under that DTD.
- DTD: HTML (a document with no version specified is assumed to be HTML 2.0)
- DTD: HTML 3.2
These DTDs are hard to illustrate in code since documents written for them look a lot like HTML version 4 documents. The difference is that these older standards offer fewer features (HTML 2.0 has no tables, for example, and neither version has frames or several other elements that are commonly used today).
- DTD: HTML 4.01
Most books and tutorials about writing HTML these days orient themselves to teaching HTML 4.01. Most WYSIWYG editors also try to produce 4.01 code, even if they get it wrong. Within HTML 4.01, there are three specific document types: Transitional (called "loose" here, after the name of its DTD file, loose.dtd), Strict, and Frameset.
- HTML 4.01 Loose (officially "Transitional")
For people trying to upgrade their site from old markup to new markup, "loose" documents will probably be the easiest to use. The loose DTD will allow you to keep using some of the older methods of layout (like "color" and "bgcolor" attributes) in the hopes of bridging the gap between using the older markup standards and current standards.
<H2><FONT color=blue><b>Some Text</b></FONT></H2>
<p><font color="green"><b>Some Other Text</b></font>
- HTML 4.01 Strict
The "strict" document type is what most developers consider the current standard. For people who don't care much about the internal code of their documents, but want their document to be useful to most current and future browsers, this is the preferred standard.
<H2>Some Text</H2>
<p>Some Other Text
The big difference here is that we've removed all the font and color markup. This is an important difference between the older and newer documents. It doesn't mean you give up control over the visual format of your document. Quite the contrary: it means that you have even greater control over the layout.
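A few tags that you may be used to using that are no longer accepted include "center," "font," and "u." The "b" and "i" tags are technically still allowed, but since they describe appearance rather than meaning, it's better to replace them with "strong" and "em" or with a style rule. Most of the attributes of tags that define any kind of visual formatting (like colors, background images, borders, etc.) are also no longer accepted. All text layout, font formatting, colors, and positioning is now done through style sheets, usually external ones (more on these later).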
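To make the trade concrete, here is a minimal sketch of how the blue and green bold text from the Loose example could be reproduced under the Strict DTD. The title is a placeholder, and the rules are placed in a "style" tag only to keep the sketch in one piece; on a real site they would normally live in an external file.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <title>Strict Example</title>
  <!-- Inlined for illustration only; an external style sheet
       referenced with a link tag is the usual home for these rules -->
  <style type="text/css">
    h2 { color: blue;  font-weight: bold; }
    p  { color: green; font-weight: bold; }
  </style>
</head>
<body>
  <!-- The body carries only structure and information -->
  <h2>Some Text</h2>
  <p>Some Other Text</p>
</body>
</html>
```

Changing the colors later means editing two lines of style rules rather than every "font" tag on every page.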
- HTML 4.01 Frameset
For completeness, it's worth mentioning that there is a third document type that allows the use of HTML frames. In 99.9% of all situations frames are not a valid solution for any web site. I have run into exactly one use for frames that could not otherwise be duplicated through other means and it's not likely that you're going to run into a completely justified reason to use frames any time soon. In simple terms: don't use frames.
- DTD: XHTML 1.0
The newer XHTML standards parallel the HTML 4.01 standards in most respects. There is a very important difference, though: XHTML documents can be validated and parsed by an XML parser. When you're writing and editing web documents, you may not realize the full implications of creating documents that conform to XML syntax. The subtle differences in tags, syntax, and constructs seem arbitrary at first. But you should consider using XHTML for all documents you create from now on. You can think of it as the difference between scribbling your document on a napkin with a crayon and writing it with a permanent marker on a piece of poster board: the first won't last more than 10 years of average use and the words might be hard to read, while the second is permanent and will withstand a lot of handling over the years without losing the meaning of the information.
The reason I say it will last longer is that the web is starting to favor transporting pure information which can be sent and received as XML documents. If you create your web document with XHTML (which is XML), you'll be sure that your document will be useful to software far into the future.
The interesting thing about using XHTML over HTML is that the document constructs are actually simpler than HTML's, since XHTML forces you to use a more consistent syntax.
- XHTML 1.0 Loose (officially "Transitional")
XHTML Loose allows the same tags and attribute keys as HTML 4.01 Loose. The difference, though, is that the document has to conform to XML syntax.
<h2><font color="blue"><b>Some Text</b></font></h2>
<p><font color="green"><b>Some Other Text</b></font></p>
Some of the changes you might notice are there because of the requirements of XML. Here's a quick list of things that you'll need to be aware of to make a document conform to XHTML's XML rules:
- All tag names must be lowercase.
- All attribute keys must be lowercase.
- All attribute values must be enclosed in quotation marks (they're not required to be lowercase since an attribute value is considered a part of the information and not a part of the document's structure).
- All tags must be closed. This rule gets tricky without the concept of a tag's content (its "character data," which the DTDs abbreviate as #PCDATA).
This is the text between an opening tag and a closing tag. For instance, "Some Text" is the content of the "b" tag in the previous example.
Many tags do not have content (such as img, br, hr, input, link, and meta). These "empty" tags are closed within the same tag in which they are opened:
<img src="" alt="" />
The final forward slash (/) and angle bracket (>) indicate that the tag is finished.
However, most tags do contain content (or, at least, should). Besides the six tags listed above and a few rarer ones (such as area, base, col, and param), you're not likely to encounter tags that are empty. All other tags you use must have a closing tag:
<p>Some Text</p>
This rule includes tags that people often do not close, like the "li" tag in a list or the "option" tag in a select form element.
- All flag-style attributes must be constructed like a key/value attribute:
<input type="checkbox" checked="checked" />
At first this seems crazy, since the old way was just to specify the word "checked" all on its own and the new form looks more complicated. In practice, though, it means that all attributes in all tags are now formed the same way:
key="value"
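The rules in the list above can be seen together in one small before-and-after sketch. The list items, the "size" field name, and the option values are all invented for illustration; the fragment is meant to sit inside a form in a larger page.

```html
<!-- HTML-style habits that an XML parser would reject:
     <UL>
       <LI>First item
       <LI>Second item
     </UL>
     <SELECT NAME=size>
       <OPTION VALUE=s>Small
       <OPTION VALUE=l SELECTED>Large
     </SELECT>
     <BR>
-->

<!-- The same fragment rewritten: lowercase names, quoted values,
     every tag closed, and flag attributes written as key="value" -->
<ul>
  <li>First item</li>
  <li>Second item</li>
</ul>
<p>
  <select name="size">
    <option value="s">Small</option>
    <option value="l" selected="selected">Large</option>
  </select>
  <br />
</p>
```

Both versions should look the same in a typical browser, but only the second can be handed to an XML parser.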
- XHTML 1.0 Strict
XHTML Strict allows the same tags and attribute keys as HTML 4.01 Strict. The idea is that if you want your document to keep working in browsers now and in the future, you need to write it for the strict specification. In my experience, as soon as you understand XML and the newer HTML standard (4.01 Strict), writing documents for XHTML 1.0 Strict requires no more effort and actually makes it easier to remember how to form the markup when editing by hand.
<h2>Some Text</h2>
<p>Some Other Text</p>
There are still some arguments about moving completely over to XHTML. I feel that most of them are moot, since a document that can be handled by any XML parser can very easily be converted into any new document format that can be imagined. Furthermore, since current browsers will handle XHTML written this way as ordinary HTML, you can revert with little more than a change to the document's type reference.
- XHTML 1.0 Frameset
Again, this DTD is included only for completeness. Just say "no."
Examples
I think the easiest way to demonstrate the reasons and practices of using modern markup is through some simple examples of things you'll probably find on many current web sites that still use the old markup. I'll list two samples of code followed by a side-by-side comparison of both samples.
The 1-pixel Border Box with a Title
I've seen this trick on hundreds of different sites and have even used the trick a few times myself (both the wrong and correct ways).
Old HTML Layout
<table width="100%" cellspacing="0" cellpadding="1"
bgcolor="#000000" border="0">
<tr>
<td>
<table width="100%" cellspacing="0" cellpadding="3"
bgcolor="#FFFFFF" border="0">
<tr bgcolor="#DDDDDD">
<td>
<font size="+2" color="#0000CC" face="Arial">
Box Title
</font>
</td>
</tr>
<tr>
<td>
<font color="#000000" face="Times New Roman">
This is some information.
</font>
</td>
</tr>
</table>
</td>
</tr>
</table>
XHTML 1.0/HTML 4.01*
<div class="fancy">
<h2>Box Title</h2>
<p>This is some information.</p>
</div>
Side-by-Side
| | Old Markup | Current Markup |
|---|---|---|
| Size (bytes) | 796 | 89 (89% savings) |
| Layout Data | Downloaded by the browser every time it views the information. | Downloaded once (for the whole site); the browser then remembers how to lay out the information as it changes from page to page. |
| Layout Format | Visual layout is inextricably convolved with the information. | Visual layout is completely separate from the information allowing us to easily change the layout without touching the information. |
| Presentation | Only suitable for color, graphical web browsers on computers with a monitor larger than 14 inches (may require a larger monitor if you have used pixel specifications of size instead of proportional specifications). | Used with a simple style sheet, this is suitable for color, graphical browsers, old browsers (such as Mosaic and early versions of Netscape), black and white, text/console browsers (such as Lynx), full-page printers, tape/card printers, PDA browsers, cell phone browsers, audible screen readers (for the visually impaired), and all modern web browsers on all platforms. |
| Indexing | Search engine spiders are unable to "weigh" the importance of the box's title any heavier than the content. This means that your web site has "weaker" indexing for all words on the page instead of "stronger" indexing for only the important words on the page. | Since the markup indicates what kind of information is being displayed, search engine spiders can easily index only the important words on your page which allows search engines to generate more suitable hits for your site. |
| Old Browsers | Older browsers may render the information, but the layout could be lost and make the information confusing. | Older browsers won't bother trying to render the layout elements (since we're using a style sheet). But since we're using document structure-oriented tags, what shows up in the older browser will, at least, make sense. |
| Future Browsers | Future web browsers may drop support for some of the tags used in this document with unexpected rendering results (if it renders it at all). | Future browsers will always support the basic document structure illustrated here. |
* It's important to note that the information needed to duplicate the colors, backgrounds, and borders of the box is supplied by a separate document: a style sheet. If a person views only one page of your site, the total amount downloaded is usually about the same as with the old markup. However, if a visitor views more pages, the style sheet is already cached by the web browser, which means that only the actual content is downloaded for each new page. On some web sites, the savings keep growing with every additional page the visitor views.
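For reference, here is a rough sketch of style rules that could reproduce the look of the old table-based box for the "fancy" div. The colors, fonts, and 3-pixel padding are read off the old markup above; the exact selectors and the zeroed margins are assumptions of this sketch, and the rules are shown in a "style" tag only for brevity.

```html
<!-- Moved into a shared external file and referenced with a link tag,
     these rules would be downloaded once and then cached -->
<style type="text/css">
  div.fancy {
    border: 1px solid #000000;      /* replaces the outer black table */
    background-color: #FFFFFF;
  }
  div.fancy h2 {
    margin: 0;
    padding: 3px;                   /* replaces cellpadding="3" */
    background-color: #DDDDDD;
    color: #0000CC;
    font-family: Arial, sans-serif;
  }
  div.fancy p {
    margin: 0;
    padding: 3px;
    color: #000000;
    font-family: "Times New Roman", serif;
  }
</style>
```

Every box on the site that uses the "fancy" class picks up these rules, so a redesign touches this one block and none of the pages.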
For most people involved in browsing or serving a site, the most important statistic is the size of the document. Here is just a small list of things that improve with smaller web documents:
- Necessary client bandwidth is reduced. This means people on modems are not penalized and people on broadband are able to browse your site as fast as they can click on the links.
- Necessary web server bandwidth and aggregate transfer are reduced. This is significant since it can reduce the costs of your site if you're paying for hosting. If your company is self-hosting, this simplifies hardware requirements and reduces ISP costs, which also makes the site cheaper to operate.
- Server CPU/hardware load is reduced. This means that a large and content-rich web site that serves many thousands of hits per day does not suffer from as much server lag. This makes your site easier to browse and a more useful resource to your visitors.
- Server disk space is reduced. This could mean the difference between a $100/month and a $500/month hosting charge for similarly-equipped web sites.
- Custom processing and filtering is faster. This allows you to implement extremely powerful content management devices on your site while incurring minimal penalties in your site's performance.
Certainly, this example is one of the more extreme examples, but it illustrates just one of the ways that a modern document is superior to the old HTML "hacks."
Graphical Enhancements
Almost all web sites that try to look fancy use graphics in one form or another to make the site more appealing to the eye. Unfortunately, this has led most sites to ignore all the other ways in which their site can be "read." An extreme example that most people point out is that some visually impaired people use screen readers to read the information on a web site to them. They aren't seeing the site at all, but they still want to be able to "read" the information on it. This is one of the reasons why markup that is specific to fonts and visual display (like the "b" tag, which indicates "bold") is being abandoned for more generically termed tags (like the "strong" tag, which indicates that a phrase is important compared to the surrounding text).
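A two-line sketch shows the difference between the two tags. The sentence itself is made up for the example.

```html
<!-- Presentational: says only "draw this in bold" -->
<p>Rats are <b>not</b> mice.</p>

<!-- Structural: says "this phrase is important"; a screen reader can
     stress it, and a style sheet decides how it looks on screen -->
<p>Rats are <strong>not</strong> mice.</p>
```

Most graphical browsers draw both lines the same way by default, which is why the distinction is easy to overlook.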
When a lot of designers lay their sites out, they start by creating the graphics that will go around the edges, in the background, or around little sections of the document on each page. The thing that distinguishes designers who are using outdated techniques is that they include those graphics in their web site using HTML markup; the modern techniques involve little if any HTML markup, and that's the point.
If you have a graphic that goes along the top of every page on your site, the last thing you want to use is an "img" tag. The "img" tag should only be used to display images that are actual components of the information you are communicating with that particular page. For instance, if you're writing an article about rats, you should probably throw in a few pictures of various rats in the article. Those pictures are just as much a part of the information as the headings and paragraphs of the article.
However, if you host your article about rats on a biology web site, the graphics that are used to make your page look like the rest of the site have nothing to do with your article. They're not a part of the information in your document. This means that those graphics are used purely for layout enhancement. All elements that simply enhance the layout to make the web site more visually appealing or even those that make the page more organized and easier to read, should be included by other means than markup.
Actually implementing this separation of markup and layout is tricky for a lot of designers since web browsers are still a little inconsistent in their implementation of the standards (especially when it comes to any Microsoft product). To achieve this separation, it is still possible that you will need to alter the markup to achieve certain visual effects. However, there is a difference between adding a little document structure to accommodate the possibility of more layout capabilities and adding markup to your document that has nothing to do with the information that is being presented on the page. I explore this area of confusion in depth in another article.
I can tell you this, though: almost all layout graphics can be added to a web page's layout simply by using CSS's "background" property. Again, this means that every time you want to change how your site looks, you do so in a style sheet and not in the markup.
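Here is a minimal sketch of that idea, using the rat article as the example. The file names (banner.png, brown-rat.jpg), the "banner" id, and the 80-pixel height are all hypothetical; the empty div is the kind of small structural concession mentioned earlier.

```html
<head>
  <title>All About Rats</title>
  <style type="text/css">
    /* Layout graphic: belongs to the site design, so it is attached
       through the style sheet rather than an img tag */
    #banner {
      height: 80px;
      background: #FFFFFF url("banner.png") no-repeat top left;
    }
  </style>
</head>
<body>
  <div id="banner"></div>
  <h1>All About Rats</h1>
  <!-- Content graphic: part of the article, so it stays in the markup -->
  <p><img src="brown-rat.jpg" alt="A brown rat" /></p>
</body>
```

Swapping the banner for a new design then means changing one "url" value in the style rules, with no change to the article's markup.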
In practice, a web site's "template" is usually implemented through some sort of templating software or mechanism on the web server. This forms a software-based solution that keeps browsers ignorant about how your original document actually looks and "wraps" it in a nice, pretty layout that adds more (hopefully, structural) markup and manages style sheets. All of my web site solutions involve a template solution in one form or another. Many templating systems also allow more advanced features for your document such as automatically generating links to specific areas of the document, adding links to external resources based on phrases in your document, highlighting words in your document based on a search query used to find the document, and many other features.
Conclusion
This article was written to help those who are wondering why the newer markup standards are better. Basically, if you want your information to last a long time and be effective in browsers and software that we haven't yet thought of, it's important that you update how you write or generate markup. Not only does this make your information more useful and forward-compatible, but it also makes it more accessible and easier to serve and browse. Could you imagine how much faster the Internet would be if all the web sites in the world all of a sudden dropped 80% of the size of their documents while still allowing visitors to read the same wealth of information?
If you would like more information about the standards for web documents, the World Wide Web Consortium is the definitive resource. Even if the articles on their site are a little over your head, they provide links to other sites with more pedestrian explanations of everything you need to know to bring your level of design standards into the 21st century and possibly beyond.