Document Markup

Or: "The Good, the Bad, and the Ugly"

I have consulted for numerous clients who wanted new or added functionality on their existing web sites. Occasionally, a client knows the site has problems and asks me to start over while keeping only the raw information. Most clients, however, aren't aware that their current site is a fairly out-of-date mess, either because the person they paid to create it did so long ago or because that person built it before really understanding markup.

I've also seen the problem come from sites that were originally created using WYSIWYG editors like Dreamweaver or FrontPage. For personal pages and very small web sites, I don't think that using a WYSIWYG editor is a bad thing. It helps beginners get rolling, since memorizing markup is a fairly time-consuming effort if you're only going to do it once or twice in your life. However, if your site has any commercial or industrial application, you simply cannot afford the bloat and non-conformance of the markup generated by even the best WYSIWYG editor.

Fortunately, though, converting older, non-conforming documents into nice, modern documents is very simple. I've personally converted a site with around 150 separate documents to modern markup in about three hours. Additionally, there are now software utilities available to help automatically convert old markup to new markup.

Intended Audience

Before we start wading through the details, it's important that you have at least a little experience creating a basic web page. Some of the terms with which you should be familiar include "tag," "attribute," document structure, the document tree, and the document object. If you don't already know that all web documents are internally-structured documents that are arranged by the placement of SGML-like tags in a plain text file, well, be advised.

SGML is, simply, the idea that a document specifies structure and information by separating regions of a text file with "tags," which are written by enclosing keywords (and, possibly, additional information) between angle brackets (< and >). Many modern applications store their data as XML (a simplified subset of SGML that keeps the same basic tag concept). Even for what are traditionally called "binary" files (images, movies, music, etc.), some software is starting to use XML or binary encodings of XML, since they give programmers more flexibility in reading and writing such files. If you've ever viewed the source of an HTML 4 page, you've seen an application of SGML. If you've poked around the files that current web browsers (like Mozilla Firefox) use to store some of their settings and profile data, you've probably seen XML files.
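To make those terms concrete, here is a small fragment of my own invention (the sentence and the URL are placeholders, not taken from any real site). It is only a sketch, but it shows a tag, an attribute, and how nesting one element inside another produces the document tree:

```html
<!-- "p" is a tag name; the opening <p> and closing </p> delimit one element -->
<p>
    <!-- "href" is an attribute: additional information inside the opening tag -->
    Read more at <a href="http://www.example.com/">this hypothetical site</a>.
</p>
<!-- In the document tree, the "a" element is a child of the "p" element -->
```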

Why Are Most Sites Using Old Markup?

It's important to understand that most web sites on the web right now (early 2005) do not use modern markup. A lot of the time, the reason for this is pure ignorance. Only months ago I worked in the web development department of a large company, and I found that only one of the three other developers was even interested in using current markup standards. The others seemed to understand that markup had changed in a few ways, but since their old techniques still worked in the web browsers they used, they didn't feel the need to update their markup.

A quick way to tell whether the developers of a web site know or understand current markup practices is to check whether they have included a document type reference at the top of their documents. A document type reference points the browser to what's called a document type definition, or DTD. It must come before everything else in the document (in XHTML, only an optional XML declaration may precede it), and it's written as a simple SGML declaration:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

If you see a tag similar to this one before the very first <html> tag, you know that the developer at least understands the basics of how to use modern markup techniques.
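For context, here is roughly where that line sits in a complete document. This is a minimal sketch rather than a page from a real site; the title, heading, and paragraph text are placeholders I made up:

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <!-- The DTD requires a title element inside head -->
    <title>Hypothetical Page Title</title>
</head>
<body>
    <!-- Only structure and information here; nothing about fonts or colors -->
    <h1>Hypothetical Heading</h1>
    <p>The actual information goes here.</p>
</body>
</html>
```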

In a nutshell, a DTD tells the browser a lot of information about how the document should be constructed. What it doesn't say is anything about how the page should be graphically rendered. All it says is how information should be organized throughout the various sections of the document based on the types of tags that are used.

The DTD allows us to create many different types of documents with different lexicons and still allow browsers to understand the information in the document without forcing browser developers to predict future changes in document types. In principle, this means that if someone is still using a 2005 web browser in 2010 and I create a web document in 2010 using a newer document type, the old browser could use the new DTD to make more appropriate guesses at how the information in the document should be arranged. (In practice, today's browsers do not download the DTD; they read the declaration mainly to decide how strictly to follow the standards when rendering, while validators use the DTD itself to check a document.)

Another reason web sites still use the older techniques is that the site was originally constructed when the code it uses was a part of the standard. The current standards were formalized in 1999. Web sites created before that time might conform to current standards, since the current standards are really just a stricter interpretation of the old rules. But since no one said "you can't use tables to lay out a multiple-column web page" back then, and it worked so well, most designers went ahead and used a series of nested "table" tags to create multiple columns in their layout. It occurred to a few of us back then that a "table" really doesn't seem appropriate, since tables should be used to display sets of related information and not to tell a browser where to position pictures and text.

To be fair, using tables for columnar layouts is not technically outside of the current standard. However, the tricks that most people use to get the table cells to line up and look good involve using attributes that are not a part of the current document type definition (HTML 4.01 Strict).

Finally, combine ignorance with the proliferation of cheap WYSIWYG editors, and most teenagers with enough time to play with the software possess nearly marketable skills as web designers. A person with an intuitive sense of graphical design and psycho-visual concepts could produce amazing-looking web sites using absolutely terrible markup (these designers rarely even look at the markup, so it's not at all their fault).

When the web was still a new concept, I think a lot of companies just told one of their employees to, "Get us one of those web sites." With very little knowledge themselves, the person in charge of arranging a new web site probably hired their computer-savvy nephews or daughters or other young relatives to build the web site. And, since it was an ancillary aspect of most businesses back then, a FrontPage site with mind-bogglingly bloated code soup that rendered reasonably well in Microsoft browsers was good enough (never mind the fact that some of those sites contain about 1000 times more markup than actual information).

So what are a few document types?

A document type defines the tag names and basic markup constructs that are allowed to occur in a document of that type. This can include syntax conventions as well, such as requiring XML-style attributes instead of allowing the looser HTML-style attribute shorthand. The following is a list of some document types and the kind of code that you might see under that DTD.
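As a rough reference, these are the declarations for four of the W3C document types in common use, each followed by a line or two of the kind of markup you would expect under it. Treat this as a listing of alternatives and not as a single document; the form field name "subscribe" is made up for illustration:

```html
<!-- HTML 4.01 Strict -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<!-- HTML syntax: empty elements have no closing slash; attributes may be minimized -->
<br>
<input type="checkbox" name="subscribe" checked>

<!-- HTML 4.01 Transitional (still permits deprecated tags such as "font") -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<font color="#0000CC">Deprecated, but allowed here</font>

<!-- XHTML 1.0 Strict -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- XML syntax: lowercase names, every element closed, every attribute given a quoted value -->
<br />
<input type="checkbox" name="subscribe" checked="checked" />

<!-- XHTML 1.0 Transitional (XML syntax, deprecated tags still permitted) -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<font color="#0000CC">Deprecated, but allowed here</font>
```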

Examples

I think the easiest way to demonstrate the reasons and practices of using modern markup is through some simple examples of things you'll probably find on many current web sites that still use the old markup. I'll list two samples of code followed by a side-by-side comparison of both samples.

The 1-pixel Border Box with a Title

I've seen this trick on hundreds of different sites and have even used the trick a few times myself (both the wrong and correct ways).

Old HTML Layout

<table width="100%" cellspacing="0" cellpadding="1"
    bgcolor="#000000" border="0">
    <tr>
        <td>
            <table width="100%" cellspacing="0" cellpadding="3"
                bgcolor="#FFFFFF" border="0">
                <tr bgcolor="#DDDDDD">
                    <td>
                        <font size="+2" color="#0000CC" face="Arial">
                            Box Title
                        </font>
                    </td>
                </tr>
                <tr>
                    <td>
                        <font color="#000000" face="Times New Roman">
                            This is some information.
                        </font>
                    </td>
                </tr>
            </table>
        </td>
    </tr>
</table>

XHTML 1.0/HTML 4.01*

<div class="fancy">
    <h2>Box Title</h2>
    <p>This is some information.</p>
</div>

Side-by-Side

Size (bytes)
    Old markup: 796
    Current markup: 89 (89% savings)

Layout Data
    Old markup: Downloaded by the browser each time it views the information.
    Current markup: Downloaded once (for the site); the browser then remembers how to lay out the information as it changes from page to page.

Layout Format
    Old markup: Visual layout is inextricably convolved with the information.
    Current markup: Visual layout is completely separate from the information, allowing us to easily change the layout without touching the information.

Presentation
    Old markup: Only suitable for color, graphical web browsers on computers with a monitor larger than 14 inches (may require a larger monitor if you have used pixel specifications of size instead of proportional specifications).
    Current markup: Used with a simple style sheet, this is suitable for color, graphical browsers, old browsers (such as Mosaic and early versions of Netscape), black and white, text/console browsers (such as Lynx), full-page printers, tape/card printers, PDA browsers, cell phone browsers, audible screen readers (for the visually impaired), and all modern web browsers on all platforms.

Indexing
    Old markup: Search engine spiders are unable to "weigh" the importance of the box's title any heavier than the content. This means that your web site has "weaker" indexing for all words on the page instead of "stronger" indexing for only the important words on the page.
    Current markup: Since the markup indicates what kind of information is being displayed, search engine spiders can easily index only the important words on your page, which allows search engines to generate more suitable hits for your site.

Old Browsers
    Old markup: Older browsers may render the information, but the layout could be lost and make the information confusing.
    Current markup: Older browsers won't bother trying to render the layout elements (since we're using a style sheet). But, since we're using document structure-oriented tags, what shows up in the older browser will, at least, make sense.

Future Browsers
    Old markup: Future web browsers may drop support for some of the tags used in this document, with unexpected rendering results (if they render it at all).
    Current markup: Future browsers will always support the basic document structure illustrated here.

* It's important to note that the information needed to duplicate the colors, backgrounds, and borders of the box is supplied by a separate document: a style sheet. If a person views only one page of your site, the total amount downloaded (markup plus style sheet) is usually about the same as with the old markup. However, if a visitor views more pages, the style sheet is already cached by the web browser, so only the actual content is downloaded for each new page. On some web sites, the savings grow with every additional page the visitor browses.
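The style sheet itself isn't reproduced in this article, so what follows is only a sketch of rules that should roughly approximate the look of the old table version. The colors and fonts are copied from the old markup above; the selectors assume the "fancy" class from the current markup, and the 150% heading size is my approximation of the old size="+2". I've wrapped the rules in a "style" element to keep the example self-contained; on a real site they would live in a separate file referenced from each page with a "link" element, which is what lets the browser cache them.

```html
<style type="text/css">
/* Outer box: replaces the outer table that faked a 1-pixel black border */
div.fancy {
    border: 1px solid #000000;
    background-color: #FFFFFF;
}

/* Title bar: replaces the gray table row and its "font" tag */
div.fancy h2 {
    margin: 0;
    padding: 3px;
    background-color: #DDDDDD;
    color: #0000CC;
    font-family: Arial, sans-serif;
    font-size: 150%;
    font-weight: normal;
}

/* Body text: replaces the second table row and its "font" tag */
div.fancy p {
    margin: 0;
    padding: 3px;
    color: #000000;
    font-family: "Times New Roman", serif;
}
</style>
```

Changing the look of every box on the site then means editing these few rules once, without touching any of the documents that use the "fancy" class.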

The most important statistic to most people involved in the process of browsing the site is the size of the document. Here is just a small list of things that improve with smaller web documents:

Certainly, this example is one of the more extreme examples, but it illustrates just one of the ways that a modern document is superior to the old HTML "hacks."

Graphical Enhancements

Almost all web sites that try to look fancy use graphics in one form or another to make the site more appealing to the eye. Unfortunately, this has led most sites to ignore all the other ways in which their site can be "read." An extreme example that most people point out is that some visually impaired people use screen readers to read the information in a web site to them. They aren't seeing the site at all, but they still want to be able to "read" the information on the site. This is one of the reasons why markup that is specific to fonts and visual display (like the "b" tag, which indicates "bold") is being abandoned for more generically-termed tags (like the "strong" tag, which indicates that a phrase is important compared to the surrounding text).
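The difference is easy to see side by side. The sentence here is a made-up example:

```html
<!-- Presentational: says only "draw this in bold," which means nothing to a screen reader -->
<p>Do <b>not</b> unplug the server.</p>

<!-- Structural: says "this word is important"; each browser decides how to convey that -->
<p>Do <strong>not</strong> unplug the server.</p>
```

A graphical browser will typically render both lines the same way, but only the second tells a screen reader, a text browser, or a search engine spider anything about the word itself.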

When a lot of designers lay out their sites, they start by creating the graphics that will go around the edges, in the background, or around little sections of the document on each page. What distinguishes designers who use outdated techniques is that they include those graphics in their web site using HTML markup; the modern techniques involve little if any HTML markup--that's the point.

If you have a graphic that goes along the top of every page on your site, the last thing you want to use is an "img" tag. The "img" tag should only be used to display images that are actual components of the information you are communicating with that particular page. For instance, if you're writing an article about rats, you should probably throw in a few pictures of various rats in the article. Those pictures are just as much a part of the information as the headings and paragraphs of the article.

However, if you host your article about rats on a biology web site, the graphics that are used to make your page look like the rest of the site have nothing to do with your article. They're not a part of the information in your document. This means that those graphics are used purely for layout enhancement. All elements that simply enhance the layout to make the web site more visually appealing or even those that make the page more organized and easier to read, should be included by other means than markup.
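Here is a sketch of that distinction using the rat article. All of the file names, text, and dimensions are hypothetical:

```html
<!-- Outdated: the site's banner is a layout graphic, yet it sits in the markup -->
<img src="/images/site-banner.gif" width="760" height="90">

<!-- Current: the only image in the markup is one that is part of the article -->
<h1>Common Rats</h1>
<p>The brown rat is found nearly everywhere that people live.</p>
<p><img src="brown-rat.jpg" alt="A brown rat standing on its hind legs" /></p>
<!-- The site banner does not appear in the markup at all -->
```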

Actually implementing this separation of markup and layout is tricky for a lot of designers since web browsers are still a little inconsistent in their implementation of the standards (especially when it comes to any Microsoft product). To achieve this separation, it is still possible that you will need to alter the markup to achieve certain visual effects. However, there is a difference between adding a little document structure to accommodate the possibility of more layout capabilities and adding markup to your document that has nothing to do with the information that is being presented on the page. I explore this area of confusion in depth in another article.

I can tell you this, though: almost all layout graphics can be added to a web page's layout using the CSS "background" property. Again, this means that every time you want to change how your site looks, you do so in a style sheet and not in the markup.
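As a minimal sketch, assuming a hypothetical banner graphic about 90 pixels tall stored at /images/site-banner.gif, a single rule can place it along the top of every page that uses the style sheet. I've wrapped the rule in a "style" element so the example stands alone; normally it would sit in the site's shared style sheet file:

```html
<style type="text/css">
body {
    /* White page with the banner drawn once, centered along the top edge */
    background: #FFFFFF url("/images/site-banner.gif") no-repeat top center;
    /* Push the content down so it does not overlap the banner */
    padding-top: 100px;
}
</style>
```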

In practice, a web site's "template" is usually implemented through some sort of templating software or mechanism on the web server. This forms a software-based solution that keeps browsers ignorant about how your original document actually looks and "wraps" it in a nice, pretty layout that adds more (hopefully, structural) markup and manages style sheets. All of my web site solutions involve a template solution in one form or another. Many templating systems also allow more advanced features for your document such as automatically generating links to specific areas of the document, adding links to external resources based on phrases in your document, highlighting words in your document based on a search query used to find the document, and many other features.

Conclusion

This article was written for those who wonder why the newer markup standards are better. Basically, if you want your information to last a long time and be effective in browsers and software that we haven't yet thought of, it's important that you update how you write or generate markup. Not only does this make your information more useful and forward-compatible, but it also makes it more accessible and easier to serve and browse. Could you imagine how much faster the Internet would be if all the web sites in the world all of a sudden shed 80% of the bulk of their documents while still allowing visitors to read the same wealth of information?

If you would like more information about the standards for web documents, the World Wide Web Consortium is the definitive resource. Even if the articles on their site are a little over your head, they provide links to other sites with more pedestrian explanations of everything you need to know to bring your level of design standards into the 21st century and possibly beyond.

Last Modified: April 26, 2005