Yet More HTML

Basic Terms

HTML `elements' are the individual components that a web page is built up from, such as headings or list items.
`Tags' are the individual bits of HTML embedded in the document. A tag is enclosed in "angle brackets": < and >. Most elements, such as a heading, need two tags: an opening tag and a closing one.
<H1>Heading</H1>
In HTML some elements, such as HR, only need one tag.
`Attributes' allow you to specify or fine-tune the behaviour of each instance of an element, by including supplementary information. For example, you have seen how the IMG element should have four attributes specified: SRC, WIDTH, HEIGHT and ALT. Other attributes are only used for occasional effect, e.g.:
<HR>
<HR WIDTH="90%">
<HR WIDTH="80%" SIZE="20" ALIGN=right>
Numerical values for attributes should be enclosed in quotes.

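HTML element and attribute names are not case-sensitive, which means that it shouldn't matter whether you type them in capital or small letters in the source code. (Some attribute values, such as file names in URLs, may still be case-sensitive.) A common convention is to use capital letters for HTML elements and lower case for attribute values.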
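
As a small illustration, a browser should treat the following three lines identically; the third is only there to make the point and is not a style to copy:

```html
<!-- Element and attribute names may be written in any mix of case -->
<HR WIDTH="90%">
<hr width="90%">
<Hr Width="90%">
```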

XML is case-sensitive, and XHTML requires all element and attribute names to be in lower case. Although in principle it would be better to deploy web pages that meet the XHTML specification, it is much harder to find and correct small errors when using simple tools like pico. There are also many other requirements that the code has to meet to be correct; for example, all elements must have a closing tag: not just p and li, but even tags like <HR> must be rewritten <hr></hr> (or <hr /> for short). Managing these is best left to web authoring tools, such as Dreamweaver, which you will be using next semester.
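
To give a rough feel for the difference, the same fragment might be written in the two styles as follows. This is only a sketch: the file name logo.gif and the text are made up for the example, and a real XHTML page has further requirements not shown here.

```html
<!-- HTML 4 style: upper-case names, optional closing tags omitted -->
<P>First paragraph
<HR WIDTH="90%">
<IMG SRC="logo.gif" WIDTH="100" HEIGHT="50" ALT="Logo">

<!-- XHTML style: lower-case names, every element closed, all values quoted -->
<p>First paragraph</p>
<hr width="90%" />
<img src="logo.gif" width="100" height="50" alt="Logo" />
```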

The META Tag

So far we have concentrated on internal mark-up, that is, HTML tags to allow a browser to correctly interpret and display the contents of the current document to a human user. The META tag is designed as an extensible way to allow HTML documents to communicate with a variety of outside agents, such as search engines or web caches. The META tag should appear in the HEAD of your page.
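
A minimal sketch of where the tag sits is shown here. The title, description and body text are placeholders invented for the example.

```html
<HTML>
<HEAD>
<TITLE>Page title</TITLE>
<!-- META tags go here, alongside TITLE, and never in the BODY -->
<META NAME="description" CONTENT="A one-line summary of the page">
</HEAD>
<BODY>
<P>Page content</P>
</BODY>
</HTML>
```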

Search engines

There are a number of ways that we can influence how a web page is indexed by search engines, and thus how likely it is to be seen by interested surfers. The first is by robot control (sadly unrelated to Robot Wars!). Many search engines automatically surf the web, following links between sites and classifying the pages they find. The software agent that browses the Web and thus generates the search engine's database is known as a `robot', `crawler' or `spider' (e.g. AltaVista's robot is named "Scooter"). Given that search engine databases can be retained for many years, if you have a set of pages that is regularly changed or moved you may want to avoid having it indexed, so that people who find your site through a search engine are not sent to an irrelevant or even non-existent page. This is done with the META ROBOTS tag:

<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">

Robots may traverse this page but not index it. Different search engines will support different variations, or may even ignore the tag completely. In the case of AltaVista the following options are supported:

NOINDEX prevents anything on the page from being indexed.
NOFOLLOW prevents the spider from following the links on the page and indexing the linked pages.
NOIMAGEINDEX prevents the images on the page from being indexed but the text on the page can still be indexed.
NOIMAGECLICK prevents the use of links directly to the images; instead there will only be a link to the page.
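
The options can be combined in a comma-separated list. The two variations sketched here are alternatives, so a page would carry only one of them. Note that the INDEX value comes from the general robots META convention rather than from the AltaVista list above, so treat its support by any particular engine as an assumption.

```html
<!-- Do not index this page, and do not follow its links either -->
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">

<!-- Index this page, but do not follow its links -->
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
```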

Even if the site has been correctly indexed, any simple search is likely to return tens or hundreds of hits on pretty well any topic. Two ways to improve the chances of our site actually being chosen by the user are to provide keywords and a short description:

<TITLE>The Doughnut Shack On-Line</TITLE>
<META NAME="description" CONTENT="Selling doughnuts over the Web!">
<META NAME="keywords" CONTENT="doughnuts, custard doughnuts, lard">

The description can be displayed by the search engine as a single line along with the URL and TITLE of your page, to explain to the surfer why they should visit. If no description is included the search engine will include the first line of "content" on the page, which is often cryptic scripting or unhelpful navigational information.

To ensure that the viewer sees the link to your page in the first place, you can use the keywords option: search engines should rank pages whose keywords match the search terms more highly than those whose keywords don't. Embedded keywords are another example of the phenomenon by which concepts of limited but significant utility are promptly misused (either deliberately or through ignorance) and thus rendered nearly useless by the resulting mistrust: a lot of pornography sites attempted to "hijack" surfers by registering pages with what amounted to a small dictionary of keywords, to make their (irrelevant) site score highly in a search on almost any topic. As a result many search engines no longer use keywords, or will only accept pages with a very small number (3-5); you should check on this when submitting your website to a particular search engine. There are also variations in how the keywords are interpreted: the example above should be read as three keywords, the middle one being "custard doughnuts". The singular and plural must normally be given separately.
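
If a search engine does treat singular and plural as different keywords, the doughnut example could be spelled out as sketched here. The four-keyword list is only an illustration; whether it fits within a given engine's limit is something to check.

```html
<!-- Four comma-separated keywords: each singular and plural listed separately -->
<META NAME="keywords" CONTENT="doughnut, doughnuts, custard doughnut, custard doughnuts">
```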

HTTP Equivalents

A number of useful features are normally controlled by the server using the HTTP protocol by which the HTML files are downloaded. META tags with the HTTP-EQUIV attribute are equivalent to these HTTP headers and may be used to refine the information provided by the actual headers. Such tags would have an equivalent effect if specified as an HTTP header, and in some servers they may actually be translated to actual HTTP headers automatically or by a pre-processing tool.

Typically, they control the action of browsers or caches. Some examples include:

Refresh and Redirection

We can have the browser wait for the specified period of time (in seconds, here 3) and then automatically load in either the same page again (for rapidly changing information) or move to a completely different page:

<META HTTP-EQUIV="Refresh" CONTENT="3;URL=http://www.site.com/page.html">
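
The example above moves to a different page. To reload the same page instead, the URL part can simply be left out. The sketch after that shows a redirecting page that also offers an ordinary link, on the assumption that some browsers may ignore the Refresh tag; the 60-second period and the wording are made up for the example.

```html
<!-- Variant 1: reload the current page every 60 seconds -->
<META HTTP-EQUIV="Refresh" CONTENT="60">

<!-- Variant 2: a complete redirecting page with a fallback link -->
<HTML>
<HEAD>
<TITLE>Page moved</TITLE>
<META HTTP-EQUIV="Refresh" CONTENT="3;URL=http://www.site.com/page.html">
</HEAD>
<BODY>
<P>This page has moved. If nothing happens after a few seconds, follow this
<A HREF="http://www.site.com/page.html">link to the new page</A>.</P>
</BODY>
</HTML>
```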

Expiry

You can specify how long the information in the current page is to be valid for. This allows the document to be flushed from caches automatically so that the viewer will always see valid data:

<META HTTP-EQUIV="expires" CONTENT="Thu, 26 Feb 2004 08:21:57 GMT">

Avoid Caching

For dynamic content you may just prefer the page not to be cached at all:

<META HTTP-EQUIV="Cache-Control" CONTENT="no-cache">
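
Cache-Control belongs to HTTP/1.1, so older software may not understand it. A common belt-and-braces sketch adds the older Pragma form as well. Whether a particular browser or cache honours either tag is an assumption you should test.

```html
<!-- HTTP/1.1 form -->
<META HTTP-EQUIV="Cache-Control" CONTENT="no-cache">
<!-- Older HTTP/1.0 form, for software that predates Cache-Control -->
<META HTTP-EQUIV="Pragma" CONTENT="no-cache">
```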

Cookies

At least in Netscape Navigator it is possible to set up a cookie from within an HTML document:

<META HTTP-EQUIV="Set-Cookie" CONTENT="cookievalue=hobnob;expires=Saturday, 31-Dec-05 23:59:59 GMT; path=/">

Cookies created with an expiry date are considered "permanent" and will be saved to disk on exit.

Defining the character set in use

Browsers in different countries may interpret a particular character as different symbols or accented letters. To make sure that the page is displayed correctly you should state which character encoding you are using. Currently the most common is "UTF-8":

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">

Note that this tag describes the character set used in the source code, so you can still insert other characters by using the appropriate character entity code, e.g. &Eacute; for É. In the future there is likely to be increasing direct use of the Unicode character set, especially with non-Roman alphabets, but backwards compatibility with older browsers remains a problem.
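
As a short sketch of entity codes in use, the paragraph here keeps its source in plain ASCII while still displaying accented letters and a pound sign. The sentence itself is made up for the example.

```html
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">

<!-- &eacute; and &egrave; are named entities; &#163; is the numeric form of &pound; -->
<P>Caf&eacute; cr&egrave;me costs &pound;2 (or &#163;2 written numerically).</P>
```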

The META tag also provides many other possibilities, e.g.

Prevent Microsoft XP from adding "Smart Tags" to your page when displayed:

<META NAME="MSSmartTagsPreventParsing" CONTENT="TRUE">

Image Maps

One very convenient aid to navigation is the image map, which allows the user to select a link by clicking in the appropriate location within an image. In the original implementation, known as `server-side' image maps, the browser sent the location on which the mouse was clicked back to the web server which then decided which document should be returned. Newer browsers and versions of HTML implement an alternate form, `client-side' image maps, that avoid the extra net traffic and the need to run scripts on the server; we will only discuss client-side maps.

The underlying principles of both types are the same: we need to associate areas of the image with each desired URL. We do this by defining zones of simple shapes (circles, rectangles and polygons) within the image, each pointing to a single URL. Since more than one zone can point to a given URL, we can make up more complex shapes by combining simple ones.

We embed a client-side image map the same way as any other image, but with a new attribute:

<IMG SRC="image.gif" WIDTH="250" HEIGHT="150" ALT="Sample Map" USEMAP="#mapname">

USEMAP indicates a URL that includes the named map data - the layout of the zones across the image. This data is usually embedded in the current document, so we only need the bookmark name as above.

The map data is given by a - surprise, surprise - MAP tag, which defines the map's name and a series of areas in the image and the URLs that they point to:

<MAP NAME="mapname">
<AREA SHAPE="CIRCLE" COORDS="100,100,50" HREF="first.html">
...
</MAP>

The AREA tags define a series of regions of the image in turn and assign a URL to each one. The SHAPE attribute determines the shape of the selected block, and the COORDS attribute gives the relevant co-ordinates needed to define the size and position of the block within the image. SHAPE can take the following values:

RECT or RECTANGLE - Rectangular region - COORDS="left, top, right, bottom"
CIRC or CIRCLE - Circular region - COORDS="centre_x, centre_y, radius"
POLY or POLYGON - Polygon region - COORDS="x1, y1, x2, y2, ..., xn, yn"

RECT and CIRC should be self-explanatory. POLY lets you define an arbitrary polygon by giving the co-ordinates of each vertex in turn. The shape should be closed, so if xn,yn is not the same as x1,y1 then the browser will connect them itself. Browsers can be finicky about the shape names, so stick to upper case. Netscape seems to prefer "CIRCLE" to be spelled out in full.
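
As a sketch of the POLY form, the area here is a triangle given by its three vertices in turn. The map name and target file are invented for the example. The last vertex is not repeated, leaving the browser to close the shape.

```html
<MAP NAME="shapes">
<!-- Triangle with vertices at (50,10), (90,90) and (10,90) -->
<AREA SHAPE="POLY" COORDS="50,10,90,90,10,90" HREF="triangle.html">
</MAP>
```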

The newest browsers/HTML also include a new shape: DEFAULT. This takes an HREF like any other AREA but needs no COORDS, since it stands for the whole image; its URL is therefore used for any areas of the image left undefined. The defined areas may overlap: the areas are checked, in the order that they appear in the MAP tag, to see if they enclose the actual mouse click. Co-ordinates are expressed in pixels, starting in the top left corner of the image.
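
The checking order matters whenever areas overlap. In the sketch here, two concentric circles share a centre, and the smaller one is listed first so that it wins inside the overlap. The file names are made up. The DEFAULT area is written without COORDS on the assumption, taken from the HTML 4.01 specification, that it covers the whole image.

```html
<MAP NAME="target">
<!-- Checked first: the inner circle wins where the two circles overlap -->
<AREA SHAPE="CIRCLE" COORDS="75,75,25" HREF="bullseye.html">
<!-- Checked second: only reached by clicks outside the inner circle -->
<AREA SHAPE="CIRCLE" COORDS="75,75,70" HREF="outer.html">
<!-- Anywhere else on the image -->
<AREA SHAPE="DEFAULT" HREF="miss.html">
</MAP>
```

Had the larger circle been listed first, the inner one would never be reached.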

As an example, suppose we want to turn the following image into a map:
[Image: Denizens of the Deep]
Clicking on the turtle will lead to turtle.html, while the fish connects to fish.html.

First we need to locate the areas we want to activate: by loading the image into either Photoshop or Microsoft Photo Editor we can get a display of the position in pixels of the cursor over the image. For now we'll just represent each critter by a single shape. In this case the turtle's shell is covered by a rectangle with its top left corner at about 116,16 and its bottom right corner at about 215,87. The fish is covered by a circle of radius 15 pixels centred at 25,130. We can then enter these co-ordinates into the map:

<MAP NAME="sealife">
<AREA SHAPE="RECT" COORDS="116,16,215,87" HREF="turtle.html">
<AREA SHAPE="CIRCLE" COORDS="25,130,15" HREF="fish.html">
<AREA SHAPE="RECT" COORDS="0,0,250,150" HREF="blank.html">
</MAP>

Note that I've included a simple rectangle covering the entire image to give a default destination. Although this overlaps the other two areas, they are checked first, so any clicks on the hotspots will be interpreted correctly (support for DEFAULT is still patchy). All that's left is to include the display of the map itself:

<IMG SRC="image.gif" WIDTH="250" HEIGHT="150" ALT="Sample Map" USEMAP="#sealife">

In practice the MAP tag can be placed anywhere in the HTML source you find convenient. It doesn't need to appear before the image that calls it, so it's often put right at the end, out of the way. You can include as many MAPs in a single document as you like, as long as they have different names. One MAP definition can be used by many separate image maps.
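
As a sketch of the last point, the same image could appear at both the top and the bottom of a long page, with both copies sharing the one "sealife" definition. The placement and the paragraph text are assumptions made for the example.

```html
<!-- Both images refer to the same MAP by name -->
<IMG SRC="image.gif" WIDTH="250" HEIGHT="150" ALT="Sample Map" USEMAP="#sealife">

<P>The main text of the page goes here.</P>

<IMG SRC="image.gif" WIDTH="250" HEIGHT="150" ALT="Sample Map" USEMAP="#sealife">

<!-- The definition can sit at the end of the document, out of the way -->
<MAP NAME="sealife">
<AREA SHAPE="RECT" COORDS="116,16,215,87" HREF="turtle.html">
<AREA SHAPE="CIRCLE" COORDS="25,130,15" HREF="fish.html">
</MAP>
```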

Two last refinements: first, we may occasionally want to create a region of the image that does nothing, not even the default action. We can define such a "cold-spot" by using the NOHREF attribute in the AREA tag instead of the usual HREF. Secondly, so far we have only provided a simple text alternative to the map as a whole - there is no way to follow the links without a graphics-enabled browser. Again, recent extensions to the HTML specification include adding the ALT attribute to the AREA tag. This allows the browser to present the list of URLs associated with an image map in some other way.
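
Both refinements might be applied to the sea-life map roughly as follows. The cold-spot in the top left corner and the ALT texts are invented for the example. The cold-spot has to come before the catch-all rectangle, since the areas are checked in order.

```html
<MAP NAME="sealife">
<AREA SHAPE="RECT" COORDS="116,16,215,87" HREF="turtle.html" ALT="Turtle">
<AREA SHAPE="CIRCLE" COORDS="25,130,15" HREF="fish.html" ALT="Fish">
<!-- Cold-spot: clicks in this corner do nothing at all -->
<AREA SHAPE="RECT" COORDS="0,0,50,50" NOHREF ALT="">
<!-- Catch-all for the rest of the image -->
<AREA SHAPE="RECT" COORDS="0,0,250,150" HREF="blank.html" ALT="Elsewhere">
</MAP>
```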

Naming HTML Entities

It is useful to get into the habit of always giving names to the various components of your web pages, such as frames, maps and forms. Unfortunately browsers can be inconsistent about issues such as case-sensitivity, so you should stick to the following guidelines:

When referring to an existing name, match upper and lower case exactly
When creating a new name, don't use capitalisation to distinguish objects
Only use a name once in a particular document - don't try to have both an image map and a form called "main", for example
Avoid spaces or symbols in names - use capitalisation to denote words instead
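
A sketch of these guidelines in practice might look like this. All of the names and file names (siteMap, orderForm, customerName, nav.gif, order.cgi) are hypothetical.

```html
<!-- Distinct names, no spaces or symbols; capitalisation marks the word boundary -->
<MAP NAME="siteMap">
<AREA SHAPE="RECT" COORDS="0,0,100,50" HREF="home.html" ALT="Home">
</MAP>

<FORM NAME="orderForm" ACTION="order.cgi" METHOD="post">
<INPUT TYPE="text" NAME="customerName">
</FORM>

<!-- The reference matches the case of the definition exactly -->
<IMG SRC="nav.gif" WIDTH="100" HEIGHT="50" ALT="Navigation" USEMAP="#siteMap">
```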

Validating your HTML

As you are probably aware, there is no such language as simply "HTML" - the language changes as new tags and features are introduced while old ones are removed. With a variety of versions of different browsers all trying to interpret various forms of HTML, it's hardly surprising that the same page can end up being interpreted differently by different browsers. To avoid the worst of these problems, the World-Wide Web Consortium (W3C) have released a series of standard definitions of HTML (we're now on version 4.01) each of which introduces new tags and features while discarding others. By making sure that the HTML you produce for the Web exactly fits a particular revision of the standard, you make it easier for the browser to interpret your code as you actually intended.

There are a number of public HTML validators available that will read through your code and ensure that it matches a particular standard. They have an important secondary purpose, in that they will also report any errors, such as unclosed tags. For locating such errors the "View Source" feature of Netscape Navigator is also useful, as it colour-codes the HTML making it easier to locate problems such as mis-spelt tags.

You can also find services that will analyse the accessibility of your site and the performance of the server.

!DOCTYPE

So if the version number of the HTML used now has to be available to the validator, you might expect that we have to introduce a new attribute to the HTML tag. The bad news is that no, this is instead done using the new !DOCTYPE tag; and the worse news is that it's a monster: the !DOCTYPE for this document is something like
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

The good news is that you don't really need to understand the details - just use the tag shown as the very first line of your HTML code for normal web pages (you need a different one for frameset pages:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN" "http://www.w3.org/TR/REC-html40/frameset.dtd">).

In fact, the tag is actually pretty straightforward: it explains that the current document is HTML defined by a publicly available standard; this standard is drawn up by the W3C and is known as HTML version "4.01 Transitional", and finally the tag gives the URL of a DTD or `Document Type Definition' which defines the possible grammar and syntax of this HTML version in gruesome detail.
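
Putting the pieces together, a minimal page to try out in a validator might look something like this sketch. The title and body text are placeholders, and whether it passes cleanly should be checked with the validator itself.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML>
<HEAD>
<!-- State the character encoding so the page is read correctly -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
<TITLE>Minimal page</TITLE>
</HEAD>
<BODY>
<H1>Heading</H1>
<P>Some text.</P>
</BODY>
</HTML>
```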

References

IndexDOT Html at http://www.blooberry.com/indexdot/ is a convenient reference resource to all the HTML tags.
http://www.webspawner.com/cc/html/alpha.html is another HTML Reference Guide.
A more complete list of META tags can be found at http://vancouver-webpages.com/META/.
You can find out more about controlling search-engine robots at http://www.robotstxt.org.
A web-based HTML validator operated by the World-Wide Web Consortium (W3C) is available at http://validator.w3.org.
There are also commercial services such as http://www.htmlvalidator.com available.
You can examine the accessibility of your site with tools such as those at http://webxact.watchfire.com or http://www.cynthiasays.com.
You can admire a graphical representation of your webpage at http://www.aharef.info/static/htmlgraph/.
S. Spainhour and R. Eckstein: "Webmaster in a Nutshell" 3rd Edition, O'Reilly ISBN: 0 596 00357 9 (2002)
A reference text covering HTML, CSS, JavaScript and server-side issues.

J.J. Nebrensky 8/08/2006
