Running a French Holiday Gite in Rural Brittany

Wednesday, March 19, 2008

Fixing HTML errors - getting our Gite website to be W3C valid HTML

A couple of weeks ago I wrote about getting the FluffySearch search engine to work on my website, and concluded with the realisation that I'd been missing off the quotes around filenames in <a> and <img src> tags, which made the HTML invalid.

As I set about correcting what I thought were simple, repeated HTML errors across the website, I found that I actually had a whole host of underlying HTML problems that needed fixing. None of these are visible when you look at the site with Firefox, Internet Explorer, etc., as web browsers are generally fairly tolerant of invalid code, but I still wanted to get them fixed.

At the top of each HTML page is a document type declaration (DOCTYPE) that states which version of HTML (or XHTML) the page is written in. It looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

This says that I'm using HTML 4.01 STRICT for this webpage. Of the different versions of HTML, 4.01 is the most recent, and STRICT is more rigorous about what HTML is allowed, unlike TRANSITIONAL, which is more of a half-way house between HTML 4 and the previous major version, 3.2.
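
To put the DOCTYPE in context, here is a minimal sketch of a complete page that should validate as HTML 4.01 Strict. Only the DOCTYPE line comes from the text above; the title, character set and paragraph text are made up for illustration.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
  <!-- The charset is an assumption; use whatever encoding the page is really saved in -->
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <!-- TITLE is a required element in HTML 4.01 -->
  <title>Example gite page (illustrative)</title>
</head>
<body>
  <!-- In Strict, text cannot sit directly in BODY; it needs a block element such as P or DIV -->
  <p>Welcome to our gite.</p>
</body>
</html>
```

Pasting a page like this into the W3C validator is a quick way to confirm the DOCTYPE is being picked up before tackling the errors on a real page.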

When I set about redesigning the website back in 2006 I found that I ended up with display and alignment errors in particular browsers (especially Internet Explorer) if I used transitional HTML, as IE then reverted to "quirks" mode instead of "standards" mode. I therefore decided at the time to adopt strict HTML to stop this happening, and chose HTML 4.01 as it was the most recent version of HTML.

In hindsight some of this logic is probably flawed or irrelevant, but as I don't want to go through the fun and games I had with the site layout again, it's easier to push on and just get back to valid HTML.

Almost all the errors are, I'm sure, due to moving from HTML 3.2 to HTML 4 STRICT, and of course much of the pain I've just had would have been avoided if I hadn't been lax for the last year or so and had actually run the website through the W3C validator ... but as they say, we are where we are.

The full definition of HTML 4 strict is on the W3C website, but it's fairly incomprehensible (IMHO), so to help others who may be embarking on adopting HTML 4 strict, here are the various problems I found and the changes I made to fix them. Also useful is this summary of HTML 3.2 constructs not in HTML 4.

  • Missing quotes around image source references
    <IMG src=/theme/mast_mos.jpg> produces an error message "an attribute value must be a literal unless it contains only name characters".
    The problem is that the pathname to the image contains a character (the slash) other than letters, digits, hyphens, full stops, underscores or colons, so it must be enclosed in quotes and re-written as <IMG src="/theme/mast_mos.jpg">

  • Missing quotes around anchor link and form link references
    <a href=http://blahblah/blah > produces an obscure error message "NET-enabling start-tag requires SHORTTAG YES", and then "end tag for element "A" which is not open" from the closing </a> tag.

    Like the above problem with <img> tags, the URL reference must be enclosed within quotes when it contains anything other than letters, digits, hyphens, full stops, underscores or colons (here, the slashes), and so the anchor tag needs to be changed to <a href="http://blahblah/blah" >.

    Similarly <form Method=POST action=http://whatever/whatever> needs to be re-written as <form Method=POST action="http://whatever/whatever">

  • Ampersands within URL links
    Some of the URL links to external websites contain a raw ampersand character, which isn't valid HTML. From a single link such as <a href=http://blahblah?id=rss&ut=http://whatever> you get a slew of error messages, as the validator tries to interpret the text following the ampersand as an HTML entity reference:
      # cannot generate system identifier for general entity "ut".
      # general entity "ut" not defined and no default entity.
      # reference to entity "ut" for which no system identifier could be generated.
      # entity was defined here.

    The problem is described in WDG's article on ampersands in URLs and is easily fixed by changing the & to &amp; like this:
    <a href="http://blahblah?id=rss&amp;ut=http://whatever">

  • Using the 'target=_blank' attribute to open links in a new window
    In HTML 4.01 Strict, XHTML 1.0 Strict and XHTML 1.1 the target attribute has been removed, as target was designed to be used with the frame-based DTDs.

    The upshot of this is that there's no obvious way with any of these DTDs to make a link open in a new window. I use this functionality quite a bit, as there are a number of links to external websites (e.g. Brittany Tourist information) and I wanted these to open in a new window so that the visitor could still easily return to my own Gite website.

    There are some really kludgy ways round this problem suggested on the web, including using Javascript to dynamically add a target attribute to the <a> tag at page load time. That just means that whilst the page validates OK using the W3C validator, all that's really happening is that you're causing the page to become invalid when it's rendered in the browser ... this is just hiding, not fixing, the problem.

    I then found another article that suggests creating your own custom DTD by taking the base HTML 4.01 strict DTD (in my case) and then merging in the target attribute DTD from the frames DTD - this sounded so complicated I really didn't want to go down that route. I was trying to find a way of making standard valid HTML rather than creating some new bastardised version of HTML to get around the problems with the standards!

    The final, and really elegant, solution, which I decided to adopt in the end, was buried in comment #19 on Jesse Skinner's post about using Javascript to change <a> tags at page load time. It replaces
    <a href="http://www.wherever.com/somepage.html" target="_blank">

    with an embedded onclick bit of Javascript:
    <a href="http://www.wherever.com/somepage.html" onclick="target='_blank';">

    And this is not only valid HTML, it's valid Javascript, and it works a treat in all the browsers I tested it in - job done!

    The only downside is that if Javascript's not enabled in the visitor's browser then onclick is ignored, and when the link's clicked it opens the (external) page in the same browser window rather than a new one. For me this is a minor issue and one I can definitely live with.

  • Width and Height declarations on table cells
    <TD valign=bottom width=270> is no longer valid and, as Joe2Torial's explanation of HTML tables puts it,
    "This code would not validate in XHTML or HTML 4.01 Strict because the width and height attributes are removed in favour of using stylesheets (CSS)".

    The fix is to change the TD tag to <TD valign=bottom style="width: 270px;"> or (better) define a CSS class and apply it wherever needed.

  • Bullet style definition for unordered lists
    <UL><LI class=p1 type=circle>blah blah</LI></UL> produces the error message 'there is no attribute "TYPE"' as the 'type' and 'value' attributes have been deprecated in HTML 4.01 in favour of styling the list using a CSS style sheet.

    I therefore introduced a new class into my CSS style sheet:
    .circl {
      list-style-type: circle;
    }

    And then applied this class to the unordered list:
    <UL class=circl><LI class=p1>blah blah</LI></UL>

    (The class=p1 entry against each list item is to add 4 pixels of bottom padding to each entry, in effect to create line-and-a-half spacing of the list which aids readability).

  • Forms and Blockquotes must wrap their content in a block element
    On my website I had forms (for making booking enquiries and for subscribing by email to new articles on this blog) defined like this:
    <FORM Method=POST action=http://blahblah>&nbsp;<INPUT name=EMAIL maxlength=255 ... > ... etc

    and similarly blockquotes (for indented paragraphs such as our address) defined as:
    <BLOCKQUOTE>2 The Slade<BR>Wrestlingworth<BR> ... etc

    These produce a slew of error messages such as 'character data is not allowed here' and 'document type does not allow element "INPUT" here; missing one of "P", "H1", "H2", "H3", "H4", "H5", "H6", "PRE", "DIV", "ADDRESS" start-tag'.

    The problem is that the syntax of BODY, BLOCKQUOTE and FORM has changed in HTML 4 strict so that their content must be wrapped in a block element such as <DIV> or <P>, rather than sitting directly inside them.

    So the errant input form example above now becomes:

    <FORM Method=POST action="http://blahblah"><DIV>&nbsp;<INPUT name=EMAIL maxlength=255 ... > ... etc

  • Using CSS to style horizontal rules
    The styling attributes for <HR> tags (WIDTH, SIZE, ALIGN and NOSHADE) have all been deprecated in HTML 4 in favour of using CSS to set the size and style of the horizontal rule.

    So instead of <hr size=1 align=left width=400> I had to change the HTML to <hr style="height:1px; text-align:left; width:400px;">.

    There are a number of similar deprecations of other HTML tags and attributes in favour of using CSS instead, and I found a good summary of CSS alternatives to HTML4 deprecated tags over on Cavalcade of coding.
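
Pulling the fixes in the list above together, here is a minimal sketch of a single page that should validate as HTML 4.01 Strict. It is an illustration rather than a copy of the real site: the class names w270 and rule, the alt text, the link text, the submit button and the character set are made up, while circl and p1 are the classes described above. The margin-left on the rule is an extra assumption, added because some browsers position an <hr> by its margins rather than by text-align.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <title>HTML 4.01 Strict fixes (illustrative page)</title>
  <style type="text/css">
    /* Replaces <TD width=270>; w270 is a hypothetical class name */
    .w270 { width: 270px; }
    /* Replaces <LI type=circle> */
    .circl { list-style-type: circle; }
    /* 4 pixels of bottom padding on each list item */
    .p1 { padding-bottom: 4px; }
    /* Replaces <hr size=1 align=left width=400>; rule is a hypothetical class name.
       margin-left is an assumption for browsers that ignore text-align on HR */
    hr.rule { height: 1px; width: 400px; text-align: left; margin-left: 0; }
  </style>
</head>
<body>
  <!-- Quoted src; Strict also requires an alt attribute on IMG -->
  <p><img src="/theme/mast_mos.jpg" alt="Masthead image"></p>

  <!-- Quoted href, ampersand written as &amp;, new window requested via onclick -->
  <p><a href="http://blahblah?id=rss&amp;ut=http://whatever" onclick="target='_blank';">External link</a></p>

  <!-- Cell width moved from the width attribute to a CSS class -->
  <table>
    <tr><td valign="bottom" class="w270">Cell text</td></tr>
  </table>

  <!-- Bullet style moved from type=circle to a CSS class on the list -->
  <ul class="circl">
    <li class="p1">blah blah</li>
  </ul>

  <!-- Form controls wrapped in a DIV, with the action URL quoted -->
  <form method="post" action="http://blahblah">
    <div><input type="text" name="EMAIL" maxlength="255"> <input type="submit" value="Subscribe"></div>
  </form>

  <!-- Blockquote text wrapped in a P -->
  <blockquote><p>2 The Slade<br>Wrestlingworth</p></blockquote>

  <!-- Rule styled entirely from the stylesheet -->
  <hr class="rule">
</body>
</html>
```

Moving the widths, bullet style and rule styling into classes rather than inline style attributes means each one only has to be changed in one place if the design changes later.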

And so by the end of all this if you've managed to keep reading this far (thanks!) you'll be wondering if my quest is complete and everything's now sorted out across the website with perfect HTML 4.01.

In a word, NO.

All these fixes and changes have been implemented on all the different pages of the website and every one of the pages now validates as HTML 4.01 strict, apart from just two.

Firstly, the homepage has a problem with the recent HTML fix I introduced to center-align the GitesDeFrance image in Firefox, which I haven't been able to find a way to recode around. Secondly, the Flash-based PictoBrowser that shows an image gallery of Flickr photos gets rejected as the <EMBED> tag isn't valid HTML.

However I think I'll rest on my laurels for just a day or two (well more likely a few weeks!) before attempting again to crack these two obstinate HTML problems ...

Update July 08: Rewrote the Pictobrowser HTML to make it W3C standards compliant - almost got the whole site sorted now!

1 Comments:

  • Rewrote the Pictobrowser HTML too! Got it XHTML valid.
    Could you explain the typeface and styling of your display? Check the source code at http://peterthijs.nl/tour.

    By Anonymous peter thijs, at May 02, 2009  
