htmlchek version 3.01, December 11 1994

Name:

htmlchek.awk, htmlchek.pl - Syntactically checks HTML 2.0 or 3.0 files for a number of possible errors; includes a rudimentary cross-reference checking capability. Runs under awk or perl.

Typical Command Lines:

awk -f htmlchek.awk [options] infile.html > outfile.check
perl htmlchek.pl [options] infile.html > outfile.check

The options are in the form "option=value" (see the sections ``Options'' and ``Language Customization Options'' below). The following is an alternative invocation of htmlchek.awk under Unix (to ensure, as far as possible, that the program is not run under incompatible ``old awk''):

sh htmlchek.sh [options] infile.html > outfile.check

(If the files htmlchek.awk, htmlchek.pl, or htmlchek.sh are not in the current directory, the pathname to where they are located will have to be prefixed -- but see ``shell scripts'' below.)

Description:

This program checks for quite a number of possible defects in the HTML (Hyper-Text Mark-up Language) version 2.0 SGML files used on the World-Wide Web. (Files with Netscape extensions, or with features from the preliminary Arena/HTML 3.0 document, can also be checked by specifying the appropriate options, as explained below.) Diagnostic messages are output to STDOUT and so generally appear on the terminal, unless they are redirected to an output file, as is done in the examples given above (of course, this all depends on the operating system -- the Macintosh doesn't even have a "command line" as such, but you can set up "droplets" with MacPerl).

Definite syntactic errors are signaled one per line, using the string "ERROR!". Stylistically deprecated HTML coding is signaled by "Warning!" (so that lines of output which signal errors and warnings all contain the character `!'). However, I don't bother checking for a few common tricks such as using headings in lists. Most error and warning messages should be fairly self-evident, assuming a familiarity with the basic HTML language documentation; the following is a basic glossary of terms used (note that tag "options" are what are called "attributes" in SGML):

An "element" is <X>...</X> (for example, <A HREF="#page2">Page 2</A>).
A "tag name" is <X...> (for example, "A" in <A HREF="#page2">).
An "option" is <...Y="..."> (for example, "HREF" in <A HREF="#page2">).
An "option value" is <...="Z"> (for example, "#page2" in <A HREF="#page2">).

One warning that may be obscure, "Jump from header level H0", means that the first heading in the file is not at level <H1>. One error that can sometimes be counter-intuitive is a ``<LI> outside list'' or ``<DT>/<DD> outside <DL>...</DL>'' error: in the sequence <UL><B><LI></B></UL>, the <LI> is actually not in the list, since it is not immediately contained within the <UL>...</UL> element (but is rather immediately contained within the non-list <B>...</B> element).
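
Since every error or warning line carries one of the two marker strings described above, a saved report can be summarized with ordinary Unix tools. The following is only a sketch: the report lines are invented placeholders (the real wording of htmlchek's messages will differ), and the file name sample.check is an assumption.

```shell
# Build a stand-in report in a scratch directory. The message texts are
# invented placeholders, NOT real htmlchek output; only the "ERROR!" and
# "Warning!" markers matter here.
workdir=$(mktemp -d)
report="$workdir/sample.check"
cat > "$report" <<'EOF'
placeholder message one ERROR!
placeholder message two Warning!
placeholder message three ERROR!
a line of file-final tag diagnostics
EOF

# Count the lines that carry each marker.
errors=$(grep -c 'ERROR!' "$report")
warnings=$(grep -c 'Warning!' "$report")
echo "errors=$errors warnings=$warnings"
```

Replacing "grep -c" with plain "grep" would list the matching lines instead of counting them.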

A very limited form of cross-reference checking (making sure that file-local <...HREF="#..."> references actually exist) is automatically performed within each file; for larger-scale cross-reference checking see the appropriate section below.

If you process more than one file at a time (by specifying multiple files or wildcards on the command line, e.g. ``perl htmlchek.pl *.html'' or ``awk -f htmlchek.awk *.html''), then errors are located by filename and line number. At the end of each file's output, diagnostics are generated as to the tags used in the file and the options used with each tag, along with possible additional global warnings (these final diagnostics/warnings can be longer than 80 columns). This program doesn't check that options such as ALIGN actually have values taken from the approved set (since no part of the evolving HTML standards is more fluid); but as an aid to typo-detection, the option=value pairs in which the value is unquoted (such as ALIGN=BOTTOM) are output as part of each file's tag diagnostics; this allows you to pick out incorrect pairs like ALIGN=BOTOMM. (In order to get a single set of tag diagnostics for multiple files, you can do something like "cat *.html | awk -f htmlchek.awk" on Unix -- however, this allows the errors in one file to affect the interpretation of following files.)

Operating System Dependency (shell scripts):

The files included in this package whose names end with ".sh" (htmlchek.sh, htmlchkp.sh, runachek.sh, and runpchek.sh) are shell scripts for greater ease of use in running htmlchek.awk and htmlchek.pl under Unix or Posix 1003.2. However, nothing in the checking programs themselves depends on the Unix operating system (with one minor exception -- see ``Limitations'' below), so that htmlchek.awk and htmlchek.pl can be run on any system where an awk or perl interpreter is available.

These Unix shell scripts are typically invoked by means of a command line of the form ``sh scriptname [script-options]'', with possible additional shell parameters such as output redirection (with the `>' character) or background execution (with the `&' character) -- e.g. "sh htmlchek.sh infile.html > outfile". If you have set execute permission on the script files (e.g. by means of "chmod +x *.sh"), and the directory where they reside is specified in the PATH environment variable, then you can omit "sh" from the beginning of the command line.

For all the shell scripts, if htmlchek.awk or htmlchek.pl is not in the current directory when the shell script is run, the environment variable HTMLCHEK should be set to the name of the directory (terminating in `/') where the program is located. See your shell documentation for information about how to set environment variables ("setenv HTMLCHEK /somedir/" in csh and tcsh, "HTMLCHEK=/somedir/; export HTMLCHEK" in sh and its offspring).
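
Since the value of HTMLCHEK has to end in `/', it can be convenient to add the slash automatically. This is a small sketch for sh and its offspring; the directory /usr/local/lib/htmlchek is purely hypothetical.

```shell
# Hypothetical install directory; substitute wherever htmlchek.awk lives.
dir=/usr/local/lib/htmlchek

# The scripts expect the value to end in "/", so append one if it is missing.
case "$dir" in
  */) ;;
  *)  dir="$dir/" ;;
esac

HTMLCHEK=$dir
export HTMLCHEK
echo "HTMLCHEK=$HTMLCHEK"
```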

These shell scripts return 1 on exit if some detectable error occurred, and output an appropriate errormessage to STDOUT; otherwise they return 0. (However, the scripts can't always detect when an error occurred in executing awk or perl.)

Options:

Options are in the form "option=value" (where the `=' should not have spaces on either side of it); options should be specified on the command line PRECEDING any names of HTML files to check (see ``Typical Command Lines'' above). Options which follow filenames will not take effect (they are silently ignored in awk, and generate an error in perl only after the preceding files on the command line are error-checked). Also, misspelled options will be silently ignored. (On Unix, the shell scripts htmlchek.sh and htmlchkp.sh automatically check for command-line option errors, so you don't have to worry about these problems.)
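
If you call htmlchek.awk or htmlchek.pl directly from your own scripts, a guard along the following lines can catch options placed after filenames before they are silently ignored. This is an illustrative sketch, not the check that htmlchek.sh performs; the function name check_arg_order is made up, and it assumes that any argument containing `=' is an option.

```shell
# Return 0 if every option=value argument precedes the first filename,
# 1 otherwise. Assumption: any argument containing "=" is an option, so a
# filename that itself contains "=" would be misclassified.
check_arg_order() {
  seen_file=0
  for arg in "$@"; do
    case "$arg" in
      *=*)
        if [ "$seen_file" -eq 1 ]; then
          echo "option after filename: $arg" >&2
          return 1
        fi
        ;;
      *)
        seen_file=1
        ;;
    esac
  done
  return 0
}

# Usage: only go on to run the checker when the ordering is sane.
if check_arg_order sugar=1 nowswarn=1 infile.html; then
  echo "argument order OK"
fi
```

Note that this guard does nothing about misspelled option names.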

Options that affect the definition of the HTML language used to interpret and check files are discussed in the ``Language Customization Options'' section below; the other options are "nowswarn=", "sugar=", "refsfile=", "append=", "dirprefix=", and "usebase=".

Output Options

The two options "nowswarn=" and "sugar=" only affect the cosmetics of the output of htmlchek:

nowswarn=1
If this option is specified, it turns off messages that warn you about inappropriate whitespace (which may confuse browsers) in low-level mark-up elements. These warning messages can be numerous enough to make it difficult to pick out other warning and error messages.
sugar=1
If this option is specified, then "filename: linenumber:" is prefixed to non-file-final error and warning messages (for compatibility with editors which use diagnostic output which is formatted in this way, from Unix tools such as ``cc'' and ``lint'').

Cross-reference Checking Options

These options are connected with details of multi-file cross-reference checking; if you intend to do such cross-reference checking using the run?chek.sh shell scripts under Unix, you can ignore these options and jump to the next section.

refsfile=``prefixname''
If this option is specified, then the references contained in the HTML files being checked are output to a file named ``prefixname.HREF'', the references to in-line images contained in the HTML files are output to a file ``prefixname.SRC'', and the destination locations specified in the HTML files are output to a file ``prefixname.NAME''. This is the first step in cross-reference checking (see the next section below). Notice that <...HREF="..."> references to non-inline images will be found in the .HREF file, not the .SRC file.
append=1
If this option is set, then if the three files specified by the refsfile= option already exist (from a previous run), they will be appended to, rather than being replaced. This is useful for cross-reference checking of files which are not in a single sub-directory tree on a single machine (see below). A blank line is added to each file at the beginning of each run, so that the output due to successive runs can be separated (but this is not preserved when running under the run?chek.sh scripts).
dirprefix=``pathname''
When "refsfile=" is also specified, then the value of ``pathname'' (which should terminate with the URL-format directory separator, `/') is prefixed to destination locations and relative URL's. This can be useful in cross-reference checking, particularly when checking files not all in the same directory.
usebase=1
When "usebase=1" is specified, the URL specified in <BASE HREF="..."> in each file is assumed to be the name of the file (and the "dirprefix=", if any, is ignored in the processing of the file after the <BASE> is found). This only takes effect after the <BASE> tag is encountered in each file, so that <BASE HREF="..."> should be the first of the tags with NAME, HREF, ID, etc. options in each file.

Cross-Reference Checking:

Here ``cross-reference checking'' does not mean generating a detailed table of what references what, nor does it mean traversing the Web and finding out whether off-site remote URL's actually exist. It only means gathering together all the locations and references in a local HTML file, or a collection of local HTML files, and finding all the locations which are unreferenced within these files, and all the references which are not to a location within this collection. You should generally delay cross-reference checking until you have more or less debugged your HTML files and corrected syntactically malformed references.

The programs htmlchek.awk and htmlchek.pl do not implement full cross-reference checking internally -- rather they only output raw lists of all the references and destination locations found in the HTML files (if the refsfile= option is specified). These raw lists can then be processed externally for the purposes of cross-reference checking. The shell scripts runachek.sh and runpchek.sh, included in the htmlchek package, implement cross-reference checking for Unix, looking at all the *.html files in a directory hierarchy, but there is no reason why cross reference checking could not also be implemented on a non-Unix operating system (if substitutes can be found for the Unix utilities which sort files, compare files, remove duplicate lines, and find all .html files in a directory hierarchy).
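
The external processing mentioned above (sorting, removing duplicates, comparing) might look roughly like the following sketch. It is not the code of the run?chek.sh scripts; the sample entries are invented, and the one-entry-per-line layout of the raw lists is an assumption.

```shell
# Work in a scratch directory with invented sample data. One entry per line
# is an assumption about the raw lists, not a documented format.
LC_ALL=C; export LC_ALL        # keep sort and comm in agreement on ordering
workdir=$(mktemp -d)
printf '%s\n' 'a.html#top' 'a.html#end' 'b.html#intro' > "$workdir/raw.NAME"
printf '%s\n' 'a.html#top' 'b.html#intro' 'b.html#intro' 'c.html#gone' > "$workdir/raw.HREF"

# Sort each list, dropping blank lines and duplicates.
grep -v '^$' "$workdir/raw.NAME" | sort -u > "$workdir/names"
grep -v '^$' "$workdir/raw.HREF" | sort -u > "$workdir/hrefs"

# Locations that nothing links to, and links that match no known location.
unreferenced=$(comm -23 "$workdir/names" "$workdir/hrefs")
dangling=$(comm -13 "$workdir/names" "$workdir/hrefs")
echo "unreferenced: $unreferenced"
echo "dangling: $dangling"
```

On a non-Unix system, any utilities that sort, remove duplicate lines, and report the lines unique to each of two files could play the roles of sort and comm here.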

The scripts runachek.sh (for cross-reference checking with htmlchek.awk) and runpchek.sh (for cross-reference checking with htmlchek.pl) should be run with the current directory set to the directory which is at the root of the tree of HTML files to be examined (for example, "$HOME/public_html"). These scripts have the following syntax (where ``directoryprefix'' and ``outfileprefix'' stand for the first two command line options, the presence of which is obligatory):

sh runachek.sh directoryprefix outfileprefix [options]
sh runpchek.sh directoryprefix outfileprefix [options]

The first parameter ``directoryprefix'' should either be the null string (''), or should end with a slash `/' (the URL-format directory separator). What ``directoryprefix'' should be specified as depends on how the files you are checking cross-refer to each other with <... HREF="..."> links. In the situation in which the files refer to each other strictly with simple relative URL's (i.e. which do not begin with "//" or "/" -- ignoring the optional access-method prefix), such as "subdir/otherfile.html#section1", then you don't need the ``directoryprefix'' mechanism, and you can get away with specifying the first parameter of run?chek.sh as the null string (and skip the rest of this paragraph). However, if you have non-relative URL's in your cross references, you need to specify a ``directoryprefix'' (note that if there are files in more than one directory, and files in subordinate directories refer to files further up the hierarchy, then you may want to use non-relative URL's, since while "../" is legitimate in a URL, relative URL's beginning with "../" can sometimes cause problems). The value to use for ``directoryprefix'' should be the string used, in cross-references among your files with non-relative URL's, to refer to the root of the tree of files being checked (i.e. the current directory when run?chek.sh is being run). Whichever type of non-relative URL your documents use for this purpose (whether a host-local reference like "/~myself/subdir/", a full reference like "//myhost.edu/~myself/subdir/", or a reference with access method like "http://myhost.edu/~myself/subdir/"), you should use that string as your ``directoryprefix''; if your files use an inconsistent mixture of these different reference types, then no single ``directoryprefix'' can work, and cross-reference checking will partially fail. Finally, if each file has its own name specified in a <BASE HREF="..."> reference, you can let ``directoryprefix'' be the null string, and use the option usebase=1.

The second parameter ``outfileprefix'' is the name of the files (with the extensions ".ERR", ".NAME", ".HREF", and ".SRC") in which the output of the HTML-checking and cross-referencing process will be put.

After these two obligatory parameters, optional parameters follow; these can be any of the "option=value" pairs discussed in the ``Options'' section above or the ``Language Customization Options'' section below (except for refsfile= and dirprefix=, which are specified within the run?chek.sh scripts).

So the following are some typical command lines (remember that putting the name of a shell script first in a command, as in the first example, implies that you have set execute permission by running chmod):

runpchek.sh http://uts.cc.utexas.edu/~churchh/ check html3=1
cd $HOME/public_html ; sh runachek.sh '' out > out &

The second example shows how cross-reference checking may be run as a background process.

If no error has occurred, then when the shell script has finished, non-cross-referencing errorcheck data is in the file ``outfileprefix.ERR''; locations in the specified files which were not referenced from the files are in ``outfileprefix.NAME'', references from the specified files which were not found in the files are in ``outfileprefix.HREF'', and references to in-line images are in ``outfileprefix.SRC''. (These last three files are the result of special processing of the original raw .NAME, .HREF, and .SRC files generated by running htmlchek with the refsfile= option in the shell script, and have overwritten the original files).

If ``outfileprefix.HREF'' and ``outfileprefix.NAME'' have file lengths greater than zero, this does not necessarily signal an error: ``outfileprefix.HREF'' will contain not only `dangling' references to local HTML files, but also references to non-inline images, and external references (to files not in the directory tree being checked, including files on other WWW sites) as well. Similarly, the file ``outfileprefix.NAME'' contains locations which are not referenced locally, but these locations might be referenced from outside the local directory tree.

It would be nice to check for the existence of local images listed in ``outfileprefix.SRC'' (and also the local non-inline images in ``outfileprefix.HREF''), but in general the references to these images are in URL format there (rather than in local filesystem format), so that there is no way to do this at the Unix shell level. However, the references in ``outfileprefix.SRC'' are sorted, and duplicate entries collapsed.

If you have several directory trees of HTML files which cross-refer, and each hierarchy needs a different ``directoryprefix'', you can still do cross-reference checking if you specify the same output files and use "append=1" as one of the options. (You can even do cross-reference checking across multiple machines, if you have an account on each machine, and transfer the cumulative .NAME, .HREF, and .SRC files to each machine before running local cross-reference checking on that machine -- of course, the ``directoryprefix'' string on each machine will have to include a hostname for this to work.)

Language Customization Options:

By default, htmlchek checks HTML files more or less according to version 1.21 of the HTML 2.0 standard (don't you just love recursive version numbers?). However, specifying the following options changes the language definition used in checking HTML files:

arena=1 or html3=1 or htmlplus=1
Specifying any of these options means that files are checked according to a preliminary (November 1994) version of the emerging HTML 3.0 specification, and not as HTML 2.0.
netscape=1
Specifying this option means that the Netscape extensions do not generate errormessages.

Since the HTML language will continue to evolve, the HTML 3.0 definition is still preliminary, and the Netscape extensions document is unclear on some points (and uses the word "tag" rather confusingly) -- therefore, the language definitions coded in the htmlchek program are clearly not cast in stone. For this reason I have provided htmlchek with the following command-line or configuration file options to customize many features of how htmlchek treats individual tags, and thus the language that is checked for:

nonpair=
Defines a tag or a list of tags as non-pairing (i.e. only <X> is encountered, never </X>). If more than one tag is to be defined as non-pairing, then they should be separated by commas: "nonpair=x,Y,z". (On the command line, there can be no whitespace on either side of the equals sign or commas; in the configuration file the syntax is less strict.) The case of the tag names does not matter, as seen in this example, but the case of the option does ("NONPAIR=..." will not work -- on VMS I think this means you'll have to quote the whole "option=value" unit).
Non-pairing tags in HTML 2.0 include <BR>, <HR>, <IMG>, and <LINK>.
loosepair=
Defines a tag or a comma-separated list of tags as optionally pairing (a <X> can be followed by a matching </X>, but need not be).
Optionally pairing tags in HTML 2.0 include <P>, <LI>, <DD>, <DT>, and <OPTION>.
strictpair=
Defines a tag or a comma-separated list of tags as obligatorily pairing (a <X> must always be followed by a matching </X>). (So "strictpair=p" would cause <P> to be checked as a paragraph container.) Most tags in HTML are of this type.
nonrecurpair=
Defines a tag or a comma-separated list of tags as obligatorily pairing, and in addition specifies that each tag is non-self-nesting -- i.e. one occurrence of an <X>...</X> element can never occur inside another occurrence of <X>...</X> (no matter how many intervening levels of structure there are). Thus since <A> is a non-self-nesting tag, the sequence <A>...<B>...<A>...</A>...</B>...</A> is forbidden.
This is a powerful technique for detecting missing closing tags, which unintentionally make an element much bigger than it should be (the other checks in htmlchek may only detect such errors much later on, possibly at the end of the document, while a self-nesting error will generally show up close to the site where the missing closing tag should be). For this reason, and because self-nesting is a mistake in almost all cases, I have defined most of the HTML 2.0 obligatorily-pairing tags as non-self-nesting in htmlchek (although this is stricter than the official standard).

If a new tag is declared with any of the preceding four options, it becomes a "known" tag to htmlchek. The remaining options below should only be applied to tags which have been declared in this way (or are already known to htmlchek), or the results may not be what you expect.

lowlevelpair=
Defines an obligatorily pairing tag, or a comma-separated list of such tags, as low-level markup. Low-level markup elements can generally only include each other (and not things such as lists, headings, paragraphs, and blockquotes).
Low-level markup tags in HTML 2.0 include <A>, <B>, <EM>, etc. (By special dispensation, the <A> element is allowed to contain <H1>-<H6> headings, though a warning is generated.)
lowlevelnonpair=
Defines a non-obligatorily-pairing tag, or a comma-separated list of such tags, to be allowable within low-level markup and non-block elements.
Non-obligatorily-pairing low-level markup tags in HTML 2.0 are <BR> and <IMG>. (By special dispensation, a <PRE> element is allowed to contain <HR>.)
nonblock=
Defines a pairing tag, or a comma-separated list of such tags, to only contain low-level markup (the difference from lowlevelpair= is that nonblock= elements cannot contain each other). Making an optionally-pairing tag such as <P> (in the default definition) a non-block tag will not in general work, since htmlchek will not assume an implicit </P> before lists, headings, blockquotes, etc. (<DT> and <LI> do work, since they're confined to lists).
Non-block tags in HTML 2.0 include <DT>, headings <H1>-<H6>, and also <LI> within a <MENU> or <DIR> list.
deprecated=
Defines a tag or a comma-separated list of tags as deprecated and obsolescent. If such a tag occurs in the file, there is a warning message in the file-final tag diagnostics.
Deprecated tags in HTML 2.0 include <LISTING>, <PLAINTEXT>, and <XMP> (note that htmlchek doesn't use the special deprecated tag-insensitive mode in parsing within these elements).
tagopts=
Defines allowed options for tags. Uses a different syntax than the above options to htmlchek; here comma separated "tag,option" pairs are themselves separated by colons. So to allow the <P> tag to have the options ALIGN and NOWRAP, one could specify "tagopts=P,align:p,nowrap" on the command line, or in the configuration file.
reqopts=
Defines required options for tags. Uses the same syntax as tagopts=. A limitation of htmlchek is that only one required option can be specified for each tag; since later definitions overwrite earlier ones, this means that "reqopts=TEXTAREA,COLS:textarea,rows" would leave ROWS as the only required option for <TEXTAREA>.
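
The "tag,option" syntax and the one-required-option limitation of reqopts= can be illustrated outside htmlchek with a few lines of shell and awk. This is only a model of the behaviour described above, not htmlchek's own parsing code.

```shell
# The example value from the text: two "tag,option" pairs joined by ":".
reqopts='TEXTAREA,COLS:textarea,rows'

# Split on ":" into one pair per line, then let awk keep a single required
# option per (case-folded) tag, so that a later pair replaces an earlier one.
result=$(printf '%s\n' "$reqopts" | tr ':' '\n' |
  awk -F, '{ req[toupper($1)] = toupper($2) }
           END { for (tag in req) print tag " requires " req[tag] }')
echo "$result"
```

Only ROWS survives for TEXTAREA, because the second pair overwrites the first.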

Beware that the above definitions have the effect of undefining what is incompatible with what you are defining (to avoid logical inconsistencies). For example, if you define "lowlevelpair=p", then the tag <P> will be undefined as a loosely-pairing tag (since this is incompatible with ``lowlevelpair'' status). This means it will be treated as an unknown tag, unless you add an explicit "strictpair=p" or "nonrecurpair=p" declaration.

Configuration File:

Since it is cumbersome to specify long strings on the command line, there is an alternative configuration file mechanism. Specifying configfile=``filename'' on the command line will cause htmlchek to read in options from the file. The same "option=value" units that are recognized on the command line should be specified one per line in the configuration file (note that all lines in the configuration file which do not contain the `=' character are treated as comment lines and silently ignored).

There are some differences between specifying options on the command line and in the configuration file. On the command line, if there are multiple instances of the same "xxx=" option, all but the last will be silently ignored, but in the configuration file such multiple definitions will have cumulative effect. Also the relative order of evaluation on the command line is undefined (if you have both a "strictpair=p" and a "nonrecurpair=p" definition on the command line, you don't know which will override the other), while the order of statements in a configuration file is significant, since later definitions will override previous ones. Also, there can be no spaces or tabs around the `=', `,' or `:' characters on the command line, but this requirement is relaxed in the configuration file.
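
As a sketch of what a small configuration file might look like, the following writes one to a scratch directory. The file name mylang.cfg and the tag name "newtag" are invented for illustration; the first line of the file deliberately contains no `=' so that it counts as a comment.

```shell
workdir=$(mktemp -d)
config="$workdir/mylang.cfg"     # hypothetical file name

# "newtag" is an invented tag name, used purely for illustration.
cat > "$config" <<'EOF'
Local language tweaks: lines with no equals sign are ignored as comments
nonpair = newtag
tagopts = P,align : p,nowrap
EOF

# Lines containing "=" are the ones that would be read as definitions.
definitions=$(grep -c '=' "$config")
echo "$definitions definition line(s) in $config"

# Intended invocation (not run here, since it needs htmlchek.awk):
#   awk -f htmlchek.awk configfile="$config" infile.html > outfile.check
```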

You can include definitions both on the command line and in the configuration file, in which case command line definitions will override those in the configfile= (specify an "arena=off" to override an "arena=1" in the configuration file, and similarly with html3=, htmlplus=, and netscape=). The internal definitions invoked by "arena=1" etc. and "netscape=1" will override definitions specified in the configuration file, but not those on the command line.

Note that the options discussed in the ``Options'' section above (append=, dirprefix=, nowswarn=, refsfile=, sugar=, and usebase=) cannot be specified in the configuration file (nor, obviously, can configfile= itself be specified there). This is because the configfile= is a language definition file, not a user preference file. (If I ever implement a user preference file in a future version of htmlchek, it will be separate from the configfile=.)

Supplemental programs: dehtml and entify.awk

dehtml

dehtml removes all HTML markup from a file so you can spell-check the darn thing. The commoner ampersand entities are translated to the appropriate single characters, so you can spell check if you're writing in a non-English language, and your spelling checker understands 8-bit Latin-1 alphabetic characters. Note that dehtml makes no pretensions to being an intelligent HTML-to-text translator; it completely ignores everything within <...>, and passes everything outside <...> through completely unaltered (except known ampersand entities).

Typical command lines:

awk -f dehtml.awk infile.html > outfile.txt
perl dehtml.pl infile.html > outfile.txt

The shell script file dehtml.sh runs dehtml.awk using the best available interpreter (under Unix):

sh dehtml.sh infile.html > outfile.txt

entify.awk

The relatively tiny entify.awk program translates Latin-1 high alphabetic characters in a file to HTML ampersand entities for safety when moving the file through non-8-bit-safe transport mechanisms (principally non-Mime RFC-822 e-mail and Usenet). This is for the greater convenience of those writing European languages with editors which use Latin-1 characters; entify.awk can be run just before distributing an HTML file externally.

Typical command line:

awk -f entify.awk infile.8bit > outfile.html

(Note that entify.awk doesn't help in checking whether an HTML file is OK, but is rather used as a precautionary measure to prevent the file from being mangled by archaic 7-bit software.)

While dehtml and entify.awk aren't primarily error-checking programs, if they do happen to find errors connected with their functioning, then the error messages are on lines beginning "&&^" which are intermixed with the non-error output.
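
Because the error lines share a fixed prefix, they can be separated from the ordinary output afterwards. A minimal sketch, using an invented stand-in for the program output (the text after "&&^" is a placeholder, not a real message):

```shell
workdir=$(mktemp -d)
mixed="$workdir/outfile.txt"
# Invented stand-in for dehtml output; the diagnostic text is a placeholder.
cat > "$mixed" <<'EOF'
First line of plain text
&&^ placeholder diagnostic
Second line of plain text
EOF

# index() returns 1 when the line starts with the literal string "&&^",
# which avoids any regular-expression quoting of "^".
diagnostics=$(awk 'index($0, "&&^") == 1' "$mixed")
text=$(awk 'index($0, "&&^") != 1' "$mixed")
echo "$diagnostics"
```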

Limitations:

The classification of each problem as being an "ERROR!" or a "Warning!" is a somewhat subjective decision on my part.

The characters `<' and `>' are always treated as special (even inside comments, etc.), which can be considered a bug or a feature -- since some browsers actually pretty much work this way, HTML code should be ruggedized for the real world, not just SGML-ically correct. (Actually, `<' inside a tag is harmless, but `>' will be interpreted as prematurely ending the tag.) SGML constructs outside the scope of HTML proper (such as <!ENTITY...>, <!ELEMENT...>, and <!ATTLIST...>) are not checked for, and can result in error messages.

Ampersand codes like &amp; are not checked against any fixed list of approved HTML/SGML entities, since such lists are rather extensible. If you want to check if any ampersand entities outside of the currently most-commonly recognized set are used, then you can run a file through the separate program dehtml and see if any ampersand codes are left (on Unix, do something like "sh dehtml.sh infile.html | egrep '&#?[0-9a-zA-Z]*;'").
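
The pattern given above can be tried out on a small sample without running dehtml at all. In this sketch the sample text is invented, standing in for dehtml output in which two unrecognized codes were left untouched.

```shell
workdir=$(mktemp -d)
stripped="$workdir/stripped.txt"
# Invented stand-in for dehtml output in which two unrecognized codes
# ("&bogus;" and "&#9999;") were left untouched.
cat > "$stripped" <<'EOF'
ordinary text
a stray &bogus; entity
a numeric &#9999; reference
AT&T has an ampersand but no entity
EOF

# Same pattern as in the text; "grep -E" is the modern spelling of egrep.
leftover_lines=$(grep -E -c '&#?[0-9a-zA-Z]*;' "$stripped")
echo "$leftover_lines line(s) still contain ampersand codes"
```

The "AT&T" line is not counted, since the pattern requires a terminating semicolon.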

Only the commonly-used double-quote character (") is recognized in quoting option values, though the single-quote character (') theoretically should also be able to be used (I think).

Htmlchek tries to enforce the <HTML><HEAD>...</HEAD> <BODY>...</BODY> </HTML> format; if you don't want to add these constructs to your files, then you'll have to learn to ignore the warnings that will inevitably be produced. (Marcus E. Hennecke <marcush@leland.stanford.edu> has posted a perl script, old2newhtml, which automatically adds these tags to a HTML file.) However, I've tried to cut down the output, so that a warning is generally produced only for the first item of each type that is not contained in an <HTML>, <HEAD>, or <BODY> element, and not for every uncontained item in a file.

Error messages are output at the earliest point at which an error becomes detectable, which is not necessarily always the point where the bad code actually occurs (this is particularly true for "improper nesting errors" -- i.e. where there is a pending <x>, so that </x> is expected, but </y> is found instead). Complaints about encountering text where no text should be found (such as immediately within a <head>...</head> element) are deferred until the start of the first following tag.

As with almost any parser or lint-type program that doesn't just give up at the first error, the presence of one real error can generate a cascade of subsequent bogus errors. I've tried to eliminate some of the more redundant repeated errormessages that earlier versions of this program tended to generate in such cases. However, it is still true that sometimes htmlchek can't compensate for an error (particularly a self-nesting error), so that the invalid HTML code it has encountered affects its interpretation of valid HTML code later on in the file -- and some of the subsequent errormessages and warnings for that file may not be useful. The only remedy is to fix the first error (which is the real one), and run the check again.

Checking the same HTML file twice with htmlchek.awk can result in a different ordering of the final tag diagnostics, due to the indeterminacy of the awk "for (x in array) {...}" looping construct.

If you try to implement cross-reference checking on a non-Unix operating system, there won't be any problem for files in a single directory, but cross reference checking across a directory hierarchy does take advantage of the fact that the URL format coincides with Unix-specific filename conventions. For example, if an <A NAME="XXX"> reference occurs in a file which is not in the directory at the top of the tree, what is output to the .NAME file is a concatenation of the string specified in the dirprefix= option, the name of the current file on the command line (after removing an initial "./" generated by the Unix find utility), `#', and the string specified in the NAME="..." option. On Unix, if the dirprefix= string is a valid URL beginning (for example "http://myhost.edu/~myself/"), then the result of this concatenation will also be a valid URL (e.g. "http://myhost.edu/~myself/subdir/somefile#XXX"). On VMS this would produce "http://myhost.edu/~myself/[.subdir]somefile#XXX" or worse. For HREF="..." references, things can get more complicated.
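
The concatenation described above can be reproduced in a few lines of shell, using the example values from the text. This is only a model of the string handling, not the code inside htmlchek.

```shell
dirprefix='http://myhost.edu/~myself/'   # value given to dirprefix=
file='./subdir/somefile'                 # name as produced by find
name='XXX'                               # from <A NAME="XXX">

relative=${file#./}                      # strip the leading "./"
location="${dirprefix}${relative}#${name}"
echo "$location"
```

With a VMS-style file name in place of "./subdir/somefile", the same concatenation would no longer yield a valid URL.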

Version History:

The version history is contained in the file htmlchek.awk.

Author:

Copyright H. Churchyard 1994 -- freely redistributable. This code is functional but not very well commented (and readability has not been improved in htmlchek.awk by pushing all lines which would overflow 80 columns flush left) -- sorry! If you find an error in this program, e-mail me at churchh@uts.cc.utexas.edu.

Common Problems:

If you get an awk error under Unix, the most common problem that people seem to be having is inadvertently running incompatible ``old awk''; also, some vendor-supplied awks under Unix have problems (which can be avoided by using gawk -- see the file README.30). Try using htmlchek.sh (or dehtml.sh for dehtml) and see if the problem goes away.

Meaningless Bitmap Graphic

No Web document would be complete without including a meaningless bitmap graphic.
