Internet Publishing Handbook - Copyright © 1995 by Mike Franks

CHAPTER 4: World-Wide Web

People like color and pictures. Graphics have always been more immediate and compelling than plain text. As the saying goes, a picture is worth a thousand words. The World-Wide Web (WWW, W3, or the Web) certainly brings us pictures--by the thousands--across the Internet. The Web's popularity stems from the ability to mix pictures and text in a way that is similar to print. The promise of the Internet as a new communications medium has been fulfilled more by WWW than by any other Internet application.

Gopher is easy to use and manage. WAIS is more powerful in its searching capabilities. But WWW is the Internet application that makes people sit up and take notice, because of its colorful graphics and text. It also allows limited document formatting and ad hoc organization through hyperlinks. Although it may not be fair to say that WWW has taken over the Internet, WWW traffic has certainly been increasing astronomically. WWW browsers such as NCSA Mosaic and Netscape are so easy to use that they have fascinated and attracted many new Internet users. The ability of WWW to combine colorful graphics with text has led to extremely creative Web sites put up by companies, organizations, universities, and even individuals.

WWW: Pros and Cons

WWW servers allow you to present information with an impressive mix of graphics and text. Hypertext allows easy linking of documents and makes for an extremely flexible publishing environment. However, WWW requires greater bandwidth than Gopher to transport its image files across the Internet, and transport can be extremely slow over telephone lines, even with 14,400-bps modems.

WWW server software is available for a wide variety of computers, including UNIX machines, Windows, Macintosh, and OS/2 operating systems. HTML, HyperText Markup Language, which is the basis for most WWW documents, is easily decipherable and written in plain text, portable between operating systems. In addition, most Web viewers allow immediate viewing and saving of the source HTML for any document. This gives users easy access to examples of HTML code. When you find something you like, you can easily see how it was done in HTML.

WWW uses one easily understood interface to connect the user with most types of Internet resources, including Gopher, WAIS, Finger, telnet, tn3270, and FTP. This permits people with little or no computer experience to start exploring the Internet easily and quickly. The creation and free distribution of the NCSA Mosaic software epitomized the WWW ideal of easy access to Internet resources. Unfortunately, WWW browsers have not kept pace with the Gopher world's advance to Gopher+. Specifically, WWW browsers cannot yet use the Gopher+ online forms.

One serious problem with WWW is that HTML is being used as a page layout language when it really wasn't designed for that. Still, WWW offers a number of advantages:

ADVANTAGES

DISADVANTAGES

WWW servers let you think in terms of publishing an electronic version of a full-color brochure. You can embed images for a variety of purposes: logos, place markers, buttons, or as page accents. The text can be displayed in different sizes and different formats.

How WWW Works

A WWW server works much like a Gopher server: it waits for requests from WWW browsers and then fulfills each request if it can. Like Gopher servers, HTTP servers can send anything, but the information usually consists of text files in HTML with inline images embedded. Where Gopher sends plain text files, which the client can display immediately, WWW browser software must receive and then interpret the HTML that comes through the Web. That's why Web browsers need a higher-powered computer than Gopher browsers do. (The minimum PC for a graphical Web browser is a 386 running Windows with 4MB of RAM.)

The WWW protocol, HTTP, is a stateless client/server protocol, which means that a Web server does not have a long attention span. It receives a request from a client, such as Mosaic, Netscape, or Lynx, and it processes that request and responds with either the information requested or an error message. Aside from the log it keeps of each transaction, it has no "memory" of encountering a particular client before.
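The "short attention span" described above can be sketched in a few lines of Python. This is a hypothetical illustration, not code from any real server: the DOCUMENTS table, the handle_request function, and the log format are all assumptions made up for the example.

```python
# Hypothetical sketch of a stateless server: every request is answered
# using only the request itself and the stored documents.
DOCUMENTS = {"/index.html": "<TITLE>Home</TITLE>"}  # made-up content


def handle_request(client, path, log):
    """Answer one request; the only trace left behind is a log line."""
    if path in DOCUMENTS:
        status, body = 200, DOCUMENTS[path]
    else:
        status, body = 404, "Not Found"
    log.append(f"{client} {path} {status}")  # the transaction log
    return status, body


log = []
first = handle_request("client-a", "/index.html", log)
second = handle_request("client-a", "/index.html", log)
# Both answers are identical: nothing about the first visit by client-a
# influenced the second, apart from the extra line in the log.
```

Because handle_request keeps nothing between calls except the log, it makes no difference to the server whether two requests come from the same client or from two different ones.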

HTTP servers handle a broader range of commands than a Gopher server, and the protocol is slightly more complicated (and still evolving). But they're basically both doing the same thing: providing files or actions upon request, not unlike what the anonymous FTP servers do.

Essential to WWW are Uniform Resource Locators, or URLs, which have become the phone numbers of the Internet. They were invented along with WWW by Tim Berners-Lee as a simple method for describing links to files on other systems. They have proved so useful that they are being incorporated in Gopher servers as well. For a discussion of the related developments of Uniform Resource Identifiers (URIs), Uniform Resource Names (URNs), Uniform Resource Characteristics (URCs), and Uniform Resource Agents (URAs), see Chapter 11. These schemes solve some of the problems with URLs and add some capabilities.

Here's a simple description of what goes on between a WWW server and client. Basically, the process of a user following a Web link goes like this:

  1. A Web client program user selects a link (URL) on a Web menu.
  2. The Web client program reads the URL to get the protocol to be used (Gopher, HTTP, FTP, and so on) and the address and port number of the server to be contacted (if it's the same server, it's considered local).
  3. The Web client contacts the specific server (assume HTTP for this example) using the protocol and port specified.
  4. The Web server accepts the connection.
  5. The Web client sends the remainder of the URL text. The URL text usually consists of a file name and/or a directory, but it could also be the name of a program or text for a database query.
  6. The Web server passes the file or item on to the Web client if it can, or returns an error message if it can't.
  7. The connection is closed.
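
Steps 2 and 5 above can be sketched in Python using the standard urllib.parse module. This is only an illustration under stated assumptions: the function names plan_request and http_request_text and the DEFAULT_PORTS table are hypothetical, and a real client does a good deal more (headers, error handling, other protocols).

```python
from urllib.parse import urlsplit

# Well-known default ports, used when the URL does not name one.
DEFAULT_PORTS = {"http": 80, "gopher": 70, "ftp": 21}


def plan_request(url):
    """Step 2: read the protocol, server address, port, and remainder."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    path = parts.path or "/"
    if parts.query:  # text for a database query, if any
        path += "?" + parts.query
    return parts.scheme, parts.hostname, port, path


def http_request_text(path):
    """Step 5: the text an HTTP/1.0 client sends once connected."""
    return f"GET {path} HTTP/1.0\r\n\r\n"


scheme, host, port, path = plan_request(
    "http://latino.sscnet.ucla.edu/murals/Sparc/SPARC.html"
)
request = http_request_text(path)
```

The sketch stops short of opening a network connection; steps 3, 4, 6, and 7 would wrap a socket around the host, port, and request text computed here.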

If the file is in HTML format, the Web client immediately scans it for inline images. If it finds any, it starts the process over again with separate requests for each of those image files (assuming the user has not turned off Display Images). If the item specified by the link was a program, the Web server would run the program and send the output back to the Web client. (For example, the link might start up a special program to calculate and display the latest server usage statistics.) If the item specified was a database query, the Web client would receive the results of the search. If the item specified was a file that the Web client doesn't know how to display, it can pass it off to a "helper application" or viewer program. Web clients often handle different graphics and sound files this way.
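
The scan for inline images might look something like the following Python sketch, which uses the standard html.parser module. The class and function names are hypothetical, and the sample page is made up; real browsers use their own parsers.

```python
from html.parser import HTMLParser


class InlineImageFinder(HTMLParser):
    """Collect the SRC attribute of every IMG tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def find_inline_images(html_text):
    finder = InlineImageFinder()
    finder.feed(html_text)
    finder.close()
    return finder.sources


page = '<H1>Murals</H1><IMG SRC="logo.gif"><P>Some text<IMG SRC="photo.gif">'
images = find_inline_images(page)  # one further request per entry
```

Each entry in the resulting list would trigger a separate request, which is one reason pages with many images are slow to arrive over a modem.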

If the user wants something else, she repeats the process, starting a new connection each time. From the Web server's point of view, the process is efficient because it doesn't have to keep track of ongoing requests. Answer a question and wait for the next question--that's all it does. This can happen hundreds of thousands of times per day, depending on the popularity of the server.

WWW Features

Although not all WWW servers offer all the features listed here, most will.

WWW Resources

Resources for learning how to use and build Web servers are all over the Internet. Pay close attention to the Usenet newsgroups and e-mail list servers listed in Table 4-1, because they'll be a source of ongoing support, ideas, and information as WWW continues to develop.

Web Data Types

Like Gopher servers, Web servers can deliver many different types of data files in addition to HTML, which is the most common.

HTML

HyperText Markup Language, or HTML, is a method of marking plain text for both structural and layout elements, as well as links to other documents and images, sounds, movie clips, and so on. At first it is confusing to look at, but it is not difficult to learn. And it is completely portable. You can write it on a Macintosh, serve it from a Sun Microsystems workstation, and view it on a Windows PC. Any WWW client should be able to interpret the HTML you write, so long as you conform to the HTML standard. HTML is a specific application of SGML, a more detailed markup language.

SGML

Standard Generalized Markup Language, or SGML, may become a popular language for WWW documents. It produces platform-independent (portable) documents and provides a consistent scheme for describing a wide variety of documents. Many large organizations have used SGML extensively, and there is a movement to create SGML viewers that can be hooked up to WWW browsers as helper applications. One advantage is SGML's ability to fully define every structural element in a body of text, a useful feature when automated programs are designed to go out and retrieve specific sections of documents for you, such as a table of contents. But SGML can only define a document's structure. To be useful, it has to work with a viewer program or some other tool that takes advantage of its document-structuring capabilities.

PDF

Portable Document Format, or PDF, is the name for Adobe Systems' method of representing documents in a manner independent of the original application, software, hardware, and operating system used to create those documents. In other words, it doesn't matter what type of machine or program created the document--it will show up virtually the same on any type of machine with a PDF viewer.

Gopher

Because WWW was designed to pull all existing Internet services together in a "web," it should be able to connect to Gopher servers. The problem is that most Gopher servers have evolved to the Gopher+ standard, and WWW clients haven't. Gopher+ forms, especially ASK blocks, bomb with WWW clients. At this point it's not clear that anyone is working to solve this problem.

WAIS

WAIS, or Wide Area Information Server, is another Internet resource that WWW was designed to work with. Chapter 5 details the advantages and disadvantages of WAIS.

GIF

GIF, or Graphics Interchange Format, is a method of storing graphics made popular by CompuServe, a commercial online service. It's notable here because GIF is the graphics type supported by most WWW browsers for inline images. That is, if you publish your images in GIF format, most people using graphical Web browsers won't have to do anything special to see them. GIF does a good job with crisp, sharp images (like icons), whereas JPEG is better for realistic images, such as scanned photographs. A controversy erupted in early 1995 when Unisys, the patent holder for LZW compression, a key element of GIF, decided to charge CompuServe and its developers for the right to use this part of GIF. End users did not have to pay anything, and although talk about creating a new image standard is flying, GIF is a mainstay on WWW for now because all graphical Web browsers can display inline GIF images.

TIFF

Tagged Image File Format (TIFF) is a widely used image format often associated with scanning software. Some would like WWW browsers to automatically display inline TIFF images, which is how the browsers handle GIF images now. One problem is that TIFF image files are huge in comparison to GIF images. Remember to transfer TIFF files in raw format to avoid problems.

CGI Scripts

Technically, Common Gateway Interface scripts, or CGI scripts, are not a data type. CGI is a way to have the Web server run a program, shell script, or Perl script that performs some action when a link to it is chosen. This can be as simple as accepting information from an online form and mailing it to the Web manager. CGI scripts are behind most of the interesting interactive effects on the Web.
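
To give a feel for what such a script does, here is a minimal sketch in Python (the same idea applies to shell or Perl). It relies on two CGI conventions: the server passes the query text in the QUERY_STRING environment variable, and the script's output begins with a Content-Type line followed by a blank line. The function name cgi_response, the "name" form field, and the greeting are assumptions invented for the example.

```python
import html
from urllib.parse import parse_qs


def cgi_response(environ):
    """Build the full CGI output for one request.

    `environ` stands in for the process environment (os.environ)
    that the Web server would set up before running the script.
    """
    query = parse_qs(environ.get("QUERY_STRING", ""))
    name = query.get("name", ["stranger"])[0]  # hypothetical form field
    safe_name = html.escape(name)  # never echo raw user input into HTML
    body = f"<TITLE>Hello</TITLE><H1>Hello, {safe_name}!</H1>"
    # A CGI script announces the data type, then a blank line, then the data.
    return "Content-Type: text/html\n\n" + body


sample = cgi_response({"QUERY_STRING": "name=Ann"})
```

In a real script you would pass os.environ to the function and print the result to standard output, which the server relays to the Web client.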

Plain Text

Plain text files can be stored on Web servers, but they won't look as good as HTML text. This is the main disadvantage of Web servers. Once users get a look at nicely formatted HTML files, they're no longer satisfied with a plain text file, which is what they usually see on a Gopher or WAIS server.

PostScript

PostScript is a printer-independent language (that is, it doesn't matter which printer so long as it is PostScript compatible), also from Adobe Systems. PostScript printers can print the same document the same way, no matter what platform or program created the file. This is an excellent and common method for providing high-quality text printouts. Viewers that will display PostScript on screen also are available.

Sound Files

Several different sound file formats are available, and none works on all platforms. Ulaw compression of sound files is standard for Sun Microsystems workstations. Wav and au are two other common formats.

JPEG

JPEG (Joint Photographic Experts Group) is a system for compressing still images that is better than GIF for full-color photos. A JPEG image usually has to be downloaded and opened in an external viewer before it can be seen, although some WWW browsers, like Netscape, handle inline JPEGs.

MPEG

MPEG (Moving Picture Experts Group) is a method for compressing movie images. MPEG files require an external viewer (an add-on program that the Web client uses to display these images) before they can be viewed.

HTML--HyperText Markup Language

HTML is the essential ingredient in building a Web server. It can be read on almost any computer system and is relatively easy to learn. However, as you'll see, it raises expectations that it is not designed to fulfill. That is, HTML is the first computer code to offer a broadly accepted method for creating and displaying hypertext with images and text (and sound and movie files) in the same document. This means that you can create a multimedia document that is easily transported across the Internet. Unfortunately, the very attractiveness of the hypertext leads many to the false conclusion that they can further control the appearance of these Web pages. HTML was not designed to do that, although strides are being taken in that direction. HTML 3.0 offers many more formatting options than HTML 2.0.

HTML assumes that different Web browsers display documents differently, according to the needs of the operating system and equipment for which they're designed. The most dramatic difference is between Lynx, a text-mode WWW browser, and Mosaic or Netscape. Lynx cannot display images at all, and it can't handle different font sizes or even italics, because it was designed for the VT100, a common dumb terminal. Lynx is a common way to reach the Web because it requires only a telnet connection, which is possible over even the slowest modems.

As with Gopher, HTML has been evolving into newer versions. This process can be slow, because even when everyone agrees that improvements are necessary, it is hard to get all parties concerned to agree on how the improvements should be made. HTML 2.0 was completed in the spring of 1995, and 3.0 and 3.1 are under discussion. Some improvements include the ability to display tables and mathematical equations.

What HTML Is and Is Not

The purpose of HTML is one of the most hotly discussed issues in the Web world. Technically speaking, HTML is a specific application of SGML, which grew out of IBM's Generalized Markup Language and became an international standard (ISO 8879) in 1986 to aid in electronic publishing and manipulation of text. The main goal of SGML is presentation independence for any given body of information. That means that the information might be displayed as text or a graphic outline of that text or transposed to audio output. Text and documents by themselves are amorphous objects. SGML would tightly bind them to a structure that could be manipulated in various ways.

SGML does not concern itself with the layout or final form in which the text will appear. According to SGML philosophy, that is more properly handled by the user or by the client program displaying the document. Because different clients have different types of machines with different capabilities, SGML purists say that it is foolish (and just plain wrong) to try to specify in any way how the text will be displayed in the final form. Instead they concentrate on identifying every piece of a document.

On the other hand, many, if not most, publishers were drawn to the Web by its ability to present attractive mixes of graphics and text. They constitute the other side of the controversy. They are using the various elements of HTML to create the image or presentation they want. In fact, multimedia authors are pushing to enlarge these abilities of HTML, to add things like centering and tables. Unconcerned with the SGML history of HTML, multimedia authors simply want to publish in this new medium and have more control over the way things are displayed.

Naturally, SGML purists feel that HTML's use as a page layout language is completely inappropriate and in fact a perversion of the pure SGML concept. To give an example: <H1> is the top header element in HTML. To some Web authors this is a way to emphasize text--make it bold and big and stand out. (Not all Web clients display it that way, but most do.) However, SGML purists see <H1> as a structural marker for the document, the principal header at the head of the main section. It's only incidental that some browsers display it in big bold type. Its significance lies only in its identification of the main section--nothing else.

Although the SGML purists may sound overly restrictive, in the long run they are probably right in suggesting that strict adherence to these principles will make these documents more useful for future document retrieval techniques. Unfortunately, those techniques aren't here, or at least in common use, yet. And it's much easier to pay attention to how your document looks than to how it might be logically parsed.
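
Here is a small taste of what structural markup makes possible: a Python sketch that pulls a table of contents out of a page simply by collecting its header elements. The class and function names and the sample page are hypothetical, and the approach only works if authors use <H1> through <H6> as structural markers, which is exactly the purists' point.

```python
from html.parser import HTMLParser


class OutlineBuilder(HTMLParser):
    """Collect (level, text) pairs for every H1..H6 header in a page."""

    HEADERS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None  # header level we are inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADERS:
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADERS and self._level is not None:
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None


def table_of_contents(html_text):
    builder = OutlineBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.outline


page = "<H1>World-Wide Web</H1><P>Intro<H2>How WWW Works</H2><P>Details"
toc = table_of_contents(page)
```

If an author had used <H1> merely to make a sentence big and bold, that sentence would show up in the outline as if it were a chapter title.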

Conforming to HTML Standards

You can get into some bad habits writing HTML. In the interest of improving the standard of Web pages, Mark Gaither of HaL Software Systems has created a free online HTML conformance-checking service <http://www.halsoft.com/html-val-svc/> on the Internet. A similar one, also free, is located at the Georgia Institute of Technology <http://www.cc.gatech.edu/grads/j/Kipp.Jones/HaLidation/validation-form.html>. You can either submit your HTML text to be checked or give the service a URL, which it will then retrieve and check before sending its comments back. The key point here is that writing clean HTML that conforms to current HTML standards helps ensure that your documents will display correctly in all compliant browsers.

Gaither, author of "Why Validate Your HTML Documents" <http://www.halsoft.com/html/whyvalidate.html>, has commented that one reason to ensure HTML compliance is so that newer and better browsers can be written with the assurance that most HTML pages on the Web are correct. That makes the job of designing a new browser easier, because its designers don't have to continually allow for variations and mistakes in HTML layout. The writers of browser programs will be able to add core functions and greater sophistication if they can be sure the browsers will encounter well-structured HTML. Correctly written HTML facilitates searches, reformatting, document portability, and extracting or excluding pieces of documents.

A paper titled "Designing a Web of Intellectual Property," by Terje Norderhaug and Juliet M. Oberding, a San Diego lawyer, was presented at the Third International WWW Conference in Darmstadt, Germany, in April 1995. The paper called for the ability to "inline" pieces of documents as a way to cite someone else's work without infringing on their copyright <http://www.igd.fhg.de/www/www95/papers/95/webip.html>. To use a piece of another document "inline" would require the ability to extract and retrieve specific sections of someone else's HTML to be used in your document.

Writing HTML Quickly

One of the best tools for learning to write HTML quickly is the capability of all Web browsers to view and capture the source HTML of any document you find on the Web. That means that if you see something you like, you can look at it to see how it was done, and save it for further reference. Unfortunately, there's no guarantee that the examples you find this way will be clear, documented, or even good examples. But as with learning a foreign language, look at a wide variety of examples and pay attention to how they are used in different situations.

Another quick way to get started is to convert an existing word-processing document into HTML, using some of the filters and converters discussed later in this chapter. LaTeX, Word for Windows, and WordPerfect documents can be converted this way, although sometimes you have to use a two-step process (first saving them in RTF, or Rich Text Format). Once you've done that, making changes in your original word-processing document and converting again is an easy way to create useful Web documents quickly. Eventually, you'll need to learn HTML codes to handle problems that the converters and filters don't handle, but you can accomplish a lot by going this route first.

EasyHTML is a program from NCSA (National Center for Supercomputing Applications--creator of NCSA Mosaic) that lets you create and save HTML pages interactively by filling out online forms on a Web server. Users can create HTML pages by filling out forms in a Web browser like Mosaic, Netscape, or Lynx. The user answers questions and creates an HTML page one step at a time, with immediate feedback after each step, to show what the HTML page will look like. You can even redo parts of the page. Finally, users are given two ways to save their files and are encouraged to use both. One is straight HTML and the other is as a form, so that EasyHTML can be used to re-edit the file. EasyHTML doesn't let you use the full set of HTML commands, but it lets your users create simple HTML documents quickly and easily. <http://peachpit.ncsa.uiuc.edu/easyhtml/>

Several good online tutorials and guides to HTML are listed in Table 4-1. Also, some books have come out on HTML, and many others are being written.

Guides to Writing HTML

An easy way to find out what mistakes others are making and avoid them is to search for sites that have gone to the trouble of compiling common errors and mistakes in writing HTML. I found these valuable links, and you will probably find more by searching on the words learning HTML.

The most common mistake in HTML is to believe that the way your documents appear on your browser is the way that they'll appear on every other Web-browsing program. It is quite tempting to try to stretch the limits of HTML and make your document look a certain way. Your efforts may not carry over to a different Web browser. Because HTML resembles a page layout language, it is often hard to resist trying to use it that way. A commonsense way to avoid this trap is to check your HTML documents with several different browsers.

Relative Versus Absolute Addressing

Not all URL links have to be absolute, or completely written out. It is often more useful to simply refer to other documents or files "relative" to the current server, directory, and document. For example, this link is explicit:

<http://latino.sscnet.ucla.edu/murals/Sparc/SPARC.html>

And this link is relative, because it assumes the file is in the same directory as the file that does the linking (this trick does not work when linking to another server, of course):

<sparc2.html>

Note how short the second URL is. Many documents have at least 10 to 20 links, which is why relative links are popular. Also, the link will still work if both files are moved together to a new directory for some reason. For example, if our server gets too popular and we're forced to move it to a new machine, or a different directory, the relative links won't need updating. Just move the whole directory of files and set it down in the new site.
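
The way a browser turns a relative link into a full URL can be tried out with the urljoin function from Python's standard urllib.parse module. The first base URL is the one from the example above; the second, on example.org, is a made-up location standing in for "the new site."

```python
from urllib.parse import urljoin

# The document that contains the link, and the relative link inside it.
base = "http://latino.sscnet.ucla.edu/murals/Sparc/SPARC.html"
resolved = urljoin(base, "sparc2.html")

# After moving the whole directory to a hypothetical new home, the same
# relative link resolves against the new location without any editing.
new_base = "http://example.org/archive/murals/SPARC.html"
moved = urljoin(new_base, "sparc2.html")
```

In both cases the relative link is resolved against the directory of the linking document, which is why nothing inside the HTML has to change when the files move together.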

SGML Instead of HTML?

Many Web sites are using SGML for the primary storage of their Web documents and then converting them to HTML. The advantage is that SGML is a fully established, commercially accepted method of defining text files.

Several commercial SGML editors are available:

HTML Editors, Filters, and Converters

Table 4-2 offers a partial list of HTML editors available for creating HTML documents. This is an extremely fast-growing field, so watch for new entrants. The first two items in Table 4-2 are sites that list HTML editors.

One common complaint while I was writing this was that many HTML editors (especially those for Windows) are limited to files that are smaller than 32K. Although that's sufficient for quite a few screenfuls, it often is not enough for large documents. (For example, my bookmark file is up to 95K right now.) One solution I've heard of is to split large HTML files into pieces and then edit those pieces separately. Newer editors will probably remove this limitation.
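
The splitting workaround is easy to automate. The following Python sketch breaks a large file into pieces no bigger than a chosen limit, cutting only at line boundaries. The function name and the 32K default are assumptions for illustration; note that a single line longer than the limit would still end up in an oversized piece of its own.

```python
def split_into_pieces(text, limit=32 * 1024):
    """Split text into pieces of at most `limit` bytes, at line breaks."""
    pieces, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        line_size = len(line.encode("utf-8"))
        if current and size + line_size > limit:
            pieces.append("".join(current))  # close the piece that is full
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        pieces.append("".join(current))
    return pieces


big_file = "<P>line\n" * 12000  # about 94K of made-up HTML
pieces = split_into_pieces(big_file)
```

The pieces are not complete HTML documents on their own; the idea is to edit each one and then join them back together in order.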

Another aspect of HTML editors that you should consider is whether they discourage or prevent you from making mistakes in HTML. SoftQuad's HoTMetaL (see Table 4-2) won't let you write incorrect HTML. Getting it to accept preexisting HTML that comes with errors can be quite frustrating, however. You can also use an HTML validation service (see Table 4-1) to check your HTML files for errors.

Web (HTTP) Servers

When choosing a Web server (technically they're called HTTP servers), the first place to check for online information is the WWW FAQ by Thomas Boutell at <http://sunsite.unc.edu/boutell/faq/www_faq.html>. Another source is Paul Hoffman's WWW Servers Comparison Chart at <http://www.proper.com/www/servers-chart.html>. This chart documents at least nine Web servers, both free and commercial, and compares their features.

As with Gopher servers, you'll need to address at least five concerns when choosing a Web server:

Another concern is ease of operation. You may find it worthwhile to first set up a Web server on a PowerMac or Windows machine, learn what's involved, and try out your publishing ideas. Then, when you are more comfortable and usage warrants, you might move up to a more powerful machine.

Some servers can run as proxy servers. This means that if your organization has a firewall, you could have a proxy server sitting on the firewall, passing out requests to the rest of the WWW and saving the documents returned (called caching) so that they'll be ready for the next person who asks.
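
The caching idea can be sketched in a few lines of Python. This is a simplified illustration, not how the CERN server or any other proxy is actually written: make_caching_proxy and fake_fetch are hypothetical names, and a real proxy must also decide when a saved copy has gone stale.

```python
def make_caching_proxy(fetch):
    """Wrap a fetch function so repeated requests are served from a cache."""
    cache = {}

    def get(url):
        if url not in cache:  # first request: go out to the Web
            cache[url] = fetch(url)
        return cache[url]  # later requests: answer from the saved copy

    return get


calls = []  # records every trip beyond the firewall


def fake_fetch(url):
    """Stand-in for a real network fetch, so the sketch runs offline."""
    calls.append(url)
    return f"<TITLE>{url}</TITLE>"


proxy = make_caching_proxy(fake_fetch)
first = proxy("http://www.w3.org/")
second = proxy("http://www.w3.org/")  # served from the cache
```

After the two requests above, calls contains a single entry: the second person to ask got the saved document without any traffic leaving the organization.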

Security provisions might be important to your organization if you want to limit the access to some sections of your Web server. Some commercial servers now on the market provide a secure transaction-processing environment, but they require similar clients. This area will almost certainly expand in the near future. See Chapter 7.

The section that follows lists many existing WWW (HTTP) servers and describes some of their various features. Some aspects to check out include the load, or number of requests that can be handled in a day; how easy it is to add information; the ability to limit the number of connections depending on the load; and the general popularity of the server (assuming that the more popular servers tend to get better user support from the Web community).

As of May 1995, in my humble opinion, the CERN and NCSA servers are by far the most popular Web servers, with GN, MacHTTP, and Windows-based servers coming behind. Now that commercial vendors are starting to make servers available, I expect that these rankings will change drastically. I expect a great increase in the use of Macintosh, Windows NT, and OS/2 as platforms as more and more individuals start putting up Web sites on desktop machines.

UNIX

Until recently, most Web servers ran on UNIX, because most of the Internet was developed on UNIX, an extremely flexible and powerful operating system. Many Web servers and tools for Web management were developed first on UNIX and later rewritten for (ported to) other operating systems. The drawbacks to using UNIX-based WWW servers are that you need to be sure you have good UNIX support, and it can be a complicated and arcane field. Some companies, notably Sun Microsystems and Silicon Graphics, are working to make this easier, because they are in the business of selling UNIX hardware. For a comparison of features of UNIX-based servers, see <http://mistral.enst.fr/~pioch/httpd/>.

CERN HyperText Transfer Protocol Daemon

The CERN HyperText Transfer Protocol Daemon (HTTPD) server is available at no charge for UNIX and VMS and provides the ability to run as a proxy server with document caching for faster access. <http://www.w3.org/hypertext/WWW/Daemon/Status.html>

NCSA HyperText Transfer Protocol Daemon

The NCSA UNIX server is available at no charge and runs on UNIX systems. Server administrators can decide whether they want to permit their machine's users (account holders) to write their own HTML files in their own directory (as opposed to the server's data directory). These can be referenced with a tilde in front of the user name. For example, <http://someplace.com/~smith/> would get you to Smith's home page directory. This makes Web file updating extremely easy for users. They don't need special rights in other places. <http://hoohoo.ncsa.uiuc.edu/docs/>
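
The tilde convention amounts to a simple mapping from URL paths to directories on disk, which the following Python sketch illustrates. It is not NCSA's code: the function name and the directory names (/www/docs, /home, and public_html) are assumptions chosen for the example, and a real server takes them from its configuration files.

```python
import posixpath


def resolve_path(url_path, document_root="/www/docs",
                 home_root="/home", user_dir="public_html"):
    """Map the path part of a URL to a file name on the server."""
    if url_path.startswith("/~"):
        # /~smith/pets/cat.html -> user "smith", rest "pets/cat.html"
        user, _, rest = url_path[2:].partition("/")
        base = posixpath.join(home_root, user, user_dir)
    else:
        rest = url_path.lstrip("/")
        base = document_root
    full = posixpath.normpath(posixpath.join(base, rest))
    # Refuse paths that climb out of the permitted directory with "..".
    if full != base and not full.startswith(base + "/"):
        raise ValueError("path escapes the served directory")
    return full


home_page_dir = resolve_path("/~smith/")
server_page = resolve_path("/index.html")
```

With these assumed settings, Smith updates her pages just by saving files under public_html in her own home directory; no access to the server's data directory is needed.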

GN

The GN server is a free combination Gopher/WWW server by Professor John Franks of Northwestern University. It runs on UNIX and has a loyal group of users with an active support mailing list. He stopped development before Gopher+, and now he's moved on to a WWW-only server called WN, which you may want to consider if you don't need Gopher. Both of his servers make it easy to do indexing. <http://hopf.math.nwu.edu:70/>

Netscape

Netscape Communications Corporation offers two commercial UNIX-based Web servers, Netscape Communications Server ($1,500; free to nonprofit educational and charitable groups) and Netscape Commerce Server ($5,000). Commerce Server makes possible secure financial transactions (for example, paying with a credit card over the Internet) with clients using their Netscape browser. <http://www.netscape.com/>

Open Market WebServer and Secure WebServer

Open Market offers two commercial UNIX-based Web servers, one for information services ($1,495) and the other (Secure WebServer, $4,995) for electronic commerce. A support contract, which costs $999 per year, is required for the commerce version. The Secure WebServer supports both Netscape's Secure Socket Layer (SSL) and Enterprise Integration Technologies' Secure HyperText Transfer Protocol (S-HTTP). They support 1,000 concurrent users and are extensible so you can add your own programs. <http://www.openmarket.com/>

Plexus

The Plexus Web server is UNIX based and written in Perl by Tony Sanders based on an earlier version by Marc Van Heyningen at Indiana University. It is in the public domain and has authorization and access control. <http://www.bsdi.com/server/doc/plexus.html>

WN

The WN server by Professor John Franks of Northwestern University is a free HTTP server with many features, including title and keyword searching. It can return a range of lines (portion of the text) in a text file. <http://hopf.math.nwu.edu/>
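
Returning a range of lines is simple in principle, as this Python sketch shows. It illustrates the general idea only and says nothing about how WN itself implements or addresses line ranges; the function name and the 1-based, inclusive numbering are assumptions.

```python
def line_range(text, first, last):
    """Return lines first..last of text (numbered from 1, inclusive)."""
    lines = text.splitlines(keepends=True)
    return "".join(lines[first - 1:last])


document = "line one\nline two\nline three\nline four\n"
excerpt = line_range(document, 2, 3)
```

A server offering this feature lets a link point at just the relevant portion of a long text file instead of making the reader download all of it.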

Apache

Apache is a public domain "patched" (improved) version of NCSA's 1.3 HTTPD Web server. It was created by a group of WWW providers and part-time HTTPD programmers to get HTTPD to behave the way they wanted it to. It is 100 percent compatible with the existing NCSA 1.3 HTTPD. <http://www.apache.org/>

EIT Webmaster's Starter Kit

Enterprise Integration Technologies (EIT) has put together the Webmaster's Starter Kit, which is based on the NCSA HTTPD server for UNIX. This free Web server installs itself and optional enhancements as you answer the questions it asks via online forms.

The enhancements extend NCSA's HTTPD in several ways, including virtual document configuration options, automatic server monitoring and restarting, request prioritization, polite errors, and polite down time. Polite errors means the server can be configured to return a custom document for error responses. Polite down time means that all requests can be redirected to another server during maintenance down time. The Starter Kit also provides assistance in installing shareware tools such as libcgi, webtest, hypermail, and getstats. <http://wsk.eit.com/wsk/doc/>

VMS

VMS (Virtual Memory System) is Digital Equipment Corporation's multi-user, multitasking operating system. It runs on DEC's VAX series of computers, from its smallest minicomputer to its biggest mainframe.

CERN HTTPD

The CERN HTTPD server is available at no charge for VMS and provides the ability to run as a proxy server with document caching for faster access. <http://www.w3.org/hypertext/WWW/Daemon/Status.html>

Region 6 HTTP Server

This WWW server for VMS, written by David L. Jones of Ohio State University, is said to offer a performance advantage over the CERN version, because it runs with DECthreads, which allows it to serve multiple users simultaneously. <http://kcgl1.eng.ohio-state.edu/www/doc/serverinfo.html>

Macintosh

Macintoshes are user-friendly places to set up WWW servers, and they are gaining in popularity. Running a Web server on a Macintosh epitomizes the ease-of-publishing trend. It allows those who use Macintoshes to set up a Web server on the type of computer they're used to and conveniently make their information available worldwide. And for small and medium-sized Web sites, a Macintosh works as well as a more powerful (and expensive) UNIX machine, according to Joe Holmes of Sonoma State University in California. His comparison of Macintosh and UNIX performance as a Web server is available at <http://www.sonoma.edu/btools/theTest.html>; reactions and further commentary are available at <http://www.sonic.net/net.dreams/word/current.html>. One of his main points is that while UNIX workstations are generally much more powerful than Macintoshes, they come with a large cost in UNIX support staff. If you don't have that staff in-house, you should definitely consider using Macintoshes as Web servers, especially if you don't expect yours to be the largest site on the Internet.

By using AppleScript and MacPerl, you can manage quite a bit of customization. MacTCP (the underlying software that communicates with the Internet) is the factor that limits the Web load that Macintosh-based servers can handle, but Apple is working on a replacement. That should improve things considerably. For excellent resources and data on Macintosh WWW systems and servers, look at the "Macintosh WWW Development Guide" by Jon Wiederspan of the University of Washington <http://www.uwtc.washington.edu/Computing/WWW/Mac/Directory.html>.

Another alternative for mounting a Web server on a Macintosh or PowerMac is to run A/UX (Apple's UNIX) instead of the normal Macintosh operating system. Then you would choose Web server software listed in the section on UNIX Web servers.

WebSTAR (Formerly MacHTTP)

The WebSTAR/MacHTTP Web server for Macintosh was originally written by Chuck Shotton of BIAP Systems <http://brain.biap.com> and has some enthusiastic users. It apparently can run on every Mac from a Mac Plus through a PowerMacintosh. MacHTTP 3.0 has been renamed WebSTAR ($295 educational, $795 other, from StarNine Technologies; <http://www.starnine.com/>) and speeded up and enhanced considerably. WebSTAR allows for multiple simultaneous transfers; CGI scripts in AppleScript, MacPerl, or any other language; directory and page password security; and setting maximum simultaneous connections. It is available in both 68K and PowerMac native versions. (Native means it was written specifically to take full advantage of that model CPU.) It is comparable to commercial Web servers on other platforms. Commerce and security add-ons will be available in 1995. MacHTTP 2.2 is still available from StarNine Technologies ($75 educational, $95 other).

Netwings

Netwings is an HTTP server for Macintosh built on the 4D database system. More than just a Web (HTTP) server, it provides most functions you would want from an Internet server, such as e-mail, mailing list, and database services. It also provides systemwide security, interactive forms, Internet visitor tracking, built-in database management, and a report generator. A one-user license costs $1,495 and prices escalate from there: five-user license, $7,150; 25-user license, $33,650; 50-user license, $59,500; 100-user license, $104,650. Government, educational, and medical (nonprofit) customers may apply for a 25% discount. <http://netwings.com/>

HTTPD4Mac

HTTPD4Mac is a bare-bones Macintosh Web server written by Bill Melotti and is not a port of either the NCSA or CERN servers. It does not support maps or other CGI applications, but it is free. <http://130.246.18.52/>

MacCommon Lisp Server

You can now interface your Lisp programs to the world to show exactly what you can do better and faster in Lisp (a programming language). The server is a free, full-featured server (HTTP 1.0 and HTML 2.0) that comes complete with source code. <http://www.ai.mit.edu/projects/iiip/doc/cl-http/home-page.html>

Windows NT

Windows NT 3.5 is now a serious choice for many Web administrators primarily because of the ease of installation and administration. Windows NT also offers the ability (on some servers) to interact with Visual Basic and data in regular Windows applications, such as Microsoft Excel and Access.

EMWAC HTTPD for Windows NT

Both a freeware and a professional Web server for Windows NT ($1,995 in the United States; $2,490 on the international market) are available from the European Microsoft Windows NT Academic Centre (EMWAC) at the University of Edinburgh. The professional version adds authentication, access control, virtual paths, a proxy server, and redirection. (Virtual paths let you serve files from more than one tree, or directory, at a time. Redirection means automatically retrieving a file from its new location.) <http://emwac.ed.ac.uk/html/internet_toolchest/https/contents.htm>

SAIC-HTTP

A noncommercial license is available for the Web server for Windows NT developed by San Diego-based Science Applications International Corporation (SAIC). Among other features, the server allows multiple hosts on the same machine and on the same port. <http://wwwserver.itl.saic.com/features.html>

Netscape

Netscape offers a Windows NT version of its Communications server that has the same features as the UNIX version. It is free to nonprofit educational and charitable groups, but costs commercial users $1,500. See <http://home.netscape.com/> for details.

Folio Infobase Web Server

Folio Corporation is licensing the Edinburgh University Computing Service to provide an HTTP server that runs on Windows NT and generates HTML pages on the fly from Folio infobases. Infobase is Folio's term for its extremely fast and efficient full-text searchable database system that first appeared as the software behind Novell NetWare online manuals. Since then infobases have been created for everything from the U.S. Tax Code to WordPerfect manuals to the Bible, and this server will make it easy to put them onto the Internet. This product also solves a problem for Folio, whose customers long have sought a version that runs on UNIX. By adding this feature to Web server software, Folio will enable its UNIX users to browse the Folio infobases through their Web browsers. Those already producing Folio infobases will find this Web server appealing. It costs $6,995. <http://www.folio.com/>

WebSite

WebSite ($499) is a 32-bit WWW server for Windows NT 3.5 released in May 1995. It also runs on Windows 95 and has access-control features for restricting access to authorized users. It was developed by O'Reilly and Associates in cooperation with Bob Denny and Enterprise Integration Technologies, Inc. (EIT). <http://www.ora.com/gnn/bus/ora/item/website.html>

InterNotes Web Publisher

Lotus Development Corporation has announced that it is selling a Web server ($7,500) that runs on Windows NT and acts as a gateway between its popular Lotus Notes system and the WWW. Lotus Notes is a proprietary client/server platform that allows businesses to share and organize information among all their computer sites. Lotus Notes works whether they are linked by network or only occasionally by modem. Lotus Notes includes data replication and synchronization (keeps versions of the database up to date on all machines), as well as full-text search, and has support for Macintosh, DOS, Windows, and OS/2 clients.

InterNotes Web Publisher translates Lotus Notes documents and databases into HTML and delivers them in response to WWW queries. It also automatically creates HTML pages of Lotus Notes views to create an easy way for browsers to navigate among documents on the Web site. Lotus Notes document links become hypertext links. Bitmaps in Lotus Notes documents are converted into inline GIF files. Attachments to Lotus Notes documents are preserved and can be downloaded from the server. <http://www.lotus.com/inotes/>

DOS and Windows

DOS runs on a minimal PC so it is ideal for extremely inexpensive servers. Windows has the advantage of being widely used (even if it's not always stable), and its user-friendly features make it an easy platform with which to start Web publishing.

Hype-It 1000

Hype-It 1000 ($549) and Hype-It 2000 ($1,995) are commercial Web servers that run on a 386 PC running DOS and support 30 simultaneous connections. The server provides full-text searching of the document database, as well as FTP, e-mail, and telnet services. Connections to the Web server are logged into a separate database. The company also offers Web administration training and advertising for your server at an extra charge. <http://cykic.com/homepage.htm>

KA9Q NOS HTTP

KA9Q NOS is a full-fledged Network Operating System (NOS) that runs under DOS and acts as a server for e-mail, FTP, Gopher, WWW, and CSO (Central Services Organization). It performs best on 386s but can be recompiled for 8088s. It is free to educational users; contact the program author for costs to other users. <http://inorganic5.chem.ufl.edu/ka9q/ka9q.html>

HTTPD for Windows by Robert Denny

This Windows version of HTTPD has most of the features of the popular UNIX version, including CGI scripts. It is free for personal, noncommercial use and $99 for commercial use. <http://www.city.net/win-httpd/>

Other Platforms

HTTPD for the Amiga

This is a port (or transfer) of the NCSA HTTPD server to the Amiga computer platform. It is available free with Amiga Mosaic. <http://www.Omnipresence.com/amosaic/2.0/>

GLACI HTTPD

The Great Lakes Area Commercial Internet (GLACI) HTTPD server runs on Novell Netware (3.x and up) as an NLM (Netware Loadable Module). It can be configured with IP access lists and also to allow users to store their personal HTML documents in their Novell home directory. It supports clickable image maps. <http://www.glaci.com/info/glaci-httpd.html>

Webshare CMS HTTPD

This free server by Rick Troth is written to run on the VM/CMS operating system. It supports CGI scripting, user-defined Web spaces, redirection, transaction logging, forms support, and image maps. <http://ua1vm.ua.edu/~troth/rickvmsw/rickvmsw.html>

Deciding Who Runs Your Web Server

This is a good time to remind you of the difference between a system administrator and a data librarian. A system administrator keeps track of the technical details of the operating system and of the computer on which your server runs. The data librarian organizes, updates, and maintains the integrity of the data files and Web links to other sites. Neither description may fit you exactly, but whereas the cooperation of a good system administrator is essential for the UNIX Web servers, much of the job of maintaining your Web server will fall to the data librarian. This duty often becomes a responsibility of system administrators by default, but that lamentable practice need not continue.

Take control of your data. You should know what makes your server special. You should know the working rules for your server. And you should know its weaknesses. I am not saying that you have to have a data librarian to run your server. But I am saying that you ought not leave it to your system administrator and other technical personnel. So get your system administrator to read the documentation and go over the documentation with you, so that you know what your server is doing. What follows describes in plain English how Web servers can be configured.

Installing a Web Server

You have three areas to consider, no matter which Web server you install:

  1. Server configuration
  2. Resource configuration (data and documents)
  3. Access control (security)

Server Configuration. Configuring your server requires you to specify where the server will store its log files, what port it will answer (normally 80), an e-mail address for the server administrator, the root directory in which the server program files reside, the host name that the server should give out, how long the server should wait for a client once a transaction has started, and so on. None of these is inconsequential, but usually you'll leave them for your system administrator.

You'll come across another Web server configuration issue with UNIX Web servers. They have two options for startup: inetd (Internet Daemon, which handles requests from the Internet) and standalone. Standalone means that once started, the server sits waiting for requests from Web clients. With inetd the program doesn't run until the moment an Internet connection comes in to the Web server. Then it starts up, fulfills the request, and closes down. The extra startup time can be a disadvantage for heavily loaded servers. On the other hand, if you don't see much WWW traffic, it seems a shame to have the Web server daemon always running. Also, inetd has some security features that can be useful. However, both CERN and NCSA recommend standalone instead of inetd for their server software because of the increased speed.

Resource Configuration. Resource configuration means specifying where your Web documents can be located. NCSA's HTTPD server has some interesting directives: UserDir gives the directory name for public HTML files in user directories. The default is public_html. UserDir is important because it determines whether your staff or others with accounts on your Web server can set up their own public HTML directory. Set UserDir to Disabled if you don't want to allow this. Otherwise, those with accounts on your system would be able to publish on the Internet simply by creating a subdirectory by that name in their directory and putting files into it. They would then refer to them by using your Web server's address with ~theirname/ at the end. For example, if the Web server's address is <http://www.someplace.com/>, Jack Todd (with an account under the name todd) could refer to his own Web directory as <http://www.someplace.com/~todd/>. He could put as much or as little as he wanted in that directory (including subdirectories) and it would all be available under that URL, which he might put on his business card. This is happening all over the place, and you'll need to decide how much freedom you want your system users to have. Remember, this applies only to people with accounts on your system (running NCSA HTTPD server), not people browsing your Web site from elsewhere.
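To make the ~theirname/ mapping concrete, here is a minimal Python sketch of how a server might turn such a URL into a file location. It is not NCSA's code. The function name resolve_user_path and the /home base directory are assumptions made for illustration; only the public_html default comes from the discussion above.

```python
import posixpath

def resolve_user_path(url_path, user_dir="public_html", home_base="/home"):
    """Map a URL path such as /~todd/cv.html to a file under the user's
    public directory. Returns None when the path is not a user path or
    when user directories are disabled (user_dir is None)."""
    if user_dir is None or not url_path.startswith("/~"):
        return None
    # Split "/~todd/cv.html" into the account name and the remainder.
    name, _, rest = url_path[2:].partition("/")
    if not name:
        return None
    # Normalize the remainder so ".." cannot climb out of the directory.
    rest = posixpath.normpath("/" + rest).lstrip("/")
    return posixpath.join(home_base, name, user_dir, rest)

print(resolve_user_path("/~todd/"))
print(resolve_user_path("/~todd/cv.html"))
```

Passing user_dir=None plays the role of disabling the feature: every ~name request is then refused.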

The DirectoryIndex directive lets you avoid showing the contents of a directory. If an HTML file called index.html (the default name) exists in a given directory, the server will not send the directory list or index but will instead send the index.html file to the Web browser. If a file with that name is not there, the Web server will create an index of the contents of that directory and send that to the Web browser. This is important, because often you will want to limit what Web browsers see in certain directories. Once you create a simple index.html file, the other contents of that directory can be reached only by someone who follows a link to them or already knows their names. If you have no links to those files from elsewhere, users will have no way of knowing what the names of the other files are. You can add links to the index.html file when you're ready, but in the meantime it acts as a shield for that directory.

One other advantage is that you can shorten your home page URL. Instead of using <http://latino.sscnet.ucla.edu/murals/murals.html>, I can change the name of murals.html to index.html and leave it off the URL so it becomes <http://latino.sscnet.ucla.edu/murals/>, and the effect is the same. Given that URLs are long and complicated, anything we can do to shorten them is helpful.
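The index.html behavior described above can be sketched in a few lines of Python. This is an illustration of the decision only, not the NCSA implementation, and the function name respond_to_directory is hypothetical.

```python
import os

def respond_to_directory(directory, index_name="index.html"):
    """Return ("file", path) when the index file exists, otherwise
    ("listing", names) with the sorted names of the directory's entries."""
    index_path = os.path.join(directory, index_name)
    if os.path.isfile(index_path):
        # The index file shields the rest of the directory from view.
        return ("file", index_path)
    # No index file: fall back to a generated listing of everything.
    return ("listing", sorted(os.listdir(directory)))
```

The same check explains the shortened URL: a request for the directory alone and a request that names index.html explicitly end up at the same file.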

The AddType directive allows you to set certain file extensions to be treated as certain types of data. A useful example would be to make .htm an equivalent of .html to ease editing of your HTML files in DOS systems that can have only three-character extensions. In that way you can edit your files in DOS with an .htm extension and then FTP them up to your Web server without changing their names. When you have hundreds of files on your server, you'll appreciate this.
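The idea behind AddType is a lookup table from file extension to data type. The Python sketch below models that table; the names CONTENT_TYPES, add_type, and content_type_for are invented for this example, and the text/plain fallback is an assumption, not a statement about any particular server's default.

```python
import os

# A small extension-to-type table; real servers ship a much longer one.
CONTENT_TYPES = {
    ".html": "text/html",
    ".txt": "text/plain",
    ".gif": "image/gif",
}

def add_type(content_type, extension):
    """Register an extra extension, in the spirit of an AddType line."""
    CONTENT_TYPES[extension.lower()] = content_type

def content_type_for(filename, default="text/plain"):
    """Look up the type to announce for a file, based on its extension."""
    _, ext = os.path.splitext(filename)
    return CONTENT_TYPES.get(ext.lower(), default)

# Treat DOS-style .htm files exactly like .html files.
add_type("text/html", ".htm")
print(content_type_for("MURALS.HTM"))
```

Lowercasing the extension matters for files that arrive from DOS, where names are often in capitals.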

Access Control (Security). As discussed in Chapter 3 in regard to Gopher, sometimes you need to think about security and restricting access. If anyone for any reason should not see what's on your server, you'll need to know how to keep that person out. Different options are available, depending on the version of Web server you use. You should learn the options your server provides and be sure you understand them. Don't trust that the server software out of the box will be secure on your system. More advanced techniques are discussed in Chapter 7, Internet Commerce.

Here are some simple techniques for restricting access to your server or sections of your server:

  1. Restrict or allow access by domain name.
  2. Hide your server.
  3. Hide files or directories.
  4. Require a password for connection.

When restricting access by Internet domain or subdomain, you can also restrict access to subdirectories or block someone from using the entire Web server. (Check your server software to make sure it can do this.) For more information about restricting by domain, see the section on Configuring Access and Security in Chapter 3 because the process is similar.

Restricting by domain works when you have a small number of local subdomains that can be listed easily; anything else can be said to be external and shouldn't be allowed to see your server. This is appropriate when your company or campus is licensed for certain types of things, and you don't want to publish them for the world.
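As a rough illustration of restricting by domain, the Python sketch below checks a visitor's host name against a short list of local subdomains. The function name host_allowed is hypothetical, and real servers have their own configuration syntax for this, so treat it as a model of the rule and not as any server's actual behavior.

```python
def host_allowed(hostname, allowed_domains):
    """True when hostname is one of the allowed domains or a host inside one."""
    host = hostname.lower().rstrip(".")
    for domain in allowed_domains:
        domain = domain.lower().strip(".")
        # Match the domain itself or anything ending in ".domain", so a
        # look-alike such as "evilsscnet.ucla.edu" is not let in.
        if host == domain or host.endswith("." + domain):
            return True
    return False

local = ["sscnet.ucla.edu"]
print(host_allowed("latino.sscnet.ucla.edu", local))
print(host_allowed("www.someplace.com", local))
```

Comparing against "." plus the domain, instead of the bare domain, is the detail that keeps look-alike names out.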

Hiding your server means running it on a nonstandard port (other than port 80) and not advertising its host name and port number except to those who need access. Also, make sure that it's not indexed by or linked by any other publicly accessible server. Basically, this method keeps people from finding your Web server by accident.

Successfully hiding your files and directories depends on your ascertaining that a particular set of files or directories on your server has absolutely no links and that users can't browse and find them on their own. Usually, a server will give a listing, or index, of whatever contents are in a specified directory. But, as we discussed earlier, the NCSA server, for one, allows you to specify that if the file name index.html (the NCSA default) is present in that directory, the file will be used instead. This is a technique usually used to present a home page for that directory with links to the contents of that directory. But if you deliberately don't make links for the files you want to hide, no one will be able to find them in that directory unless they already know the file names. So be sure to make up unguessable file names. One professor used a random number generator to create unique file names where students could look up their grades.
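For the unguessable file names, a short Python sketch follows. It uses Python's secrets module in place of whatever random number generator the professor used (the source does not say), and the function names are made up for this example.

```python
import secrets

def unguessable_name(extension=".html", random_bytes=16):
    """Return a file name built from cryptographically strong random bytes."""
    return secrets.token_hex(random_bytes) + extension

def assign_names(student_ids):
    """Give each student a private file name unrelated to the student ID."""
    return {student: unguessable_name() for student in student_ids}

for student, filename in assign_names(["student-a", "student-b"]).items():
    print(student, filename)
```

Sixteen random bytes give names far too numerous to find by trial and error, but this remains hiding, not protection: anyone who learns the name can read the file.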

Several Web servers provide for password protection. Make sure you are clear about the level of security this offers. Ask whether the password is passed over the Internet in the clear or whether it's encoded in some manner that is relatively secure.
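To see why "encoded" is not the same as "secure," consider HTTP Basic authentication, one common password scheme. It sends the user name and password in base64, which anyone can reverse without a key. The Python sketch below assumes a server using that scheme; the helper names are hypothetical.

```python
import base64

def basic_auth_header(user, password):
    """Build the header value a browser sends for Basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")

def recover_credentials(header_value):
    """What an eavesdropper can do: reverse the encoding, no key needed."""
    scheme, _, token = header_value.partition(" ")
    if scheme != "Basic":
        raise ValueError("not a Basic authentication header")
    user, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return user, password

header = basic_auth_header("todd", "s3cret")
print(header)
print(recover_credentials(header))
```

The password does not appear as readable text in the header, yet recovering it takes one function call. That is the level of protection to ask your vendor about.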

Again, for real security look to one of the security systems being developed, such as Netscape's SSL, S-HTTP from Enterprise Integration Technologies, and the combination from Terisa Systems. As of May 1995 the W3 Organization had not yet decided on a standard means of providing secure exchanges to Web servers across the Internet.

Designing Your Web Pages

As with Gopher servers, everyone has an opinion about the best way to lay out a Web server. It is much easier to create bad Web pages than it is to look at them. Here are some sites that talk about Web design guidelines and other factors to keep in mind:

<http://www.arcade.uiowa.edu/hardin-www/jaffeDesign2.html>

<http://info.med.yale.edu/caim/StyleManual_Top.HTML>

Keep Home Pages Small

Keep your home page (or any other entry points) small. Remember that many people will connect to your home page just to see what's there, not because they really want all the information you have to offer. So don't put everything on the first page. Because small size equals speed on the Internet, it's only courteous to have a small, quickly downloaded home page that summarizes the information available with links to the larger bodies of content.

Watch Total Size of Your Pages

The more images you have, and the larger they are, the longer your page will take to load. The inline images that go with an HTML page add to the total time it will take to display, so keep them small and have as few as possible. Shrinking the text in your home page doesn't help if you have 30 inline images on that same page. Even if they're tiny, they take time to download. However, if you use inline images (of green and red dots, for example) several times in a document, most Web browsers cache, or save, them locally, so they are downloaded just once and then displayed several times.
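A little arithmetic shows how quickly inline images add up. The Python sketch below estimates best-case transfer time; the 14,400 bits-per-second figure and the sample sizes are assumptions chosen for illustration, and real transfers will be slower once connection setup and protocol overhead are counted.

```python
def download_seconds(total_bytes, bits_per_second=14_400):
    """Best-case transfer time, ignoring latency and protocol overhead."""
    return total_bytes * 8 / bits_per_second

def page_weight(html_bytes, image_sizes):
    """Total bytes for a page. List each distinct image once, because a
    caching browser fetches a repeated image only the first time."""
    return html_bytes + sum(image_sizes)

# A 4K page of text with 30 different 2K images.
heavy = page_weight(4096, [2048] * 30)
# The same page reusing a single 2K image 30 times.
light = page_weight(4096, [2048])

print(heavy, round(download_seconds(heavy), 1))
print(light, round(download_seconds(light), 1))
```

Under these assumptions the first page needs more than half a minute at modem speed, while the second needs only a few seconds.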

Use Thumbnail Images as Teasers

Offer "thumbnail" graphic images that link to larger images on a separate page. Putting several of these on one page gives a catalog effect. This allows your users to choose the images they really want to take the time to download and see. Again, this is in the interest of speed for the user. It will also pay off in a lighter load on your server.

Offer Text-Only Views

Providing an alternative on your home page for text-only viewers is essential for the many people who use a text-mode Web browser. In many cases they have no other option, because the equipment they're using or their connection to the Internet will not support graphics. If you don't plan for them, you are effectively making your site useless to this set of users.

Text-only views also help those who have graphical Web browsers but turn off image loading so that they can move through Web sites quickly and explore what is available. When they find something they want, they turn image loading back on. This is particularly important internationally, because Internet connections and throughput vary widely from country to country.

Text views of your site are important to another group of Web users: the blind or vision impaired. They often use text-to-speech software, which reads the text on screen aloud.

Have a Unifying Element or Graphic

Users can easily become confused as they move through link after link of your server. Repeating a small graphic or piece of text at the top of all your pages will remind them where they are. If you use a graphic, be sure it's small and that you give a text description for it using the ALT (alternate) attribute of the HTML IMG tag. In this way users of text-only browsers get your text description of the image instead of just a line that says image.

Provide "Sets" of Pages for Downloading

Consider offering a way for users to download all the linked files and images pertaining to a certain subject or theme in your Web server, if you think they'll want the whole set. For example, finding an online manual for the NCSA Web server is nice, but sometimes it's more convenient for users to get it all at once so that they don't have to worry that the connection to NCSA will be down just when they need to look at the manual. (This happened to me for days while first trying to install the NCSA HTTPD Web server.) Group the relevant Web pages and images, and compress them into one file that users can download easily and view locally. If you do this, be sure that you post the information in a prominent place.
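Grouping a set of pages into one downloadable file is easy to automate. The Python sketch below packs the HTML and GIF files under one directory into a ZIP archive. The ZIP format, the function name bundle_pages, and the list of extensions are choices made for this example; a compressed tar file would serve as well.

```python
import os
import zipfile

def bundle_pages(source_dir, archive_path, extensions=(".html", ".htm", ".gif")):
    """Compress the matching files under source_dir into one ZIP archive,
    keeping relative paths so links between pages still work when unpacked."""
    names = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _, files in os.walk(source_dir):
            for name in sorted(files):
                if name.lower().endswith(extensions):
                    full = os.path.join(folder, name)
                    relative = os.path.relpath(full, source_dir)
                    archive.write(full, relative)
                    names.append(relative)
    return sorted(names)
```

Write the archive somewhere outside source_dir, or at least give it an extension that is not in the list, so that it does not try to include itself.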

Provide Print Versions

Users find PostScript or PDF (Portable Document Format) documents useful alternatives to your Web pages. A PostScript, PDF, or Microsoft Word version of your information can provide even better document formatting than is possible with HTML. In certain cases considering your Web server as a distribution method for these print versions of your information may be advantageous. Referring to information online is not always practical. For example, printed versions of instruction manuals and documentation are often much more useful than screen versions. Also, remember that many people have less time online than they would like, and they may prefer to download the information you have to offer in order to read and study it at their leisure.

Divide Your Information into Chunks

Wherever possible, divide your information into chunks that can be conveyed in one- to two-screen pages. This allows people to get the gist of your offering and move on quickly. If you find your documents are taking more than a screen or two, you should think about splitting them into smaller sections.

Offer Alternatives (ALTS) for Inline Images

Take advantage of the ALT attribute of the HTML IMG tag to provide text descriptions of your inline images for the users who can't view them. Otherwise, text-only viewers see the word image in brackets wherever an inline image appears in your document. With ALT you can be much more descriptive: [flag], [arrow], [picture of Abraham Lincoln].
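Checking pages for inline images that lack ALT text can be automated. Here is a minimal Python sketch that uses the standard html.parser module; the class and function names are invented for this example, and it treats an empty ALT value as missing, which is a choice and not a rule of HTML.

```python
from html.parser import HTMLParser

class MissingAltFinder(HTMLParser):
    """Collect the SRC of every IMG tag that has no ALT text."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # The parser reports tag and attribute names in lowercase.
        if tag != "img":
            return
        attributes = dict(attrs)
        if not attributes.get("alt"):
            self.missing.append(attributes.get("src", "(no src)"))

def images_missing_alt(html_text):
    """Return the SRC values of images that need a text description."""
    finder = MissingAltFinder()
    finder.feed(html_text)
    finder.close()
    return finder.missing

page = '<IMG SRC="flag.gif" ALT="flag"><P><IMG SRC="dot.gif">'
print(images_missing_alt(page))
```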

Explain Your Server on Your Front Page

Although not everyone will enter your Web server through your front page, many will. Make life easier for everyone by clearly and prominently explaining the purpose of your Web server. If your server is intended to accommodate only students on your campus, make this clear so that others won't waste their time.

Sign and Date Every Document

The World-Wide Web is new and exciting, but time does pass and people do need to know when documents were written and last updated. You should also sign them with an e-mail address so that people can contact you if they find errors.

Avoid "Click Here" Links

Make each link descriptive. Remember that your links may well be indexed by programs that wander the Web collecting links. If the text of your links uses the word here (as in "Click here for tip on investing"), the indexer will pick up a bunch of heres and nothing useful. You can easily write that link as "See tip on investing."
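A small script can point out links whose text tells an indexer nothing. The Python sketch below collects each link's visible text and flags the vague ones; the word list VAGUE_LINK_TEXT and all the names are assumptions for illustration, so adjust the list to suit your own pages.

```python
from html.parser import HTMLParser

VAGUE_LINK_TEXT = {"here", "click here", "this", "link"}

class LinkTextCollector(HTMLParser):
    """Record (href, visible text) for every anchor that has an HREF."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href, self._text = href, []

    def handle_data(self, data):
        # Only text that appears inside an open anchor is kept.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None

def vague_links(html_text):
    """Return the links whose entire text is a vague word or phrase."""
    collector = LinkTextCollector()
    collector.feed(html_text)
    collector.close()
    return [(href, text) for href, text in collector.links
            if text.lower() in VAGUE_LINK_TEXT]
```

Run against a page that contains both versions of the investing link, it reports only the one whose text is "here."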

Map Your Server

Although it's not always possible or necessary to map every link or even every section of your server, users find maps useful. See Figure 4-1 for an example of how one company used a map of its Web site to provide access to any page on its Web server with three clicks, at most, of the mouse. The company also offers an organizational chart of its site that both text-only and graphics browsers can use.

Ask for Feedback

Make it easy for users to give you feedback. Build a comment form or supply a "Mail to" link, and be sure to list your e-mail address in case the first two don't work. First, it makes good sense to ask your users for comment. They'll often surprise you with their insights. Second, the Internet is an interactive medium. You'll accomplish much more if you take advantage of that.

Use Directional Links

Be sure that you put a link back to your home page on all your other pages. Remember that some users will land in the middle of your Web pages after doing an Internet search, and unless you give them a way to get to your home page, they may just give up.

Warn of Large Files

When your links hook up to large files (more than 50K), let your readers know in advance, ideally by specifying the file size. That way they'll think twice before downloading large files. When a user becomes frustrated and cancels because a download of a large file is slow, neither of you gains. Warnings keep users aware and might save a little of your server's workload.
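Adding the size to a link is simple enough to script when you generate pages. The Python sketch below is one way to do it; the function names are hypothetical, and the 50K threshold is the figure suggested above.

```python
LARGE_FILE_BYTES = 50 * 1024  # the 50K threshold suggested above

def size_label(num_bytes):
    """Format a byte count the way a reader would expect to see it."""
    if num_bytes < 1024:
        return f"{num_bytes} bytes"
    if num_bytes < 1024 * 1024:
        return f"{round(num_bytes / 1024)}K"
    return f"{num_bytes / (1024 * 1024):.1f}MB"

def link_with_warning(href, text, num_bytes):
    """Return an HTML link, with the size appended when the file is large."""
    link = f'<A HREF="{href}">{text}</A>'
    if num_bytes > LARGE_FILE_BYTES:
        link += f" ({size_label(num_bytes)})"
    return link

print(link_with_warning("map.gif", "Site map", 204800))
print(link_with_warning("dot.gif", "Dot", 900))
```

In practice you would take num_bytes from the file itself, for example with os.path.getsize, so the label never goes out of date.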

Adding Information and Files

Several methods are available for adding items to your Web server. I'll list them and then discuss them in greater detail:

  1. Type your text directly into an HTML editor or word-processing program that outputs HTML.
  2. Convert existing text with filters or conversion programs.
  3. Keep your data in a database and generate database reports in HTML format.
  4. Add links from your reading and explorations.

Typing text directly into an HTML editor gives you control over every step of the process but can be quite time-consuming. Also, HTML may not be a long-term solution for data storage and management. SGML is often cited as a better document source language because of its flexibility and detailed structural labeling. Although you might not be able to imagine this now, eventually you may find that you need greater detail in accessing pieces of documents. Saving in SGML and converting to HTML is one way to be sure you're ready for that step. And if SGML is ever replaced, its structured format will allow for easy conversion.

Converting word-processing files to HTML is already possible for WordPerfect and Microsoft Word for Windows (versions 2.0 and 6.0). RTF-to-HTML converters as well as several others already are available. And Microsoft and Novell are both offering free add-in HTML editors for their word-processing programs (see Table 4-2). You can edit your text normally and have HTML too. Although these techniques appear to be ideal, be sure to check whether the conversion process is restricted in any way. What happens to images, tables, and equations? Are they converted correctly? Don't assume. Check the results. To quote an Arabic saying, "Trust in God, but tie your camel."

When you have a large amount of information, creating a database in which to store it may be a good idea. Or you might already have a database that you want to publish on the Internet. One scheme, used by Ian M. Sims, computing coordinator for the business faculty at Edith Cowan University in Australia, is to combine a naming convention with a database to create individual HTML documents for each record in the database. He does this for course offerings. The naming convention comes in when he decides to give each course a three-character code and a number. For example, Economics 101 might be eco1. The information record for that course is written to a file named eco1.html. Any summary report can look up each course's code and create hyperlinks for each one.
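The naming-convention scheme can be sketched in Python as follows. This is not Mr. Sims's program: the record fields (code, title, description), the function names, and the page layout are all assumptions for illustration. The only thing taken from the description above is that each record is written to a file named after its course code.

```python
import html
import os

def write_course_pages(courses, out_dir):
    """Write one <code>.html file per record and return the file names."""
    written = []
    for course in courses:
        filename = course["code"] + ".html"
        page = (
            "<HTML><HEAD><TITLE>{title}</TITLE></HEAD>\n"
            "<BODY><H1>{title}</H1>\n<P>{description}</P></BODY></HTML>\n"
        ).format(title=html.escape(course["title"]),
                 description=html.escape(course["description"]))
        with open(os.path.join(out_dir, filename), "w", encoding="utf-8") as f:
            f.write(page)
        written.append(filename)
    return written

def summary_page(courses):
    """Build a summary list whose links follow the same naming convention."""
    items = "\n".join(
        '<LI><A HREF="{code}.html">{title}</A>'.format(
            code=course["code"], title=html.escape(course["title"]))
        for course in courses)
    return "<UL>\n" + items + "\n</UL>\n"
```

Because the summary and the individual pages derive their file names from the same code, the links stay correct no matter how often the pages are regenerated.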

Finally, as you explore the Internet and in your reading, you will come across links or URLs to all sorts of useful or interesting locations. Explore the bookmark capability of your browser so that you become proficient at organizing and copying the links that you save there, a relatively painless way to add to your server's resources. Your users will appreciate ancillary or related information, or links to services offering similar information. Internet policy isn't clear on this, but it's probably best to tell the administrator of any site to which you have provided a link. If nothing else, it's courteous, and it increases the likelihood that you'll be informed of any changes.

Be Careful of Embedded Codes

It is tempting to take your existing text files and online documents and add them directly to your Web server. But remember that HTML interprets certain normal text characters as special codes. I once came across a situation in which a vendor's Web site examples used the characters < and > in the text. Neither Web browser to which I had access at the time (Lynx and Mosaic) displayed them correctly: the less than and greater than signs were interpreted as HTML codes, and because they did not adhere to HTML coding requirements, they caused the loss of some text. Needless to say, the vendor's examples were mangled and highly confusing. I could see the full example only by looking at the HTML source. Taking the extra step of scanning all documents for HTML special characters (<, >, and &) before loading them is important.
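One way to handle the special characters is to convert them before the file goes onto the server. The Python sketch below uses the standard html.escape function; wrapping the result in PRE tags is an assumption made here to preserve the line breaks of the original text file, and the function name text_to_html is hypothetical.

```python
import html

def text_to_html(plain_text):
    """Escape &, <, and > and wrap the text so a browser shows it literally."""
    return "<PRE>\n" + html.escape(plain_text, quote=False) + "\n</PRE>\n"

print(text_to_html("if a < b && b > c"))
```

The ampersand has to be converted along with the two angle brackets, because it is the character that introduces the replacement codes themselves.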

Web Tools

Many special tools, both free and commercial (even more than for Gopher--see Table 4-3), have been developed to help Web administrators manage their servers. Some tools are for dealing with graphic images--resizing, converting, or altering them in some way. Some tools are for server administration, to help analyze log files and check links. Other tools provide you with the capability of, for example, creating clickable maps or realtime GIF or HTML documents. Realtime in this case means a special image or document that might be always changing like the Southern California Traffic Report <http://www.scubed.com/caltrans/> or documents created on the fly or in reaction to input from a user. Look online for the latest tools, because they are always changing.

Advanced Web Techniques

Once you've got your Web server up and running, and you've tired of the thrill of seeing your HTML pages on screen, you'll want to try out some of the more advanced capabilities of your Web server. Some require you to configure your server in special ways. Others involve writing or using scripts to run with the server. Other techniques have more to do with image or document preparation. Eventually, you'll find them all useful.

Images

Images present a wealth of possibilities and problems. The main problem is to get an image that will show up well without being too large a file. Also, color quality varies considerably according to whether the image was scanned in 8-, 16-, or 24-bit color and viewed on Macintosh, Windows, or UNIX workstations.

Postage Stamp or Thumbnail Images

Large image files take time to transfer, so many sites make thumbnail or postage stamp-sized reductions of their image files to give users a taste of the images available. Most graphics programs can create these by shrinking images, although they often lose a lot of detail. Plan for the smaller versions to take up 2K to 10K instead of the 20K to 200K a larger image file might take. Although the postage stamp images are stored as GIF files so they can be used as inline images, the original full-size image could be in GIF or JPEG.

Transparent GIFs

Making the background of a GIF image transparent so that it takes the color of whatever background the browser uses can be a nice aesthetic touch. One site uses this technique to begin its files with a large initial in an archaic font. The giftrans program can do this on UNIX and DOS, and a program aptly called Transparency can do this on Macintoshes. See Table 4-3 for these and other such programs. Not all browsers display these images correctly, however, so test your images on many browsers.

Clickable Images

A clickable image is a graphic image on a Web page that has certain "sensitive" areas that, if clicked with a mouse, link to some other page or program. Basically, clickable images work in conjunction with CGI scripts on the server. When the user clicks on part of the image, its x-y coordinates are sent back to the WWW server, where a CGI script or program processes them and performs some specific action, depending on which section of the image the user selected. The uses of clickable images are limited only by your imagination.

Here's a short list of examples:

To create a clickable image, decide first which areas of the image should lead to which actions. These actions could be as simple as displaying selected documents or images or running programs on your server. The NCSA version of clickable images gives you a choice of points, squares, rectangles, circles, and polygons with as many as 100 vertices. Points would be difficult for the user to click accurately, so they use a "closest to" rule--the point that is closest to the coordinates clicked will respond. Then mark these areas, and record in a map file the x-y coordinates and the URL to the appropriate image, document, or script for each area. The top left corner is 0,0--heading down increases the value of the y-axis, and moving to the right is positive for the x-axis. You also could use a default URL in case the users do not pick any of the zones you've defined. Then you make a link that points to your map file by using the IMG SRC code in HTML. The Macintosh and Windows HTTP servers are similar, but CERN uses a different approach.
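To make the server's side of this concrete, here is a rough sketch in Python of how a click might be resolved to a URL. It is not the NCSA imagemap program, and the zone list is a made-up Python structure rather than NCSA's map file syntax; the URLs and coordinates are hypothetical. It assumes that a rectangle or circle containing the click wins, that otherwise the closest point responds, and that the default URL is used only when no zone applies.

```python
import math

# Hypothetical zones: (shape, url, coordinates). This is an illustrative
# structure, not the syntax of a real NCSA map file.
ZONES = [
    ("rect", "/maps/north.html", (0, 0, 200, 100)),    # x1, y1, x2, y2
    ("circle", "/maps/city.html", (300, 150, 40)),     # center x, center y, radius
    ("point", "/maps/station-a.html", (50, 250)),      # x, y
    ("point", "/maps/station-b.html", (350, 250)),
]
DEFAULT_URL = "/maps/help.html"

def resolve_click(x, y, zones=ZONES, default=DEFAULT_URL):
    """Return the URL for a click at (x, y); 0,0 is the top left corner."""
    nearest_url, nearest_dist = None, math.inf
    for shape, url, c in zones:
        if shape == "rect":
            x1, y1, x2, y2 = c
            if x1 <= x <= x2 and y1 <= y <= y2:
                return url
        elif shape == "circle":
            cx, cy, radius = c
            if math.hypot(x - cx, y - cy) <= radius:
                return url
        elif shape == "point":
            # "Closest to" rule: remember the nearest point seen so far.
            dist = math.hypot(x - c[0], y - c[1])
            if dist < nearest_dist:
                nearest_url, nearest_dist = url, dist
    # No area contained the click: fall back to the nearest point, if any.
    return nearest_url if nearest_url is not None else default
```

Note that under these assumptions, once a map defines any points, every click outside the other areas lands on the nearest point, so the default URL is never reached.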

The major drawback of the current method of providing clickable images is that users get no immediate feedback as they move the mouse over the image. They have to wait until the x-y coordinates are sent to the server and it looks up the nearest zone. Because the Internet isn't instantaneous, clickable images delay an otherwise interesting process. In development are several alternatives for image mapping that allow the browser to do more of the work and provide immediate feedback. Because the browser program is considered a Web client, this is called client-side image mapping.

Interactive Image Format (IIF)

Interactive Image Format (IIF) is a form of client-side clickable image developed at the University of Michigan by the Weather Underground (a weather service, not the 1960s radicals). IIF originally was written to allow Gopher servers to provide special weather maps via the Blue Skies Gopher client (browser). As you move your mouse across the map, text in a box at the top tells you which weather-reporting station is nearby. If you then click the mouse, you are connected to that server and you receive a weather report on screen. The University of Michigan is exploring this technology with a grant from the National Science Foundation. Blue Skies Gopher clients are available for Macintosh and Windows. <http://cirrus.sprl.umich.edu/>

Creating GIF Images on the Fly

"On the fly" means that your server creates the GIF image only after the user requests it. The image might be a special graphic representing flood damage in an area, a zoom-lens view of a map, a spinning globe, a graph of rising (or falling) profits, or the amount of coffee left in a pot. Imagizer is one software program that can do this (see Table 4-3).

CGI Scripts

CGI scripts are behind many of the most interesting applications on the Web. Whenever the Web server does something special (as opposed to simply providing a file), it does so because of a CGI script. CGI, or the Common Gateway Interface, is a standard way for external programs to interact with Web and other information servers. What this means is that CGI is your main tool for adding special functions to your Web server. Some functions, like gateways to Archie, Finger (a user information lookup command; see Chapter 6), and WAIS have already been added by CGI scripts or programs that come with your WWW server software. Others can be created by altering existing programs or writing new ones from scratch.

The Common Gateway Interface designates how programs will interact with Web servers behind the scenes. For example, a commonly used CGI script takes the information from an online comment form and e-mails it to the Web administrator. It does this by taking the information submitted on the form and using it to fill out a mail message, which it sends.
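The comment-form example can be sketched in a few lines of Python. This is only an illustration of the idea, not one of the scripts supplied with any server; the addresses, subject line, and function name are hypothetical. It takes the encoded form data that the server hands to the script, decodes it, and fills out a mail message. Actually reading the data from the server and sending the message are left out.

```python
from email.message import EmailMessage
from urllib.parse import parse_qs

def form_to_message(form_data, to_addr="webmaster@example.org"):
    """Turn encoded form data (name=value&name=value...) into a mail message."""
    fields = parse_qs(form_data, keep_blank_values=True)
    msg = EmailMessage()
    msg["To"] = to_addr                      # hypothetical administrator address
    msg["From"] = "www-form@example.org"     # hypothetical sender address
    msg["Subject"] = "Web comment form"
    # One "field: answer" line per form field, in alphabetical order.
    lines = [f"{name}: {', '.join(values)}"
             for name, values in sorted(fields.items())]
    msg.set_content("\n".join(lines))
    return msg

# Example: data as it might arrive from a two-field comment form
message = form_to_message("name=Pat+Lee&comment=Nice+site%21")
```

A real script would then pass the message to the local mail system and return a short HTML page thanking the user.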

CGI programs can be UNIX shell scripts (a series of commands in a file, like DOS batch files), Perl scripts, or programs written in C, C++, or other programming languages. The important thing is that they interact with the Web server according to the rules of the CGI specification <http://hoohoo.ncsa.uiuc.edu/cgi/interface.html>. The link to NCSA's collection of CGI examples and overviews is <http://hoohoo.ncsa.uiuc.edu/cgi/>; one link to a CGI primer is <http://hoohoo.ncsa.uiuc.edu/cgi/primer.html>.

NCSA-Supplied CGI Scripts

The simplest way to start using and writing CGI scripts is to look at the CGI scripts and example programs that came with your server or those mentioned here. For instance, the NCSA HTTPD Web server comes with the following sample CGI scripts and programs. Use or modify these to meet your needs, or write your own, using these as examples:

Other CGI script tricks include adding access counts to your documents, which allows your users to see how many times a particular document has been accessed. Chuck Musciano of Harris Corporation explains how to do this and gives away a C program for the NCSA HTTPD server that counts accesses and displays a message saying you are visitor number such and such. It keeps track of all efforts to access the documents to which it is linked. <http://melmac.corp.harris.com/access_counts.html>
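The core of an access counter is small. The sketch below is in Python and is not Chuck Musciano's C program; the file name and function names are assumptions for the example. It keeps the count in a plain text file and adds one on every call.

```python
from pathlib import Path

def bump_counter(counter_file):
    """Add one to the count stored in counter_file and return the new count."""
    path = Path(counter_file)
    try:
        count = int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        count = 0            # first access, or an unreadable counter file
    count += 1
    path.write_text(str(count))
    return count

def visitor_message(count):
    """Return the line to display in the document."""
    return f"You are visitor number {count}."
```

This version does no file locking, so two requests arriving at the same moment could both read the same number. A counter on a busy server would need to guard against that.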

Server-Side Includes

Server-side includes are a feature of NCSA's HTTPD server that permits a server to automatically include information from files and environment variables in an HTML file as it is being sent to the user. It can include other information, such as date and document size, if you wish. One useful application is to have your HTML documents automatically show the date they were last updated, without you or anyone else modifying the date in the file. This means you won't have to remember to change the date each time you change a file. Server-side includes have many other uses, but be sure that you look for any warnings or advisories about them, because in some situations they may pose a security threat. They also add a load to the server and can slow it down. Each time a document goes out, the server has to check for and process any included scripts.
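To see why includes add to the server's load, it helps to picture the extra pass the server makes over each document. The toy below is written in Python and is only a sketch: a real server does this work itself, and this handles just two directives written in the style of NCSA's, with a made-up function name and date format. Every outgoing page has to be searched for directives, and each one found has to be replaced.

```python
import re
import time

# Matches directives of the form <!--#command attribute="value" -->
DIRECTIVE = re.compile(r'<!--#(\w+)\s+(\w+)="([^"]*)"\s*-->')

def expand(html_text, last_modified, include_files):
    """Expand echo and include directives in html_text.

    last_modified is the document's modification time in seconds;
    include_files maps file names to their contents (a stand-in for disk).
    """
    def replace(match):
        command, attr, value = match.groups()
        if command == "echo" and attr == "var" and value == "LAST_MODIFIED":
            return time.strftime("%Y-%m-%d", time.gmtime(last_modified))
        if command == "include" and attr == "file":
            return include_files.get(value, "[an error occurred]")
        return match.group(0)    # leave anything unrecognized untouched
    return DIRECTIVE.sub(replace, html_text)
```

With a directive such as <!--#echo var="LAST_MODIFIED" --> in the page, the date shown always follows the file itself, which is the convenience described above.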

Forms

Online forms are among the most useful and appreciated features of Web servers. If the Web provided no way for users to give feedback, the Web would not be what it is today. Forms are used for everything from comments to subscriptions to entering HTML pages for validation to ordering pizzas.

Basically, forms consist of two parts, the HTML description of the form and the CGI program on the server that will do something with the answers to the form's questions. This might be as simple as taking the information and mailing it to the Web administrator. Or it might mean reporting the number of users since the server went online. Whatever the application, it's always a combination of a form written in HTML and a CGI script or program.

HTML 2.0 forms can be created from the following pieces:

INPUT specifies a field that the user fills out. You control the maximum length of the text the user can enter as well as the size of the onscreen space. You can also specify a default value to sit in the field for the user to see. (This lets you put standard answers in for the convenience of your users.) Input fields can be of the following types:

SELECT allows the user to select from a list of headings or options. If you add the MULTIPLE attribute, the user can make more than one selection. You can and should include a SIZE attribute when the list is longer than what a single screen can display. I've seen one Web server with a SELECT option of several hundred items, but because SIZE wasn't specified, users of some browsers could choose only those options that appeared on the first screen because the browser didn't scroll automatically. Once this server was set for SIZE (to 17 in this case), the browser started scrolling through the entire list, showing 17 items at a time.

TEXTAREA lets users enter more than one line of text. ROWS and COLS show the visible dimensions of the field in characters. Browsers are supposed to allow scrolling beyond these limits so that users can enter longer text as necessary.

Almost all the fields also have a name attached to them so that your CGI script can easily distinguish between them. The user doesn't see these field names, but they are essential nonetheless. Each answer the user gives is paired with its variable name, and these sets are sent back to the Web server. Without the NAME attribute, the CGI script that the server uses would not know which answer went with which variable.
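Here is a small Python illustration of how the name and answer pairs travel. The field names and answers are hypothetical. The browser encodes each pair as name=value and joins the pairs with ampersands; the script on the server decodes them and looks up each answer by its field name.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical answers from a form; "interests" is a SELECT MULTIPLE field,
# so its name appears once for each option chosen.
answers = [
    ("fullname", "Pat Lee"),
    ("department", "Art History"),
    ("interests", "Gopher"),
    ("interests", "WWW"),
]

# What the browser sends: spaces become "+", pairs are joined with "&".
encoded = urlencode(answers)

# What the CGI script recovers: each name mapped to a list of its answers.
decoded = parse_qs(encoded)
```

Because a name can appear more than once, the decoded result keeps a list of answers for every field, even those with a single value.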

Several books on HTML and many online resources, including one from NCSA, <http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/fill-out-forms/overview.html>, provide, among other things, examples of forms, complete with explanations. At the end of its document NCSA is kind enough to offer a test server that you can use to test any forms that you develop. The script simply echoes back to you any of the fields that it receives, so you can see whether they work as you expected.

Don't forget that you can always look at the source (HTML) of any form you come across on the World-Wide Web. This is one way to learn new form techniques. Some examples of forms in use on the Web include a form for adding home pages and a form that forwards e-mail.

Figure 4-2 shows part of a form used by the Web site HUMnet (Humanities Computing) at UCLA to allow students, staff, and faculty to add their own home pages to the humanities Web server.

Figure 4-3 shows a portion of the HTML that created the form that appears in Figure 4-2. Note all the departments listed as options and how Art History is listed as the default option by using <option selected> (see Figure 4-3).

The E-mail Forwarding Request Form in Figure 4-4 also was designed and is used by HUMnet at UCLA. It takes information that enables it to forward e-mail from users. Note that HUMnet also uses this form to communicate the rules and establish the information required for e-mail forwarding on its system. The portion shown here uses checkboxes and radio buttons as well as the TEXTAREA field for a long text entry. Figure 4-5 shows how that form was written in HTML.

Web Indexing

One criticism of the WWW is that it is not nearly as easy to index as Gopher servers. If you want to index all documents and links in a Web server, you have to actually scan through all the HTML documents, looking for the title of that document, as well as all the HTML links. And when you do find a link, what do you index--the link's URL or the link's descriptive text (see Figure 4-6)?

The link URL is great for getting you somewhere, but isn't always useful for describing what it's linking to. The descriptive text is the logical thing to index, but it might not always be available. Servers commonly use graphic images as links instead of descriptive text. And as we just saw, clickable images can link to many different files or locations, without having any text attached.
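The scanning step itself is not hard to write. The following Python sketch uses the standard html.parser module and is not taken from any of the indexing programs discussed in this chapter; the class and function names are my own. It pulls out a document's title and, for every link, both the URL and the descriptive text, falling back on an image's ALT text when the link is a picture.

```python
from html.parser import HTMLParser

class LinkIndexer(HTMLParser):
    """Collect the title and (URL, descriptive text) pairs from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False
        self._href = None      # URL of the link currently open, if any
        self._text = []        # pieces of that link's descriptive text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)    # tag and attribute names arrive in lowercase
        if tag == "title":
            self._in_title = True
        elif tag == "a" and "href" in attrs:
            self._href = attrs["href"]
            self._text = []
        elif tag == "img" and self._href is not None:
            # An image used as a link: ALT text is the only description.
            self._text.append(attrs.get("alt") or "")

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None

def index_page(html_text):
    """Return (title, links) for one HTML document."""
    parser = LinkIndexer()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), parser.links
```

An image link with no ALT text comes back with an empty description, which is exactly the case where an indexer has nothing but the URL to work with.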

Nonetheless, the reasons for indexing Web servers are the same as for Gopher servers: so your users can find out what you have (and don't have) quickly and with minimal frustration and so your server is efficient in providing its information. Next I'll describe several different indexing systems, some of which are built into particular WWW servers. The important thing to keep track of is what is being indexed--file names and links, document titles, the full text of all documents, or an abstract written by the Web administrator. Different indexing programs or systems have different philosophies. Look at them all, but be sure that when you choose one, you know exactly what it is designed to index and what its advantages (and disadvantages) are.

Internal Indexers

I call do-it-yourself indexing internal indexing. Servers and add-on programs that are available for indexing your Web server yourself include:

External Indexers (Web Crawlers, Spiders, and Worms)

External indexers are programs that reside on someone else's machine but index all servers. Often called Web crawlers, spiders, worms, and robots, they are programs that attempt to traverse all the known menus or links in Gopher and Web servers in order to build comprehensive indexes of Gopher or Web space. Gopher is simpler to index, and Veronica and Jughead take care of that. Because WWW is much harder to index, there are many approaches to indexing or cataloging the exploding World-Wide Web. It is important that you understand the distinctions among them so that you can make sure that your server and its documents are indexed in the way that's most appropriate for them. If you want to advertise your Web server, register it with every external index, or subject server, you can find. This is where your notes from your own searches will come in handy.

These programs require your cooperation to avoid excessive efforts on their part and to limit the load on your server. Basically, some information on your server may not be of interest to outside users. Or you may not want it indexed for some reason, either to limit access or because the information (such as Usenet News links) changes too quickly. Some protocols have been developed for dealing with information that should not be indexed. Certain hidden files can be placed in specified locations on your server that tell the indexing programs which sections to index and which to ignore.
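One such convention is the robots exclusion file, a plain text file named robots.txt kept at the top level of the server. The Python sketch below shows how a well-behaved robot might consult it, using the standard urllib.robotparser module. The rules, host name, and robot name are hypothetical, and the helper function is my own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical exclusion rules: keep all robots out of the fast-changing
# news links and the script directory.
RULES = """\
User-agent: *
Disallow: /news/
Disallow: /cgi-bin/
"""

def allowed(url, rules=RULES, agent="ExampleSpider"):
    """Return True if the rules let the named robot fetch the URL."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)
```

The file is only a request. It limits the load from robots that choose to honor it and does nothing to restrict access by ordinary users, so it is no substitute for real access controls.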

The link to Martijn Koster's list of Web Wanderers, a good list of Web robots and spiders, is <http://web.nexor.co.uk/mak/doc/robots/active.html>. Check it out, and look at some other places that can index your Web site.

ALIWEB

ALIWEB asks Webmasters to summarize in an index file the various themes or subjects covered in their WWW server. When you register with ALIWEB <http://web.nexor.co.uk/aliweb/doc/aliweb.html>, it looks for this file and checks its structure; this index file is picked up by the ALIWEB harvester at regular intervals. The index file is a plain ASCII file in a special format left in a special place on the Web server. The protocol even allows Web administrators to specify how often their server should be checked for changes. The ALIWEB database is also incorporated in the CUI W3 catalog.

GENBBB

GENBBB allows you to add WWW titles, reports, and links to other virtual libraries via an online WWW form. <http://www.cs.colorado.edu/homes/mcbryan/public_html/bb/summary.html>

EINet Galaxy

EINet Galaxy lets you add your own annotations and links to its catalog. <http://www.einet.net/galaxy.html>

Lycos

Register your Gopher, WWW, or FTP site with Lycos at <http://lycos.cs.cmu.edu/lycos-register.html>. It takes a few days to a week for newly registered items to be added to the index. Lycos ignores redirected pages, so you must register the actual server name and path.

World-Wide Web Virtual Library

This is a directory or catalog of the Internet by subject. Mail information about your Web server to the maintainers of the specified subject or to www-request@mail.w3.org. You might also offer to contribute to administration of a subject area to share your expertise. <http://www.w3.org/hypertext/DataSources/bySubject/Overview.html>

Advertising and Registering Your Web Server

Putting something on the Web doesn't necessarily publicize it. Unless you want to have a little-known Web server, you'll want to register it and advertise it as much as possible. I'll give you some ways to do that, but check online for additional services and sites. One drawback to the Internet is that there is no guaranteed way to intensively advertise your site among the tens of thousands of other Web sites available for browsing.

Try these techniques:

  1. Register your server with W3O by following the instructions given at <http://www.w3.org/hypertext/DataSources/www/Geographical_generation/hew-servers.html/>.
  2. Post information about your server to the Usenet newsgroup comp.infosystems.www.announce.
  3. Send e-mail to the list server www-request@www.w3.org, which should get your server on the W3O World-Wide Web servers' list. This list is sorted alphabetically by continent, country, and state.
  4. Send an announcement to whats-new@ncsa.uiuc.edu, written in HTML format and in the third person, to get your announcement on NCSA Mosaic's "What's New" page.
  5. Put information about your server in your .signature file (an e-mail feature on many systems that lets you automatically place your organization, title, and other data at the end of every message), on your business cards, brochures, and such, and spread the news via word of mouth and e-mail.
  6. Find the relevant subject lists and guides and make sure they know about your server.

Summary

WWW is a method of publishing on the Internet that is generating tremendous interest and enthusiasm among users. It combines text, graphics, and hypertext links using HTML in a way that is appealing and easy to navigate. The capacity for formatting documents is deceptive, however, because different WWW browser programs display the same HTML document differently. A certain percentage of users is limited to text-only browsers (Lynx) that do not show images. Web designers have to take these variations into account. Meanwhile, HTML is being improved, with new features added constantly.

WWW (technically, HTTPD) servers are available in both commercial and free versions for most types of computer systems. Although UNIX has been the operating system of choice for development and for most WWW servers, that is changing as more and more individuals set up servers on their desktop computers--PCs running Windows or Windows NT, or Macintoshes. Web managers find these machines friendlier than UNIX, and they perform well for small and medium sites. Another option is to take a 486 or Macintosh and run it with a UNIX operating system, which can make it perform much faster and more efficiently as a Web (or Gopher) server.

Writing HTML is getting easier with the new tools (both HTML editors and word-processing add-ons) that are being developed. Be sure to check your HTML for conformance to HTML rules and standards; several online sites do this for free.

Design your Web site so that its contents and purpose are clear to even the most casual user. Organize your material and be consistent. Where possible, index the contents with WAIS, Glimpse, ICE, or some other indexing program to allow users to quickly search for material. Keep the sizes of pages and graphics down, particularly in your most frequently used pages. When you can't avoid having large documents or files, label the links with the size of the file so that users are forewarned.

Considerable programming effort and creativity are being applied to making Web servers do more than just provide documents and files on demand. CGI scripts are programs that can be added to most Web servers to perform special functions such as searches, updating databases, performing commercial transactions, and providing interactive online forms. Sun's Hot Java is adding an actual programming language that will be "understood" by any Hot Java-compatible Web browser (see Chapter 11).

With more than 25,000 WWW servers on the Internet, and the number increasing all the time, it is extremely important to register and advertise your Web site in as many places as possible. Indexing the contents of Web servers was originally thought to be one of the weak points of the WWW. Now, with robots and spiders (like Lycos from Carnegie Mellon University) that traverse Webspace to add to their indexes of Web documents, the situation has improved dramatically.

