WAIS, pronounced ways, stands for Wide Area Information Server, an animal quite different from Gopher or WWW. Whereas Gopher and WWW are systems that help users to look at documents on various servers with ease, and indexing was added as an afterthought, WAIS was designed from the beginning to retrieve information from multiple indexed document sources. Its designers recognized that those documents might well be stored in many different places. The emphasis on searching and the ability to query many different data sources simultaneously are why WAIS is often the indexer (or search engine) running in the background for Gopher and WWW.
A key point about WAIS is that your users do not have to use special WAIS clients, because you can use WAIS indexes with a Gopher server or a WWW-to-WAIS gateway. Most WAIS indexes are accessed via Gopher or WWW clients.
To a greater extent than the various versions of Gopher and WWW, WAIS is split between a commercial version, offered by WAIS, Inc., and freeware versions supported by various organizations and individuals. The commercial version, WAISserver from WAIS, Inc., costs $15,000, and there has been no intermediate version available at a lower price. That's why alternative indexing systems (such as Glimpse and ICE) have gained popularity.
WAIS is one approach to indexed searching. It has the advantage of being written to work with both text and nontext files so that you can use a WAIS server for a collection of photographic images, word definitions, book reviews, video or audio clips, and almost anything else you can imagine. It does full-text indexing when text is available. It can index nontext files by file name or index descriptions of each nontext file.
However, one major drawback of WAIS is that the organization that had supported the noncommercial version (called freeWAIS) has dropped it in favor of a model based on the international standard Z39.50 v.2 (or v.1992). The University of Dortmund in Germany has modified and improved the free version to allow for structured fields. This version is called freeWAIS-sf. Structured fields are containers for certain types of information, such as title, author, date, and other data. For example, if you index all your e-mail with the structured fields version of WAIS, you might designate fields for the sender, the subject, and the date. Then you could retrieve any message sent by your boss between certain dates pertaining to a specific subject. This kind of structured field searching offers much more flexibility to the user.
ADVANTAGES
DISADVANTAGES
WAIS does its work by transmitting a natural language query--a question in plain English--to a select number of information servers previously chosen by the user. Each of those servers compares the request with the documents it has on file and sends back the headlines, or document titles, of its closest matches, ranked by how many words they have in common with the query. The user sees the list of headlines (each of which is a link to the original) alongside a number that gives each headline's relative ranking against the query. A score of 1,000 is best, and it's downhill from there. The user then selects one or more documents to view and the full text shows up.
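To make the ranking idea concrete, here is a minimal sketch in Python. It is not the WAIS algorithm itself, which weighs words in more sophisticated ways; it only counts the words a document shares with the query and scales the best match to 1,000. The function names and sample documents are made up for illustration.

```python
import re

def words(text):
    """Lowercase a string and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rank(query, documents):
    """Score (headline, body) pairs by the number of words shared with the query.

    The best match is scaled to 1000 and the others proportionally.
    Documents that share no words with the query are left out.
    """
    q = words(query)
    raw = [(len(q & words(body)), headline) for headline, body in documents]
    best = max((count for count, _ in raw), default=0)
    if best == 0:
        return []
    scored = [(round(1000 * count / best), headline)
              for count, headline in raw if count]
    return sorted(scored, reverse=True)

# Hypothetical sample collection.
docs = [
    ("Sky colors", "Why the sky is blue: light scatters in the air"),
    ("Blue paint", "How to mix blue paint"),
    ("Cooking", "A recipe for soup"),
]
results = rank("Why is the sky blue?", docs)
for score, headline in results:
    print(score, headline)
```

Notice that "Blue paint" still earns a score simply because it shares one word with the question, which is exactly why a high ranking is no guarantee of usefulness.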
With relevance feedback a user can take one document and use it as a basis for a new search in an effort to find more documents that closely match the subject matter of the first one. The new search looks for documents that contain the same words as the user-selected document. In effect, it now has a much more specific search query to work from, with many more example words to try to match.
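Relevance feedback can be pictured as reusing a whole document as the query. The Python sketch below is an assumption-laden illustration, not the method any particular WAIS server uses: it simply ranks the other documents by how many words they share with the one the user picked. The function names and sample texts are invented.

```python
import re

def words(text):
    """Lowercase a string and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def more_like_this(chosen, documents):
    """Rank the other documents by words shared with the chosen one."""
    query = words(chosen)
    scored = [(len(query & words(doc)), doc)
              for doc in documents if doc != chosen]
    # Highest overlap first; drop documents with nothing in common.
    return [doc for count, doc in sorted(scored, reverse=True) if count]

# Hypothetical collection; the first entry is the document the user selected.
library = [
    "mural painting in Mexico City",
    "mural artists of Mexico and their painting techniques",
    "stock market report",
]
similar = more_like_this(library[0], library)
print(similar)
```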
When you have a large amount of text that you want to be completely searchable, you need an indexed server tool like WAIS. The text could consist of many different files, or it could be one or more large files, with blocks of information divided in some standard way. For example, a card catalog might have title, author, publication date, and such with a blank line between each record (a record is a set of fields). WAIS would return only the individual record (book in this case) that matches your search query.
For example, you might profitably use WAIS to search the
WAIS and Z39.50 are both protocols for retrieving information from computer to computer. The assumption is that one computer acts as a client and the other, with indexed databases, acts as a server. They are different from WWW and Gopher protocols in that they aren't "stateless." That is, they are designed to carry on an extended conversation between the client and server until the search session is complete. Stateless protocols, on the other hand, do all their work in one short interaction. Their conversations consist of "Give me this" and "Okay, here it is." Z39.50 and WAIS allow for a longer interaction, building on previous responses.
Z39.50 (which was started in the United States) has gone through several versions, beginning with the largely unimplemented 1988 version (Z39.50-1988). The WAIS protocol was developed as an enlargement of Z39.50-1988 and built into a productive working system. The development of the Z39.50 standard continued, however, and not in the same direction as the WAIS enhancements. Z39.50-1992 (also called Z39.50-V2) improved the standard, among other things by bringing it closer to a similar international standard that also was being developed. WAIS, Inc., the principal holder of the WAIS torch, has altered its product to conform to the newer standard. CNIDR (Clearinghouse for Networked Information Discovery and Retrieval at <http://cnidr.org/>), which had been the developer and guardian of the freeWAIS (noncommercial) version, decided to switch its development efforts to the newer Z39.50 standard. ZDist is CNIDR's Z39.50 implementation and includes a UNIX client, server, HTTP-to-Z39.50 gateway, and an e-mail-to-Z39.50 gateway. Unfortunately, WAIS clients won't work with ZDist servers and vice versa.
FreeWAIS is a single indexer, single search engine, which means that it supports only one method of indexing documents and one search engine for searching those documents. A search engine is the program that actually does the searches. In contrast, ZDist allows more than one indexer and more than one search engine to be used at one time. This is useful when you want to search information stored at different sites and indexed in different ways. Ideally, users would be able to search many different databases without needing to learn the ins and outs of each one. That learning time was a significant barrier to using indexing systems.
WAIS fits in with Z39.50 in that the WAIS protocol has become what is called a Z39.50 profile, or a specific application of Z39.50. Profiles are formal implementation agreements within the context of the Z39.50 standard. They define which portions of the basic Z39.50 protocol are used in a given system and how the system should interpret these portions. WAIS is one of several publicly available Z39.50 profiles. The U.S. Government Information Locator Service (GILS), started by the Office of Management and Budget in concert with the Information Policy Committee of the Information Infrastructure Task Force, is another Z39.50 profile. The purpose of GILS is to make accessible the tremendous quantity of economic data, environmental data, and technical information collected and processed by various agencies of the U.S. government. GILS uses Z39.50 so that the information can be retrieved in a variety of ways. For more information on GILS see <http://www.usgs.gov:80/gils/>.
The single biggest use of Z39.50 might be the bib-1 profile developed for the library OPAC (Online Public-Access Catalog) market. This profile was developed by the library technology community for bibliographic applications of Z39.50. It has numerous commercial implementations in addition to free ones, like ISite from CNIDR. It is hoped that as bib-1 proves successful, profiles for other applications may be developed, such as geographic and medical information systems. Table 5-1 provides links for more information on Z39.50 and WAIS resources available online, as well as WWW-to-Z39.50 gateways.
Some WAIS servers support stemming, which reduces every word in a query to its word stem. For example, stemming would treat computer, computing, and computers identically, with the stem comput.
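As a rough illustration of stemming, the Python sketch below strips a handful of common English suffixes. The suffix list and the minimum stem length are assumptions chosen to fit the example above; real stemmers use far more careful rules, and this is not the routine any WAIS server actually ships with.

```python
# Checked in this order so that "ers" is tried before "er" and "s".
SUFFIXES = ("ers", "ing", "er", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {w: crude_stem(w) for w in ("computer", "computing", "computers")}
print(stems)
```

Because both the query and the indexed text pass through the same routine, a search for any one of the three forms matches documents containing the others.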
It is important to understand that the rankings or scores that WAIS generates for documents are just approximations or guesses at what you really want. Technically, they may match on a certain number of words, but those words might be used in different contexts, so that the document with the highest score might not be useful at all. For example, searching on a question like Why is the sky blue? might bring poetry with those words in one stanza or information about the University of Michigan's Blue Skies Gopher project but nothing that answers the question. Basically, computers still can't think like a human being can. But using indexes is still better than going out and looking for each document. To use a sports analogy, think of computer searching as getting you into the ballpark but not playing the game for you.
The key to WAIS is that by using a powerful search algorithm, it locates and orders documents that match your query, but it doesn't guarantee their utility to your needs.
Fielded searching gives you the ability to search for certain information, such as author or publisher or company. So if you are looking for information about publishing on the Internet, you would prefer to find the words publishing on the Internet in the title of a document instead of scattered throughout the text. When you use fielded searching, a document does not receive a high ranking because it has each of those words individually in its text but because it has those words in its title. That is obviously a better match.
The selection among WAIS servers is not nearly as ample as it is for Gopher and WWW.
Thinking Machines gave the original WAIS program to the public domain. It was supported for a while by the Clearinghouse for Networked Information Discovery and Retrieval (CNIDR) <http://cnidr.org/welcome.cnidr.html>. The CNIDR version (freeWAIS 0.3) has Boolean searching and stemming, and the source code is available for other programmers to modify, but it does not offer structured fields and has many bugs that haven't been fixed. CNIDR has switched its development efforts to ZDist, which is based directly on the newer Z39.50-1992 standard. FreeWAIS 0.3 is available at <ftp://ftp.cnidr.org/pub/NIDR.tools/freewais/>.
The sf in freeWAIS-sf stands for structured fields. FreeWAIS-sf is a UNIX-based WAIS server that builds on the freeWAIS code to provide the ability to search structured fields (including text, date, and numbers) as well as full text. This means that you can search traditional bibliographic citations, which include title, author, and publisher, by those fields. Such a search for James Bond in the title would turn up only those titles in which the words James Bond appear, not all articles or martini recipes that mention James Bond. Or you could restrict your search to only those items published after a particular date.
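The difference between a full-text match and a fielded match can be sketched in a few lines of Python. This is only a model of the idea: the record layout, the function name, and the sample records are hypothetical, and freeWAIS-sf has its own format definitions and query syntax that are not shown here.

```python
from datetime import date

# Hypothetical records with separate fields for title, date, and text.
records = [
    {"title": "The James Bond Films", "published": date(1994, 3, 1),
     "text": "A survey of the series."},
    {"title": "Classic Martini Recipes", "published": date(1993, 6, 15),
     "text": "James Bond prefers his shaken."},
    {"title": "James Bond in Print", "published": date(1990, 1, 10),
     "text": "The novels in order."},
]

def search(records, title_words=None, published_after=None):
    """Return titles of records that satisfy every field restriction given."""
    hits = []
    for rec in records:
        title = rec["title"].lower()
        if title_words and not all(w.lower() in title for w in title_words):
            continue
        if published_after and rec["published"] <= published_after:
            continue
        hits.append(rec["title"])
    return hits

by_title = search(records, title_words=["James", "Bond"])
recent = search(records, title_words=["James", "Bond"],
                published_after=date(1992, 1, 1))
print(by_title)
print(recent)
```

The martini recipe mentions James Bond in its text but never reaches the results, because only the title field is consulted.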
FreeWAIS-sf can be used as a plain WAIS server and is compatible with existing WAIS clients. In addition, it improves on freeWAIS by allowing the user to define document format and headline layout without having to be a C language programmer. It supports country-specific character sets (8-bit only), which means that server operators will have fewer problems indexing data, at least in European languages.
Development is continuing, but this is not a commercial product. FreeWAIS-sf comes from Ulrich Pfeifer of Dortmund University, Germany. <http://charly.informatik.uni-dortmund.de/freeWAIS-sf/>
Two other programs are available. SFgate is a CGI script that connects with WWW and freeWAIS servers. <http://charly.informatik.uni-dortmund.de/SFgate/SFgate> And SFproxy lets you index your personal WWW hotlist (or bookmark file--the collection of Web links you've saved) and easily add URLs. <http://ls6-www.informatik.uni-dortmund.de/SFgate/SFproxy.html>
WAISserver 2.0 by WAIS, Inc., is the top-of-the-line commercial WAIS server ($15,000). The company was founded by Brewster Kahle, who also created WAIS while at Thinking Machines. WAIS, Inc.'s customers include Britannica Online and CMP's TechWeb, among others. Its Web site includes links to freeware WAIS clients and servers. <http://www.wais.com/>
WAISserver for VMS is noncommercial software by Jim Fullton (formerly of the University of North Carolina, now at CNIDR) and runs on VAX VMS systems (contact Fullton if you want to use the software for commercial purposes). The files you need in order to compile it are located at <ftp://sunsite.unc.edu/pub/packages/infosystems/wais/servers/vms/vms-server/>.
The European Microsoft Windows NT Academic Centre (EMWAC) is an integral part of Computing Services of the University of Edinburgh and has been set up to support and act as a focus for Windows NT within academia. It is sponsored by Datalink Computers, Digital, Microsoft, Research Machines, Sequent, and the University of Edinburgh. <http://emwac.ed.ac.uk/html/internet_toolchest/top.html>
There is an older, free, pre-WinSockets (a program that allows Windows to talk TCP/IP) version of a WAIS server that will run on Windows. It was written by Tony Addyman and is noncommercial, and he provides no support but will help someone who wants to take it over. FTP from ftp.salford.ac.uk in pub/wserver.zip.
Some alternatives to WAIS don't use the WAIS protocol but do provide full-text indexing.
ISite, a server for the protocol Z39.50.92, is based directly on the latest version of Z39.50 protocol. Read the document "Why We Changed the Name freeWAIS to Zdist" by Kevin Gamiel of CNIDR for background <http://cnidr.org/talks/whyzdist.html>. The client and server software for UNIX (including an HTTP gateway) are available at <http://vinca.cnidr.org/software/zdist/zdist.html>. The main advantages of ISite over freeWAIS are that
The INQUERY system, from the Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts, is not intended to be an off-the-shelf information retrieval system. CIIR's focus is on collaborating with industry and government to address challenging and important problems associated with text databases. INQUERY is often used to solve those problems. Additional features are being developed, but it currently incorporates relevance feedback techniques (the user's picks help refine the search). A Japanese version is available, and Spanish and Chinese versions are under development. INQUERY has a WWW gateway. <http://ciir.cs.umass.edu/inqueryhomepage.html>
Glimpse is a powerful indexing and query system that allows fast file searches (free for nonprofit use; for licensing information contact the authors at glimpse@cs.colorado.edu). Glimpse can be used by individuals for their personal filing systems as well as by organizations for large data collections. Glimpse is the main searching mechanism behind Harvest, an information discovery and access system that is also worth checking into. <http://harvest.cs.colorado.edu/>
The GlimpseHTTP gateway allows WWW searching of Glimpse indexes. <http://glimpse.cs.arizona.edu:1994/>
ICE, an indexing program for WWW servers by Christian Neuss, is a lightweight, easy to install alternative to WAIS gateways. It allows free-text searches on a World-Wide Web archive. It is free for use in any noncommercial product. According to the author, ICE is beerware--if you decide that you like it, send him a can, or case, of your favorite beer. <http://www.igd.fhg.de/~neuss/>
Open Text Corporation offers a commercial alternative to WAIS, also called Open Text, that is best known for its use with large tagged structures such as SGML databases. It can be used with many other data formats, however, including free-text, word-processing formats, and nontextual data. For example, Open Text has been used to do as complete an index of the WWW as possible. Right now the project is at 574 million words of text with more than 11 million hyperlinks indexed. Try searching--it's extremely impressive, particularly because it lets you search specific sections of HTML documents as well as do proximity and occurrence count searches. (The latter means that the searcher will send back only those documents in which your search term appears a specific number of times.)
Open Text would like to run its Open Text Web Index as a charged service, but it is available for free while the company tries to convince people to buy its Open Text indexing products. <http://opentext.uunet.ca:8080/>
WAIS resources are much less prevalent than Gopher and WWW resources, although they are sorely needed. Managing a WAIS server is an inherently more complicated task, so you may find the newsgroups and mailing lists to be fairly technical. But monitor them nonetheless (see Table 5-1).
WAIS servers can run on UNIX systems, Windows NT, and Windows. Remember that a UNIX system doesn't have to be a large machine. Linux is one version of UNIX that runs on 386 PCs, and Macintoshes can also run a form of UNIX as their operating system. These can give you a relatively inexpensive platform from which to run a UNIX-based WAIS server.
As with Gopher and WWW servers, setting up a WAIS server is a two-part process: installing the server and adding the data, documents, or images it will serve. With Gopher servers data preparation is a simple matter of putting the data, document, and image files into the data directory. Web servers require the additional task of writing HTML code to format the documents you want to go out. Indexing for both Gopher and WWW servers is optional.
But indexing is what WAIS servers are all about. WAIS servers don't deal with documents--they deal with the indexes to those documents. If they aren't indexed, they aren't available. And unfortunately, indexing is not always a simple matter, as we'll see. But first let's get the various pieces installed.
We'll use freeWAIS-sf 1.1 as an example. It is a free WAIS server for UNIX machines, and it has some advanced features, including the ability to search structured fields. It is an enhancement of the original version of freeWAIS and was written by Ulrich Pfeifer and Tung Huynh of the University of Dortmund, Germany.
The installation process consists of running a configuration script that checks out the kind of UNIX system you have. Then you run the UNIX make utility to build the WAIS programs: waisserver, waisindex, waissearch, and waisq. Installation also provides for storing UNIX online documentation files (called man pages) in appropriate directories. Here's what the various programs that are part of freeWAIS-sf 1.1 do:
After you have tested your server and pronounced it ready for use, you set it up to be ready and waiting at all times. You can do this either by using inetd or by running it as a stand-alone program. See the discussion on inetd versus local in Chapter 4, because the same concerns apply here (inetd applies only to UNIX). Basically, inetd starts up only when a request comes through; it doesn't waste system resources by waiting, but its startup is slower. Running the WAIS server in standalone mode gives the quickest response because the software is always running. Check the WAIS server documentation for further considerations.
Because WAIS is based on indexing, the most important part of setting a WAIS server up is to index your data, whether it's a collection of text files, images, sound files, or programs. A WAIS server can deliver all of these, but you first need to understand a few of its restrictions:
When you use waisindex to index your files, you need to specify what type they are. The word type is used to describe the different formats of files. A picture in GIF or JPEG format is obviously different from a text file. But text files also come in different types. One type of text file might have each entry on its own line. Another might spread several items of information over various lines for each entry. WAIS file types try to recognize and take advantage of these differences. See Table 5-2 for a list of data types from the freeWAIS-sf documentation.
As you can see there are many special purpose file formats on the list. Whenever a pattern is consistent throughout a file, waisindex can take advantage of those differences. What follows are some examples from the list in Table 5-2.
A text file is anything that contains text and is stored in ASCII. It could be a collection of résumés, or papers from a conference, or recipes for margaritas. The rule here is that when the server finds a match, it returns the entire file to the client, whether that file is a single résumé or 1,500 résumés, a conference paper or 12 papers from the one conference, or a margarita recipe or a collection of recipes. This can be a problem when all you want a user to see is a single résumé or a single recipe or a single conference report. For example, you have a file that contains the names and addresses of all members of your organization. If that file were indexed as a simple text file, every search would return all the names. But you can use other WAIS file types to gain more control over what your users receive.
Say your file of names and addresses has a blank line between each name, as in Figure 5-1. If you use the para (short for paragraph) type, each name and address combination is treated as a separate document, because a blank line separates each address into "paragraphs." The headline would be the first line of each paragraph, which would work fine for this example. So if Brazil were the search item, the server would return only Rosa Mello's name and address.
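What the para type does to such a file can be imitated with a short Python sketch. It is an approximation for illustration, not waisindex's own code: it splits the text at blank lines and takes the first line of each block as the headline. The addresses in the sample are invented (only the name Rosa Mello comes from the example above).

```python
import re

def split_paragraphs(text):
    """Return (headline, body) pairs, one per blank-line-separated block."""
    blocks = re.split(r"\n[ \t]*\n", text.strip())
    return [(b.splitlines()[0].strip(), b.strip())
            for b in blocks if b.strip()]

# Made-up address file in the spirit of the one described above.
sample = """Rosa Mello
Rua das Flores 10
Sao Paulo, Brazil

John Smith
12 High Street
Leeds, England
"""

docs = split_paragraphs(sample)
hits = [headline for headline, body in docs if "brazil" in body.lower()]
print(hits)
```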
If you want every line of a file to be treated separately, you would specify the file as the one_line type. For example, you have a collection of one-liners (jokes) that you want to index, as in Figure 5-2. The waisindex program will treat each line as a separate document to be indexed and retrieved separately.
With the dash type you can have several paragraphs grouped together, so long as they are separated by a row of dashes. This is useful when the text is longer than a paragraph but not long enough to be in a file by itself. This also allows for some entries to be much longer than others. Just separate them with the row of dashes and they'll be indexed and retrieved separately. In Figure 5-3 student 1 wrote an essay of several paragraphs, student 2 wrote only one paragraph, and student 3 wrote three paragraphs. Because of the dashes between them, they'll be searched and retrieved separately.
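The same splitting idea works for the dash type, as this Python sketch suggests. How many dashes waisindex actually requires in a separator row is not stated here, so the minimum of three is an assumption, as are the function name and the sample essays.

```python
import re

def split_on_dashes(text, min_dashes=3):
    """Split text into entries wherever a line consists only of dashes."""
    pattern = r"^-{%d,}[ \t]*$" % min_dashes
    parts = re.split(pattern, text, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]

# Invented sample: entries of different lengths, separated by rows of dashes.
essays = """Essay by student 1, first paragraph.

Second paragraph by student 1.
--------------------
Essay by student 2.
--------------------
Student 3, paragraph one.

Student 3, paragraph two.
"""

entries = split_on_dashes(essays)
print(len(entries))
```

Blank lines inside an entry are left alone, so a multi-paragraph essay stays together as one retrievable item.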
Large amounts of text often accumulate in mail files. These could be personal messages, the archives of a list server, or the record of your company's e-mail correspondence. Use the mail_or_rmail type, and WAIS can index each message as a separate item, with the subject line of the message acting as the headline. Mail_or_rmail will search and retrieve separately the two mail messages in Figure 5-4. The system will treat the subject line for each as its headline, even though it is not the first line in the section.
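To see how a mail file might be carved into separate items, consider this Python sketch. It assumes the common UNIX mailbox convention in which each message begins with a line starting with "From " (no colon); the real mail_or_rmail rules may differ, and the messages and addresses below are invented.

```python
def split_mail(text):
    """Split an mbox-style string into (headline, message) pairs.

    A new message starts at each line beginning with "From " (no colon).
    The headline is the first Subject: line found in the message.
    """
    messages, current = [], []
    for line in text.splitlines():
        if line.startswith("From ") and current:
            messages.append(current)
            current = []
        current.append(line)
    if current:
        messages.append(current)

    result = []
    for lines in messages:
        subject = next((l[len("Subject:"):].strip() for l in lines
                        if l.lower().startswith("subject:")), "(no subject)")
        result.append((subject, "\n".join(lines)))
    return result

# Invented two-message mailbox.
mbox = """From alice@example.com Mon Jan  2 10:00:00 1995
From: alice@example.com
Subject: Budget meeting

See you at noon.
From bob@example.com Tue Jan  3 09:30:00 1995
From: bob@example.com
Subject: Server downtime

The server is down tonight.
"""

headlines = [subject for subject, _ in split_mail(mbox)]
print(headlines)
```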
Similarly, you can use waisindex to specify Usenet News (Netnews) files or those saved by the rn news reader as a type. That way individual Usenet News postings can be retrieved separately. Logically, the subject line will be used as the headline.
If it seems strange that WAIS can index image files, don't be fooled. WAIS is not magically ferreting out the contents of the images and indexing them. All it is indexing is the file names of these nontext files.
Another technique is to match an explanatory text file with a nontext file, one with the extension .text, the other with an extension suitable to the type of file it is, such as .gif, .au, .mpeg, and so on. For example, lincoln.text and lincoln.gif would both turn up during a search for Lincoln. WAIS indexes the text in the description file and it can be set up to return the titles of both the nontext and text files when it finds a match. You can use this method to link multiple files, such as video, image, and sound relating to the same topic.
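The pairing of description files with nontext files can be modeled as grouping by base file name. The Python sketch below is a simplified picture under that assumption; the function names are hypothetical and the descriptions are invented, and a real server handles this through its own indexing options.

```python
from collections import defaultdict
from pathlib import PurePath

def group_by_stem(filenames):
    """Group file names that share a base name, such as lincoln.*."""
    groups = defaultdict(list)
    for name in filenames:
        groups[PurePath(name).stem].append(name)
    return groups

def search_descriptions(term, descriptions, filenames):
    """Search only the .text descriptions; return every file in a matching group.

    `descriptions` maps each .text file name to its contents.
    """
    groups = group_by_stem(filenames)
    hits = []
    for text_file, content in descriptions.items():
        if term.lower() in content.lower():
            hits.extend(sorted(groups[PurePath(text_file).stem]))
    return hits

files = ["lincoln.text", "lincoln.gif", "lincoln.au",
         "grant.text", "grant.gif"]
descriptions = {
    "lincoln.text": "A portrait of Abraham Lincoln.",
    "grant.text": "A portrait of Ulysses S. Grant.",
}
hits = search_descriptions("Lincoln", descriptions, files)
print(hits)
```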
Most of the data types listed in Table 5-2 are probably unfamiliar to you. Even when they are familiar, it can be helpful to have a model against which to check to ensure you and WAIS really are speaking the same language. For that reason the nice folks at Dortmund University in Germany put a set of example files on the Net: <ftp://ls6-www.informatik.uni-dortmund.de/pub/wais/fmt-examples>.
Once you've identified the type of file format you have and know where you'll store the indexes, you're ready to run the waisindex program to create all the indexes that waisserver will need. The indexes created will take up as much space as the original files. For a file full of one-line records, you would type waisindex -d QUOTES -export -t one_line quotes.txt. The parameter -d QUOTES creates a WAIS database called QUOTES. The -t one_line means treat it as a type one_line where every line is indexed separately. The -export parameter writes your server's host name and port into the database's source description file so that clients on other machines can find it. A separate parameter, -register, sends that description to the master directory of WAIS servers at WAIS, Inc. <http://www.wais.com>
Figure 5-5 shows an index from a file of mural artists' biographies that fit the para (paragraph) type. Unfortunately, as you can see in Figure 5-5, the name of the artist leads right into the first line of the biography. Because the para type uses what it finds on the first line to create the headline presented in query results, this is messier than we want. Each headline will show up with the artist's name, date of birth, and a few words from the first sentence. In this case we have a simple solution--there is a consistent pattern that helps differentiate the title information. Note that in Figure 5-5 the artist's name and date information ends with a colon.
After checking that the colon usage is consistent throughout the file, we can write a simple Perl script to find everything up to the colon and put a carriage return character after it. The resulting file looks like Figure 5-6.
Now when waisindex takes the first line of each paragraph to be the title for that record, the artist's name and birthdate show up alone for a clean title. As in this case, you may often find yourself cleaning up text file formats for better indexing.
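The colon cleanup just described takes only a few lines. The sketch below uses Python rather than the Perl mentioned above, and it is not the original script; the artist names and biographies are placeholders, since the actual file is not reproduced here.

```python
def split_headline(paragraph):
    """Put a line break after the first colon so the name stands alone."""
    head, sep, rest = paragraph.partition(":")
    if not sep:
        return paragraph  # no colon found: leave the paragraph untouched
    return head + sep + "\n" + rest.lstrip()

def clean_file(text):
    """Apply split_headline to every blank-line-separated paragraph."""
    paragraphs = text.strip().split("\n\n")
    return "\n\n".join(split_headline(p) for p in paragraphs) + "\n"

# Placeholder biographies in the "name and date, colon, text" pattern.
raw = ("Ana Example (b. 1950): Painted murals in schools.\n\n"
       "Ben Sample (b. 1962): Works in tile.\n")
cleaned = clean_file(raw)
print(cleaned)
```

Only the first colon in each paragraph is used, so colons later in a biography are left where they are.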
Indexing HTML files is a common occurrence with the tremendous increase in WWW servers. The freeWAIS-sf FAQ describes how to do this using the freeWAIS-sf server. See <http://www.cis.ohio-state.edu/hypertext/faq/usenet/wais-faq/freeWAIS-sf/faq.html>.
The key to indexing documents is to clearly separate and identify each piece that you might want to treat separately. In the case of a book, breaking it into chapters probably is not sufficient. If a WAIS search finds a match in a 20-page chapter, the person who requested the material still will have to plow through 20 pages to find the section that matches. Dividing the book's chapters into sections or even one- or two-page parts might make more sense. That way a search response returns a much smaller and more manageable piece of text for the user to view. However, you'll want to make sure that you don't break it down so much that viewers will lose the context.
Parsers and filters are tools that can split up your text or rearrange it for you. Basically, you can program parsers and filters to take advantage of any built-in patterns in the text to break it the way you want it. Say you have a file of mail messages that is separated by a row of equal signs (Microsoft Mail for DOS can produce files like this), and you want to index each message individually. The dash type would work, except that it wants a row of dashes instead of a row of equal signs. No problem. Convert each row of equal signs to a row of dashes. That's easy to do with any program that can do searches and replaces. Stream editor (sed) and Perl in UNIX can easily do this. Similar tools are available for almost every computer platform.
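The equal-signs-to-dashes conversion is a one-line search and replace in most tools. Here is one way it might look in Python; the minimum of three equal signs is an assumption meant to avoid touching ordinary lines, and the sample text is invented.

```python
import re

def equals_to_dashes(text):
    """Replace each line made up only of equal signs with a row of dashes."""
    return re.sub(r"^={3,}[ \t]*$",
                  lambda m: "-" * len(m.group(0).rstrip()),
                  text, flags=re.MULTILINE)

# Invented sample with one separator row.
mail_file = "First message\n==========\nSecond message, where a = b\n"
converted = equals_to_dashes(mail_file)
print(converted)
```

Equal signs that appear inside a line of ordinary text are not affected, because the pattern matches only lines that contain nothing else.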
It may be possible to create your own filters by using a language such as Perl, C, or even shell scripts. In Figure 5-5, I wanted the artist's name to show up as the headline, or title, for each record. The easiest way to do that was to use Perl to convert the text to fit an existing format type. You can make the server make such changes automatically when you have a regular supply of files that need exactly the same alteration. But the problem often is that each file is different, and someone has to find the easiest way to convert it to a form that will work. If you know your tools, you're halfway there.
Once you have installed your WAIS server and set up some databases for public searching, you can use the -register parameter of waisindex to register with the central WAIS directories. This is not necessary if you are going to access WAIS databases only from your Gopher or WWW servers.
WAIS is a very different animal from Gopher or WWW. Gopher and WWW provide information over the Internet, but for the most part any indexing that is done is hired out, that is, done by other programs. (The GN Gopher/WWW server and the WN WWW server are exceptions.) WAIS and Z39.50, to which it's related, are specifically about indexing text databases and providing for online searches of those databases. This is a large and growing area of Internet publishing, but it is often hidden behind Gopher and WWW front ends.
The commercial version of WAIS servers is available only from WAIS, Inc., and is expensive ($15,000), although the company is considering a free version (see Chapter 11). The freeWAIS version is no longer supported by CNIDR. The improved freeWAIS-sf version adds the ability to do structured field searches, which adds quite a bit more flexibility for the user. Many alternative indexing systems are available, including Glimpse, ICE, and Open Text. For the most part these focus strictly on indexing. Most have some sort of gateway to WWW so that they can be searched by WWW browsers.
Z39.50 is a U.S. standard that complies with a similar international standard for information indexing and retrieval. Large-scale information publishers and libraries have developed some interesting applications using Z39.50. See <http://www.lib.ncsu.edu/staff/morgan/alcuin/wwwed-catalogs.html> for Library Catalogs with WWW-to-Z39.50 Interfaces. The U.S. Office of Management and Budget has helped to set up the Z39.50-based program called GILS to make available the vast amount of information and data collected by various U.S. government agencies.