Abstract

Using the URL at the [end][1] of this item, readers can immediately offer feedback and suggestions on this topic. The importance of good search tools [[HN1][2]] has been apparent since the early days of the Internet. There are now an estimated 100 million Web pages, and with about 200,000 added each day, it is not surprising that the main complaint among users is the difficulty of finding the documents they want. An important class of tools is the search “engine,” which takes keywords as input and returns a set of Web documents. Effective use of these tools requires some knowledge of how they work.

Search engines are not directories. The latter are sites where several thousand Web sites have been categorized by human reviewers. The largest is Yahoo! ([www.yahoo.com][3]), with links to about 370,000 Web sites. Although directories often point users to quality sites, they cannot possibly catalog all of the Web. For a more thorough search, users must turn to search engines. Because these engines are not limited by the availability of human filters, they can catalog literally millions of Web documents.

Each engine consists of three distinct components: the spider, the index, and the query module. The spider (also referred to as the robot) [[HN2][4]] is a program that “moves” on the Net from page to page in search of new documents. Each robot uses its own specific algorithm [[HN3][5]] for finding and navigating Web pages. The harvested pages are then entered into a database on one of the search engine's computers. The database is organized by an index whose architecture is also specific for each search engine. Users search the index through a predefined query module, an interface specific to each engine.

Two concepts should be borne in mind. First, search engines do not scour the Net in real time, but rather query an index of Web pages compiled on the search engine's site. This explains why some search results point to outdated or nonexistent links.
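
As a rough illustration of the three components and of why results can point to pages that no longer exist, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the in-memory `TOY_WEB` dictionary stands in for the Net, and the names `spider`, `build_index`, and `query` are hypothetical. Real engines each use their own proprietary algorithms and index architectures.

```python
from collections import defaultdict, deque

# Hypothetical miniature "Web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.html": ("search engines index web documents", ["b.html", "c.html"]),
    "b.html": ("a spider moves from page to page", ["a.html"]),
    "c.html": ("directories are compiled by human reviewers", []),
}

def spider(web, start):
    """Walk the toy Web breadth-first from `start`; return {url: text}."""
    seen, queue, harvested = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        if url not in web:  # dead link: nothing to harvest
            continue
        text, links = web[url]
        harvested[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return harvested

def build_index(pages):
    """Build an inverted index: keyword -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def query(index, keywords):
    """Return the URLs that contain every keyword (a simple AND search)."""
    sets = [index.get(word.lower(), set()) for word in keywords.split()]
    return sorted(set.intersection(*sets)) if sets else []

# Crawl once, then answer queries from the stored index only.
index = build_index(spider(TOY_WEB, "a.html"))

# Removing a page from the toy Web does not touch the index, so the
# query still returns the now-nonexistent link until the next crawl.
del TOY_WEB["c.html"]
stale = query(index, "human reviewers")
```

Because `query` reads only the stored index, `stale` still lists `c.html` after the page has been deleted, which is the outdated-link effect described above.
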
Indeed, it may take days or weeks for any particular search engine to traverse the entire Net. Second, the same query on two different search engines will not yield identical results, because the combination of proprietary robot, index, and user interface will be unique. The best strategy is to try several engines when searching for a specific Web document.

With the proliferation of commercial search engines, a new tool, the metasearch engine, has occupied a slot above them all. Metasearch engines will send a query to several other engines in parallel and return a composite report (with duplicate entries removed). This can save the effort of performing several searches in series. But metasearch engines tend to be slower and to return at most about 100 results.

Because each engine has its own quirks, it is worth reading the search tips usually listed on the home page. Armed with a few search engines and perseverance, users should be able to find relevant documents in what is fast becoming the world's largest library. Some top search engines are listed in the accompanying Site Finder section, and links to additional search sites can be found at [www.MedsiteNavigator.com/techsight/nettips_3.html][6].

### General Hypernotes

1. This WWW Search Engines Register site is for the expert. It compiles a running tab of the most significant [search tools][7] in current operation.
2. Here you can find details of how to build and operate your own [Web robot][8].
3. The [Harvest project][9] at the University of Colorado is a popular search and retrieve algorithm for Web search engines. It can be obtained free of charge.

[1]: /lookup/doi/10.1126/science.277.5328.976
[2]: #p-9
[3]: http://www.yahoo.com
[4]: #p-10
[5]: #p-11
[6]: http://www.MedsiteNavigator.com/techsight/nettips_3.html
[7]: http://coombs.anu.edu.au/CoombswebPages/SearchEngines.html
[8]: http://info.webcrawler.com/mak/projects/robots/robots.html
[9]: http://harvest.transarc.com/
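
The metasearch behavior described above (one query sent to several engines in parallel, then a composite report with duplicates removed) might look roughly like the following Python sketch. The two engine functions are hypothetical stand-ins that return fixed lists, and the `limit` default of 100 is only an assumption echoing the approximate cap mentioned in the text. It does not describe how any particular metasearch engine works.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real engines: each maps a query to ranked URLs.
def engine_a(q):
    return ["x.example/1", "y.example/2", "z.example/3"]

def engine_b(q):
    return ["y.example/2", "w.example/4"]

def metasearch(q, engines, limit=100):
    """Query every engine in parallel and merge results without duplicates."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        # pool.map returns results in the same order as `engines`.
        result_lists = list(pool.map(lambda engine: engine(q), engines))
    merged, seen = [], set()
    for results in result_lists:
        for url in results:
            if url not in seen:  # keep only the first occurrence
                seen.add(url)
                merged.append(url)
    return merged[:limit]  # cap the size of the composite report
```

Calling `metasearch("web robots", [engine_a, engine_b])` returns the three results from `engine_a` followed by the one result from `engine_b` that is not already listed.
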
