WAR OF THE SEARCH ROBOTS IN THE MODIFIED SEVENTH QUANTUM REALM . . . ?
DISCLAIMER :
BELOW THIS LINE YOU WILL BE LEAVING THE WWW.GLOBALMIND.INFO AREA.
PLEASE CHOOSE THE WEBSITE VIA HYPERLINKS PROVIDED BELOW TO READ FULL ARTICLE .
Web Robots are programs that traverse the Web automatically. Some people call them Web Wanderers, Crawlers, or Spiders. These pages have further information about these Web Robots.
| The Web Robots FAQ | Frequently Asked Questions about Web Robots, from Web users, Web authors, and Robot implementors. |
| Robots Exclusion | Find out what you can do to direct robots that visit your Web site. |
| A List of Robots | A database of currently known robots, with descriptions and contact details. |
| The Robots Mailing List | An archived mailing list for discussion of technical aspects of designing, building, and operating Web Robots. |
| Articles and Papers | Background reading for people interested in Web Robots. |
| Related Sites | Some references to other sites that concern Web Robots. |
(http://www.robotstxt.org/wc/robots.html)
These are frequently asked questions about Web robots.
Send suggestions and comments to Martijn Koster. This information is in the public domain.
A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.
Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot.
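The definition above can be illustrated with a minimal sketch. It assumes a breadth-first visiting order and a caller-supplied `fetch_links` function; both are choices made for this illustration, since the definition does not prescribe any particular traversal. A small in-memory dictionary with made-up URLs stands in for real HTTP retrieval.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(start_url: str,
          fetch_links: Callable[[str], Iterable[str]],
          max_documents: int = 100) -> list[str]:
    """Visit documents breadth-first, following each referenced link once."""
    visited: list[str] = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(visited) < max_documents:
        url = queue.popleft()
        visited.append(url)              # "retrieve" the document
        for link in fetch_links(url):    # then queue everything it references
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A tiny in-memory "Web" (hypothetical URLs) replaces real network access.
TOY_WEB = {
    "/index.html": ["/a.html", "/b.html"],
    "/a.html": ["/b.html", "/c.html"],
    "/b.html": ["/index.html"],
    "/c.html": [],
}

order = crawl("/index.html", lambda url: TOY_WEB.get(url, []))
print(order)  # ['/index.html', '/a.html', '/b.html', '/c.html']
```

The `seen` set is what keeps the traversal from looping when documents link back to each other, and `max_documents` is a crude stand-in for the limits a real robot would apply.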
Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).
Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading as they give the impression the software itself moves between sites like a virus; this is not the case: a robot simply visits sites by requesting documents from them.
The word "agent" is used for lots of meanings in computing these days. Specifically:
A search engine is a program that searches through some dataset. In the context of the Web, the word "search engine" is most often used for search forms that search through databases of HTML documents gathered by a robot.
Robots can be used for a number of purposes:
See the list of active robots to see what robot does what. Don't ask me -- all I know is what's on the list...
They're all names for the same sort of thing, with slightly different connotations:
There are a few reasons people believe robots are bad for the Web:
But at the same time the majority of robots are well designed, professionally operated, cause no problems, and provide a valuable service in the absence of widely deployed better solutions.
So no, robots aren't inherently bad, nor inherently brilliant, and need careful attention.
Yes:
Its coverage of HTTP, HTML, and Web libraries is a bit too thin to be a "how to write a web robot" book, but it provides useful background reading and a good overview of the state-of-the-art, especially if you haven't got the time to find all the info yourself on the Web.
Published by New Riders, ISBN 1-56205-463-5.
Published by Sam's, ISBN: 1-57521-016-9
A few others can be found on the Software Agents Mailing List FAQ.
There is a Web robots home page on: http://www.robotstxt.org/wc/robots.html
Of course the latest version of this FAQ is there.
You'll also find details and an archive of the robots mailing list, which is intended for technical discussions about robots.
This depends on the robot; each one uses different strategies. In general they start from a historical list of URLs, especially of documents with many links elsewhere, such as server lists, "What's New" pages, and the most popular sites on the Web.
Most indexing services also allow you to submit URLs manually, which will then be queued and visited by the robot.
Sometimes other sources for URLs are used, such as scanners through USENET postings, published mailing list archives, etc.
Given those starting points a robot can select URLs to visit and index, and to parse and use as a source for new URLs.
If an indexing robot knows about a document, it may decide to parse it, and insert it into its database. How this is done depends on the robot: Some robots index the HTML Titles, or the first few paragraphs, or parse the entire HTML and index all words, with weightings depending on HTML constructs, etc. Some parse the META tag, or other special hidden tags.
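As a rough illustration of weighting words by the HTML construct they appear in, the sketch below uses Python's standard `html.parser`. The weights in `TAG_WEIGHTS`, the weight given to META keywords, and the class name `WeightedIndexer` are all invented for this example; real robots each use their own rules.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: words in the title or a heading count for more.
TAG_WEIGHTS = {"title": 5, "h1": 3, "h2": 2}
META_KEYWORD_WEIGHT = 4  # also an assumption for this sketch


class WeightedIndexer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.scores = Counter()
        self.open_tags = []  # weighted tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)
        elif tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() == "keywords":
                self.add_words(attrs.get("content") or "", META_KEYWORD_WEIGHT)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            self.open_tags.remove(tag)

    def handle_data(self, data):
        # Plain body text gets weight 1; text inside a weighted tag gets more.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1)
        self.add_words(data, weight)

    def add_words(self, text, weight):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self.scores[word] += weight


page = """<html><head><title>Web Robots</title>
<meta name="keywords" content="robots, spiders"></head>
<body><h1>Robots FAQ</h1><p>Robots traverse the Web.</p></body></html>"""

indexer = WeightedIndexer()
indexer.feed(page)
indexer.close()
print(indexer.scores.most_common(3))
```

A word that appears in the title, the keywords, a heading, and the body accumulates all four weights, which is one simple way to rank it above words that occur only in running text.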
We hope that as the Web evolves more facilities become available to efficiently associate meta data such as indexing information with a document. This is being worked on...
You guessed it, it depends on the service :-) Most services have a link to a URL submission form on their search page.
Fortunately you don't have to submit your URL to every service by hand: Submit-it <URL: http://www.submit-it.com/> will do it for you.
You can check your server logs for sites that retrieve many documents, especially in a short time.
If your server supports User-agent logging you can check for retrievals with unusual User-agent header values.
Finally, if you notice a site repeatedly checking for the file '/robots.txt' chances are that is a robot too.
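The three checks above can be sketched as a small log summary. This assumes the server writes the NCSA Combined Log Format (the User-agent field is optional in the pattern); the function name `summarize`, the threshold, and the sample addresses are made up for illustration.

```python
import re
from collections import Counter

# Assumes NCSA Combined Log Format; adjust the pattern for other servers.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+'
    r'(?: "[^"]*" "(?P<agent>[^"]*)")?'
)


def summarize(lines, busy_threshold=100):
    """Return (hosts that look like robots, User-agent values seen per host)."""
    requests = Counter()
    agents = {}
    asked_for_robots_txt = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines in an unexpected format
        host = match["host"]
        requests[host] += 1
        agents.setdefault(host, set()).add(match["agent"] or "-")
        if match["path"] == "/robots.txt":
            asked_for_robots_txt.add(host)
    busy = {host for host, count in requests.items() if count >= busy_threshold}
    return busy | asked_for_robots_txt, agents


sample = [
    '192.0.2.7 - - [10/Oct/1996:13:55:36 -0700] "GET /robots.txt HTTP/1.0" 404 0 "-" "ExampleBot/1.0"',
    '192.0.2.7 - - [10/Oct/1996:13:55:37 -0700] "GET /index.html HTTP/1.0" 200 2326 "-" "ExampleBot/1.0"',
    '198.51.100.4 - - [10/Oct/1996:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/3.0"',
]
likely_robots, agents = summarize(sample, busy_threshold=2)
print(likely_robots)
print(agents)
```

The result is only a list of candidates: a busy proxy can retrieve many documents too, so the User-agent values are returned alongside for a human to inspect.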
Well, nothing :-) The whole idea is they are automatic; you don't need to do anything.
If you think you have discovered a new robot (i.e. one that is not listed on the list of active robots) and it does more than sporadic visits, drop me a line so I can make a note of it for future reference. But please don't tell me about every robot that happens to drop by!
This is called "rapid-fire", and people usually notice it if they're monitoring or analysing an access log file.
First of all, check whether it is a problem by checking the load of your server and monitoring your server's error log, and concurrent connections if you can. If you have a medium or high performance server, it is quite likely to be able to cope with a high load of even several requests per second, especially if the visits are quick.
However you may have problems if you have a low performance site, such as your own desktop PC or Mac you're working on, or you run low performance server software, or if you have many long retrievals (such as CGI scripts or large documents). These problems manifest themselves in refused connections, a high load, performance slowdowns, or in extreme cases a system crash.
If this happens, there are a few things you should do. Most importantly, start logging information: when did you notice, what happened, what do your logs say, what are you doing in response, etc.; this helps when investigating the problem later. Secondly, try to find out where the robot came from, what IP addresses or DNS domains, and see if they are mentioned in the list of active robots. If you can identify a site this way, you can email the person responsible and ask them what's up. If this doesn't help, try their own site for telephone numbers, or mail postmaster at their domain.
If the robot is not on the list, mail me with all the information you have collected, including actions on your part. If I can't help, at least I can make a note of it for others.
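To decide whether a visitor really is "rapid-firing", one option is to count requests per host inside a sliding time window. The sketch below is only an illustration: the function name `rapid_fire_hosts`, the default thresholds, and the sample events are assumptions, and the (timestamp, host) pairs would have to be extracted from your own access log first.

```python
from collections import defaultdict, deque


def rapid_fire_hosts(events, max_requests=5, window_seconds=1.0):
    """Return hosts that made more than max_requests within any window.

    `events` is an iterable of (timestamp_in_seconds, host) pairs sorted by
    time. The default thresholds are illustrative, not recommended values.
    """
    recent = defaultdict(deque)  # host -> timestamps still inside the window
    offenders = set()
    for timestamp, host in events:
        times = recent[host]
        times.append(timestamp)
        # Drop requests that have fallen out of the window.
        while timestamp - times[0] >= window_seconds:
            times.popleft()
        if len(times) > max_requests:
            offenders.add(host)
    return offenders


events = [
    (0.0, "a.example"), (0.1, "a.example"), (0.2, "a.example"), (0.3, "a.example"),
    (5.0, "b.example"), (9.0, "b.example"),
]
print(rapid_fire_hosts(events, max_requests=3))  # {'a.example'}
```

Keeping the offending hosts and the times they were seen is also a convenient start for the log of information recommended above.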
Read the next section...
They are probably from robots trying to see if you have specified any rules for them using the Standard for Robot Exclusion, see also below.
If you don't care about robots and want to prevent the messages in your error logs, simply create an empty file called robots.txt in the root level of your server.
Don't put any HTML or English language "Who the hell are you?" text in it -- it will probably never get read by anyone :-)
The quick way to prevent robots visiting your site is to put these two lines into the /robots.txt file on your server:
User-agent: *
Disallow: /
but it's easy to be more selective than that.
You can read the whole standard specification but the basic concept is simple: by writing a structured text file you can indicate to robots that certain parts of your server are off-limits to some or all robots. It is best explained with an example:
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
The first two lines, starting with '#', specify a comment.
The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.
The second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/' disallowed. Because all relative URLs on a server start with '/', this means the entire site is closed off.
The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note the '*' is a special token, meaning "any other User-agent"; you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines.
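One way to check how such a file is read is to feed it to a parser. The sketch below uses `urllib.robotparser` from the Python standard library, which is one implementation among many and not the code any particular robot runs; the helper `allowed` and the robot name `somebot` are made up for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Ask the parser whether `agent` may fetch `path` on this server."""
    return parser.can_fetch(agent, "http://webcrawler.com" + path)


print(allowed("webcrawler", "/tmp/scratch.html"))  # True: nothing disallowed
print(allowed("lycra", "/index.html"))             # False: whole site closed
print(allowed("somebot", "/index.html"))           # True: not under /tmp or /logs
print(allowed("somebot", "/tmp/scratch.html"))     # False: starts with /tmp
print(allowed("somebot", "/logs/access.log"))      # False: starts with /logs
```

Because Disallow values are matched as plain prefixes, `Disallow: /tmp` also covers a path such as /tmpfile.html, not just the /tmp directory.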
Two common errors:
Probably... there are some ideas floating around. They haven't made it into a coherent proposal because of time constraints, and because there is little pressure. Mail suggestions to the robots mailing list, and check the robots home page for work in progress.
Sometimes you cannot make a /robots.txt file, because you don't administer the entire server. All is not lost: there is a new standard for using HTML META tags to keep robots out of your documents.
The basic idea is that if you include a tag like:
<META NAME="ROBOTS" CONTENT="NOINDEX">
in your HTML document, that document won't be indexed.
If you do:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
the links in that document will not be parsed by the robot.
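From the robot's side, honouring these tags means reading the META element before deciding what to do with a page. The following is a minimal sketch using Python's standard `html.parser`; the names `RobotsMetaParser` and `robots_meta` are invented here, and splitting the CONTENT value on commas is an assumption about how several directives are combined in one tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <META NAME="ROBOTS" CONTENT="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            # Assumption: several directives may be given, separated by commas.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_meta(html):
    """Return (may_index, may_follow) for one HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return ("noindex" not in parser.directives,
            "nofollow" not in parser.directives)


print(robots_meta('<META NAME="ROBOTS" CONTENT="NOINDEX">'))   # (False, True)
print(robots_meta('<META NAME="ROBOTS" CONTENT="NOFOLLOW">'))  # (True, False)
print(robots_meta("<p>No META tag at all</p>"))                # (True, True)
```

A page without the tag is treated as indexable and followable, which matches the default behaviour of a robot that has been told nothing.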
Some people are concerned that listing pages or directories in the /robots.txt file may invite unintended access. There are two answers to this.
The first answer is a workaround: You could put all the files you don't want robots to visit in a separate sub directory, make that directory un-listable on the web (by configuring your server), then place your files in there, and list only the directory name in the /robots.txt. Now an ill-willed robot can't traverse that directory unless you or someone else puts a direct link on the web to one of your files, and then it's not /robots.txt's fault.
For example, rather than:
User-Agent: *
Disallow: /foo.html
Disallow: /bar.html
do:
User-Agent: *
Disallow: /norobots/
and make a "norobots" directory, put foo.html and bar.html into it, and configure your server to not generate a directory listing for that directory. Now all an attacker would learn is that you have a "norobots" directory, but he won't be able to list the files in there; he'd need to guess their names.
However, in practice this is a bad idea -- it's too fragile. Someone may publish a link to your files on their site. Or it may turn up in a publicly accessible log file, say of your user's proxy server, or maybe it will show up in someone's web server log as a Referer. Or someone may misconfigure your server at some future date, "fixing" it to show a directory listing. Which leads me to the real answer:
The real answer is that /robots.txt is not intended for access control, so don't try to use it as such. Think of it as a "No Entry" sign, not a locked door. If you have files on your web site that you don't want unauthorized people to access, then configure your server to do authentication, and configure appropriate authorization. Basic Authentication has been around since the early days of the web (and in e.g. Apache on UNIX is trivial to configure), and if you're really serious, SSL is commonplace in web servers.
If you mean a search service, check out the various directory pages on the Web, such as Netscape's Exploring the Net, or try one of the Meta search services such as MetaSearch.
Well, you can have a look at the list of robots; I'm starting to indicate their public availability slowly.
In the meantime, two indexing robots that you should be able to get hold of are Harvest (free), and Verity's.
See above -- some may be willing to give out source code.
Alternatively, check out the libwww-perl5 package, which has a simple example.
Lots. First read through all the stuff on the robot page then read the proceedings of past WWW Conferences, and the complete HTTP and HTML spec. Yes; it's a lot of work :-)
Simply fill in a form you can find on The Web Robots Database and email it to me.
Sometimes people find they have been indexed by an indexing robot, or that a resource discovery robot has visited part of a site that for some reason shouldn't be visited by robots.
In recognition of this problem, many Web Robots offer facilities for Web site administrators and content providers to limit what the robot does. This is achieved through two mechanisms:
| The Robots Exclusion Protocol | A Web site administrator can indicate which parts of the site should not be visited by a robot, by providing a specially formatted file on their site, in http://.../robots.txt. |
| The Robots META tag | A Web author can indicate if a page may or may not be indexed, or analysed for links, through the use of a special HTML META tag. |
The remainder of this page provides full details on these facilities.
Note that these methods rely on cooperation from the Robot, and are by no means guaranteed to work for every Robot. If you need stronger protection from robots and other agents, you should use alternative methods such as password protection.
The Robots Exclusion Protocol is a method that allows Web site administrators to indicate to visiting robots which parts of their site should not be visited by the robot.
In a nutshell, when a Robot visits a Web site, say http://www.foobar.com/, it first checks for http://www.foobar.com/robots.txt. If it can find this document, it will analyse its contents for records like:
User-agent: *
Disallow: /
to see if it is allowed to retrieve the document. The precise details on how these rules can be specified, and what they mean, can be found in:
The Robots META tag allows HTML authors to indicate to visiting robots if a document may be indexed, or used to harvest more links. No server administrator action is required.
Note that currently only a few robots implement this.
In this simple example:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
a robot should neither index this document, nor analyse it for links.
Full details on how this tag works are provided:
The List of Active Robots has been changed to a new format, called The Web Robots Database. This format will allow more information to be stored, updates to happen faster, and the information to be more clearly presented.
Note that now that robot technology is being used in increasing numbers of end-user products, this list is becoming less useful and complete.
For general information on robots see Web Robots Pages.
The robot information is now stored in individual files, with several HTML tables providing different views of the data:
Browsers without support for tables can consult the overview of text files.
The combined raw data in machine readable format is available in a text file.
To add a new robot, fill in this empty template, using this schema description, and email it to add-robot@robotstxt.org.
There are robots out there that the database contains no details on. If/when I get those details they will be added, otherwise they'll remain on the list below, as unresponsive or unknown sites.
These services must use robots, but haven't replied to requests for an entry...
User-agent field: Wobot/1.00
From: mckinley.mckinley.com (206.214.202.2) and galileo.mckinley.com. (206.214.202.45)
Honors "robots.txt": yes
Contact: cedeno@mckinley.mckinley.com (or possibly: spider@mckinley.mckinley.com)
Purpose: Resource discovery for Magellan (http://www.mckinley.com/)
These look like new robots, but have no contact info...
BizBot04 kirk.overleaf.com
HappyBot (gserver.kw.net)
CaliforniaBrownSpider
EI*Net/0.1 libwww/0.1
Ibot/1.0 libwww-perl/0.40
Merritt/1.0
StatFetcher/1.0
TeacherSoft/1.0 libwww/2.17
WWW Collector
processor/0.0ALPHA libwww-perl/0.20
wobot/1.0 from 206.214.202.45
Libertech-Rover www.libertech.com?
WhoWhere Robot
ITI Spider
w3index
MyCNNSpider
SummyCrawler
OGspider
linklooker
CyberSpyder (amant@www.cyberspyder.com)
SlowBot
heraSpider
Surfbot
Bizbot003
WebWalker
SandBot
EnigmaBot
spyder3.microsys.com
www.freeloader.com.
These have no known user-agent, but have requested /robots.txt repeatedly or exhibited crawling patterns.
205.252.60.71
194.20.32.131
198.5.209.201
acke.dc.luth.se
dallas.mt.cs.cmu.edu
darkwing.cadvision.com
waldec.com
www2000.ogsm.vanderbilt.edu
unet.ca
murph.cais.net (rapid fire... sigh)
spyder3.microsys.com
www.freeloader.com.
Some other robots are mentioned in a list of Japanese Search Engines.
Note: this mailing list was formerly located at robots@nexor.co.uk.
This list has moved to robots@mccmedia.com.
The robots@webcrawler.com mailing-list is intended as a technical forum
for authors, maintainers and administrators of WWW robots. Its aim is to maximise the
benefits WWW robots can offer while minimising drawbacks and duplication of effort. It is
intended to address both development and operational aspects of WWW robots.
This list is not intended for general discussion of WWW development efforts, or as a first line of support for users of robot facilities.
Postings to this list are informal, and decisions and recommendations formulated here do not constitute any official standards. Postings to this list will be made available publicly through a mailing list archive. Neither the administrator of this list nor his company accepts any responsibility for the content of the postings.
These few rules of etiquette make the administrator's life easier, and this list (and others) more productive and enjoyable:
When subscribing to this list, make sure you check any auto-responder ("vacation") software, and make sure it doesn't reply to messages from this list. X-400 and LAN email systems are notorious for positive delivery reports...
If your email address changes, please unsubscribe and resubscribe rather than just let the subscription go stale: this saves the administrator work (and frustration).
When first joining the list, glance through the archive (details below) or listen-in a while before posting, so you get a feel for the kind of traffic on the list.
Never send "unsubscribe" messages to the list itself.
Don't post unrelated or repeated advertising to the list.
To subscribe to this list, send a mail message to robots-request@webcrawler.com,
with the word subscribe on the first line of the body.
To unsubscribe from this list, send a mail message to robots-request@webcrawler.com,
with the word unsubscribe on the first line of the body.
Should this fail or should you otherwise need human assistance, send a message to owner-robots@webcrawler.com.
To send a message to all subscribers on the list itself, mail robots@webcrawler.com.
Messages to this list are archived. The preferred way of accessing the archived messages is using the Robots Mailing List Archive provided by Hypermail.
Behind the scenes this list is currently managed by Majordomo, an automated mailing list manager written in Perl. Majordomo also allows access to archived messages; send mail to robots-request@webcrawler.com with the word help in the body to find out how.
| Bot Spot | "The Spot for All Bots on the Net". |
| The Web Robots Pages | Martijn Koster's pages on robots, specifically robot exclusion. |
| Japanese Search Engines | This is a comprehensive index for searching, submitting, and navigating using Japanese search engines. |
| Search Engine Watch | A site with information about many search engines, including comparisons. Some information is available to subscribers only. |
| RoboGen | RoboGen is a visual editor for Robot Exclusion Files; it allows one to create agent rules by logging onto your FTP server and selecting files and directories. |