Great Circle Associates List-Managers
(February 1998)
 


Subject: Re: Archives and robots.txt
From: Chris Pepper <pepper-list @ list . audubon . org>
Date: Thu, 5 Feb 1998 16:46:08 -0500
To: list-managers @ GreatCircle . COM
Cc: cnorman @ best . com
In-reply-to: <199802050241.SAA06128@shell7.ba.best.com>
References: <v04003a31b0fe95333d9a@[17.219.12.172]> (message from Chuq VonRospach on Wed, 4 Feb 1998 14:00:16 -0800)

Cyndi,

	Check out
<http://info.webcrawler.com/mak/projects/robots/norobots.html#method> for
the Robots Exclusion Standard.

	Short form: well-behaved robots look for /robots.txt (note -- this
is a file at the top level of the server; they *DO NOT* look in every
directory for files named robots.txt) for instructions before they start
spidering sites.

	If you can get to your site via two different hostnames that give
you different effective top-level directories, you'll need your exclusion
entries in two different robots.txt files.
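
	As a rough illustration of the two points above, here is a minimal
Python sketch of where a robot looks for its instructions. The helper name
robots_url is hypothetical, and the URLs are the ones from Cyndi's message;
it only shows that each hostname maps to exactly one robots.txt, at the top
level.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the single robots.txt URL that applies to page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and hostname, discard the page's own path:
    # the file is always /robots.txt at the top level of that host.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The same file reached through two hostnames is governed by two
# different robots.txt files.
print(robots_url("http://www.mydomain.com/keepout/privatepage.html"))
# http://www.mydomain.com/robots.txt
print(robots_url("http://www.best.com/~cnorman/keepout/privatepage.html"))
# http://www.best.com/robots.txt
```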

	To block robots from following URLs of the form
<http://www.mydomain.com/keepout/privatepage.html>, you'll need a
robots.txt file at the top level of www.mydomain.com, i.e. in the directory
that contains the keepout directory. To prevent
robots from following the URL
<http://www.best.com/~cnorman/keepout/privatepage.html>, you'll need an
entry in the main robots.txt file, adjacent to Best's home page. For this
case, since you can't edit that file directly, you'll probably want to send
a note to webmaster@best.com, asking them to create a file named robots.txt
at the top level of their content tree containing these two lines:

User-agent: *
Disallow: /~cnorman/
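
	One way to sanity-check those two lines before sending them off is
Python's standard urllib.robotparser module. This is only a sketch: the
agent name ExampleBot and the index.html URL are made up for illustration,
and a real robot's parser may differ in details.

```python
from urllib.robotparser import RobotFileParser

# The two lines proposed above, exactly as they would appear in the file.
RULES = """\
User-agent: *
Disallow: /~cnorman/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(url, agent="ExampleBot"):
    """Return True if a robot obeying RULES may fetch url."""
    return parser.can_fetch(agent, url)


# Everything under /~cnorman/ is excluded, including subdirectories.
print(allowed("http://www.best.com/~cnorman/keepout/privatepage.html"))  # False
# Other pages on the same host are unaffected.
print(allowed("http://www.best.com/index.html"))  # True
```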

	You can also exclude specific pages (but not whole directories)
with META tags -- check out the page at WebCrawler for details on this
method.
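
	For reference, the per-page tag usually takes the form
<meta name="robots" content="noindex, nofollow"> inside the page's HEAD;
that syntax is the common convention rather than something spelled out in
this message, so check it against the WebCrawler page. The Python sketch
below (class and function names are hypothetical) reads a page the way a
cooperating robot might and reports which directives it finds.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots directives declared in an HTML page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# A made-up private page carrying the exclusion tag.
PAGE = """<html><head>
<title>Private page</title>
<meta name="robots" content="noindex, nofollow">
</head><body>Private content</body></html>"""

print(sorted(robots_directives(PAGE)))  # ['nofollow', 'noindex']
```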


						Chris
PS-If you can get to your home directory via a full path, as well as by a
user (tilde) path, you might want to exclude that as well.

At 9:41 PM -0500 02/04/98, Cyndi Norman wrote:

>A word about robots.txt.  It doesn't work for most of us.  I set it up and
>the search engines (altavista) still went to the pages, months later.  I
>had blocked out an entire subdirectory and I know I did the logistics
>correctly.  I asked around my ISP's local newsgroups and it turns out that
>you can only block directories if the robots.txt file is at the top level
>(i.e., if you have a custom domain).
>
>I don't really understand this though.  If I have a custom domain that's
>really a virtual domain, why would robots.txt work?  It would perhaps keep
>out searches of, say, http://www.mydomain.com/keepout/privatepage.html but
>how would it stop a search of the very same file which is also known as,
>http://www.best.com/~cnorman/keepout/privatepage.html ??
>
>People on the groups mentioned alternatives to robots.txt where you put an
>HTML command on each page you don't want searched.  But I'm afraid it
>didn't make any sense to me.  Is there someone who could give me the code
>(I know HTML and could probably implement it with a brief explanation) to
>block searches of individual pages?  If there is a way to block directories
>or FTP sites from search engines, I'd appreciate that very much.

--
Chris Pepper | National Audubon Society: Web & List Manager
212 979 3092 |    <http://www.audubon.org/staff/pepper/>

