On Sat, 17 Aug 2002 09:08:01 -0700
JC Dill <inet-list@vo.cnchost.com> wrote:
> On 06:28 PM 8/16/02, Nick Simicich wrote:
>> At 10:04 AM 2002-08-16 -0700, JC Dill wrote:
> How hard is it to expect webmasters to take simple steps to announce
> their policy? They need a *single* file indicating if spidering is
> allowed or not, and they need a *single* header in the webpage to
> indicate if it can be cached or not.
> THIS WORKS
> The reason it works is A) it is simple, and B) the burden is shared by
> those with the content (who have to properly post their policy in the
> robots.txt file and in their webpage headers) and those who seek to
> search, mirror, or archive it (who must follow the content provider's
> policy as it is conveyed in the file and header).
robots.txt intercepts the data acquisition path that a spider normally
and already uses in spidering. It just adds one new function point:
grabbing that file too, in exactly the same way it's already grabbing
other web files. robots.txt doesn't add a new transport to the mix, it
doesn't add a new protocol to the mix, it doesn't add a different third
party service to the mix, and it doesn't require any changes to the web
servers and their configuration and normal function. It merely very
slightly tweaks what the robot was doing already: sucking nodes from
URLs; and what the web server was doing stays exactly the same: serving
files.
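
To make that concrete, here's a minimal sketch of the shared path using
Python's standard library robotparser (the URL and user-agent below are
illustrative, not anything real):

    # Sketch: the spider fetches robots.txt over the very same HTTP
    # path it uses for every other page -- no new protocol, no new
    # transport, no third party service.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()  # an ordinary HTTP GET, same as any other fetch

    if rp.can_fetch("ExampleBot", "http://www.example.com/archive/"):
        pass  # spider the page as it already would have
    else:
        pass  # the policy file says no; skip it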
Adding an equivalent of robots.txt to mailing lists under non-SMTP
protocols violates the above models. Adding robots.txt as a
pre-subscription check violates the above models for both the "spider"
and the list server as well. Adding a flag header to the subscribe
handshake requires no structural changes to the "spider" and damned
close to no changes to the list server (which already sends those
messages with custom content etc. anyway).
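
For instance (the header name below is entirely made up, this is just a
sketch of the shape of the thing): the list server adds one line to the
confirmation mail it already sends, and the archiver reads that line
before it stores anything:

    # Sketch only: "X-Archiving-Allowed" is a hypothetical header name,
    # not an existing standard.  One added line on the server side; one
    # lookup on the archiver side.
    from email.message import EmailMessage
    from email.parser import Parser

    # List server side: the confirmation it already generates, plus
    # one policy header.
    confirm = EmailMessage()
    confirm["Subject"] = "Welcome to example-list"
    confirm["X-Archiving-Allowed"] = "no"  # the list's policy flag
    confirm.set_content("You have been subscribed to example-list.")

    # Archiver ("spider") side: parse the handshake message it was
    # going to receive anyway.
    msg = Parser().parsestr(confirm.as_string())
    if msg.get("X-Archiving-Allowed", "yes").lower() == "no":
        pass  # policy forbids archiving: unsubscribe, store nothing
    else:
        pass  # archive as usual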
> Here is the meta tag format that tells all caches on the Internet not
> to cache your webpage:
> <META NAME="ROBOTS" CONTENT="NOARCHIVE">
> Here is the format if you want to specify no archive for Google
> alone:
> <META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">
> This is not hard. This WORKS.
Actually, it works rather poorly and is *very* non-standardised. Yes,
it might work with one cache, but your odds across the wider range of
caches out there are damned low.
<<Can you tell I spent the last few months hacking Squid?>>
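
If you want behaviour a shared cache like Squid will actually honour,
put the policy in the HTTP headers rather than in the HTML body: Squid
and friends decide cacheability from headers like Cache-Control, not
from META tags buried in the markup. A minimal sketch (the handler
below is illustrative):

    # Sketch: serve the page with an HTTP-level "do not store" that
    # HTTP/1.1 caches are obliged to respect, unlike a META tag.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class NoCacheHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Cache-Control", "no-store")
            self.end_headers()
            self.wfile.write(b"<html><body>Not cached.</body></html>")

    # HTTPServer(("", 8080), NoCacheHandler).serve_forever()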
--
J C Lawrence
---------(*) Satan, oscillate my metallic sonatas.
claw@kanga.nu He lived as a devil, eh?
http://www.kanga.nu/~claw/ Evil is a name of a foeman, as I live.