On 01:03 PM 8/17/02, J C Lawrence wrote:
>On Sat, 17 Aug 2002 09:08:01 -0700
>JC Dill <firstname.lastname@example.org> wrote:
>> On 06:28 PM 8/16/02, Nick Simicich wrote:
>>> At 10:04 AM 2002-08-16 -0700, JC Dill wrote:
>> How hard is it to expect webmasters to take simple steps to announce
>> their policy? They need a *single* file indicating whether spidering is
>> allowed or not, and they need a *single* header in the webpage to
>> indicate whether it can be cached or not.
>> THIS WORKS
>> The reason it works is A) it is simple, and B) the burden is shared by
>> those with the content (who have to properly post their policy in the
>> robots.txt file and in their webpage headers) and those who seek to
>> search, mirror, or archive it (who must follow the content provider's
>> policy as it is conveyed in the file and header).
>robots.txt intercepts the data acquisition path that a spider normally
>and already uses in spidering. It just adds a new function point:
>grabbing that file too, in exactly the same way it's already grabbing
>other web files. robots.txt doesn't add a new transport to the mix, it
>doesn't add a new protocol to the mix, it doesn't add a different third
>party service to the mix, and it doesn't require any changes to the web
>servers and their configuration and normal function. It merely very
>slightly tweaks what the robot was doing already: sucking nodes from
>URLs; and what the web server was doing stays exactly the same: serving.
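The mechanics described above can be sketched with Python's stdlib robots.txt parser. The bot name and policy lines here are made up for illustration; in a real spider the file would be fetched over HTTP exactly like any other page:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# A real spider would do rp.set_url(".../robots.txt"); rp.read(),
# fetching the file over HTTP just like any other URL.  Here we parse
# a sample policy inline to keep the sketch self-contained:
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Before fetching a page, the spider consults the parsed policy:
print(rp.can_fetch("ExampleBot", "http://www.example.com/index.html"))  # True
print(rp.can_fetch("ExampleBot", "http://www.example.com/private/x"))   # False
```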
>Adding an equivalent of robots.txt to mailing lists under non-SMTP
Where did this come from?
My suggestion is that just as web spiders can find a robots.txt file
(telling the spider what is or is not OK to search) on a webserver, so
should an email querying a mailing list server for its policy find a
policy document. The web query is done with HTTP; the email query would be
done with SMTP. I thought my suggestion of a "similar system" (and not the
"same system") made that clear, as did my suggestion that the data could
be had in TWO ways (method A, method B). Method A is the SMTP query.
I *also* suggested that if the mailing list server has an associated
website, the policy be made available on the website (method B). I never
said this should be required, or suggested that it should be the only
method of finding the policy. It does happen, though, that for mailing
lists with websites, making the policy available through the website would
make it more easily found by PEOPLE who want to know what the policy is.
Most people are much more adept at going to the list server webpage and
clicking on the various links than they are at sending email queries to
the mail server.
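A minimal sketch of method A, assuming a majordomo-style admin address and a hypothetical "policy" command word (a real list manager would define its own syntax):

```python
import smtplib
from email.message import EmailMessage

def build_policy_request(list_server, list_name, sender):
    """Compose the policy query (hypothetical command syntax)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = f"majordomo@{list_server}"    # hypothetical admin address
    msg["Subject"] = f"policy {list_name}"
    msg.set_content(f"policy {list_name}\n")  # hypothetical command word
    return msg

def send_policy_request(msg, list_server):
    # Plain SMTP: the transport the list server already speaks.
    with smtplib.SMTP(list_server) as smtp:
        smtp.send_message(msg)
```

The point of the sketch is that, like robots.txt over HTTP, the query rides the transport already in use; nothing new is added to the mix.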
>violates the above models. Adding robots.txt as a
>pre-subscription check violates the above models for both "spider" and
>list server as well. Adding a flag header to the subscribe handshake
>requires no structural changes to the "spider" and damned close to no
>changes to the list server (which already sends those messages with
>custom content etc anyway).
>> Here is the meta tag format that tells all caches on the Internet not
>> to cache your webpage:
>> <META NAME="ROBOTS" CONTENT="NOARCHIVE">
>> Here is the format if you want to specify no archiving by Google only:
>> <META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">
>> This is not hard. This WORKS.
>Actually, it works rather poorly
The googlebot extension was apparently created by Google to allow webpages
to exclude this one specific cache while permitting other caches to cache
the page (and vice versa), and it works PERFECTLY. IMHO it's a much more
elegant solution than Deja's x-no-archive hack.
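The per-bot behavior of those two meta tags can be sketched as follows. The parsing is deliberately minimal and the bot names are placeholders; real crawlers use full HTML parsers:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Track whether a page may be archived by the named bot."""
    def __init__(self, bot_name):
        super().__init__()
        self.bot_name = bot_name.lower()
        self.archive_ok = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        name = a.get("name", "").lower()
        content = a.get("content", "").lower()
        # "robots" binds every cache; a bot-specific name binds one cache.
        if name in ("robots", self.bot_name) and "noarchive" in content:
            self.archive_ok = False

page = '<html><head><meta name="GOOGLEBOT" content="NOARCHIVE"></head></html>'
google = RobotsMetaParser("googlebot"); google.feed(page)
other = RobotsMetaParser("otherbot"); other.feed(page)
print(google.archive_ok, other.archive_ok)  # False True
```

With the tag addressed to GOOGLEBOT only, Google's cache is excluded while every other cache remains free to archive the page.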
>and is *very* non-standardised.
Cache control for webpages in general is covered in depth in section 14.9
(Cache-Control) of RFC 2616 (HTTP 1.1), which makes it standardized in my
book. My experience has been that most caches comply with these directives
quite well.
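For the header-level mechanism of RFC 2616 section 14.9, a shared cache's storage decision can be sketched like this (the directive handling is simplified to the two directives relevant here):

```python
def may_store(response_headers):
    """Return False if the Cache-Control header forbids a shared
    cache from storing this response (simplified sketch)."""
    cc = response_headers.get("Cache-Control", "")
    directives = {d.strip().lower() for d in cc.split(",")}
    # "no-store" forbids caching outright; "private" forbids shared caches.
    return not ({"no-store", "private"} & directives)

print(may_store({"Cache-Control": "no-store"}))      # False
print(may_store({"Cache-Control": "max-age=3600"}))  # True
```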
>Yes, it might work with one cache, but your bets across the wider range of
>caches out there are damned low probability.
Yes, some web caches ignore the standard meta tag (also found at the URL I
cited and you snipped) that says no cache should ever cache the page. This
doesn't mean the system is broken; it means that those caches are
broken. You don't fix that by changing the system (using some other method
of determining whether caching is OK or not), but by fixing the broken
caches that ignore the meta tag that says not to cache.