Great Circle Associates List-Managers
(August 2002)
 


Subject: Re: The gmane issue
From: JC Dill <inet-list @ vo . cnchost . com>
Date: Sat, 17 Aug 2002 14:09:36 -0700
To: List Managers <List-Managers @ greatcircle . com>
In-reply-to: <11695.1029614582@kanga.nu>
References: <5.0.0.25.2.20020815161849.03eac910@pop3.vo.cnchost.com> <5.0.0.25.2.20020817082440.02fbce20@pop3.vo.cnchost.com>

On 01:03 PM 8/17/02, J C Lawrence wrote:
 >On Sat, 17 Aug 2002 09:08:01 -0700
 >JC Dill <inet-list@vo.cnchost.com> wrote:
 >> On 06:28 PM 8/16/02, Nick Simicich wrote:
 >>> At 10:04 AM 2002-08-16 -0700, JC Dill wrote:
 >
 >> How hard is it to expect webmasters to take simple steps to announce
 >> their policy?  They need a *single* file indicating if spidering is
 >> allowed or not, and they need a *single* header in the webpage to
 >> indicate if it can be cached or not.
 >
 >> 	THIS WORKS
 >
 >> The reason it works is A) it is simple, and B) the burden is shared by
 >> those with the content (who have to properly post their policy in the
 >> robots.txt file and in their webpage headers) and those who seek to
 >> search, mirror, or archive it (who must follow the content provider's
 >> policy as it is conveyed in the file and header).
 >
 >robots.txt intercepts the data acquisition path that a spider normally
 >and already uses in spidering.  It just adds a new function point:
 >grabbing that file too, in exactly the same way it's already grabbing
 >other web files.  robots.txt doesn't add a new transport to the mix, it
 >doesn't add a new protocol to the mix, it doesn't add a different third
 >party service to the mix, and it doesn't require any changes to the web
 >servers and their configuration and normal function.  It merely very
 >slightly tweaks what the robot was doing already: sucking nodes from
 >URLs; and what the web server was doing stays exactly the same: serving
 >files.
 >
 >Adding an equivalent of robots.txt to mailing lists under non-SMTP
 >protocols

Where did this come from?

My suggestion is that just as web spiders can find a robots.txt file 
(telling the spider what is or is not OK to search) on a webserver, an 
email query to a mailing list server should be able to find a policy 
document stating that server's policy.  The web query is done with HTTP; 
the email query would be done with SMTP.  I thought my suggestion of a 
"similar system" (and not the "same system") made that clear, as did my 
suggestion that the data could be had in TWO ways (method A and method 
B).  Method A is SMTP.

I *also* suggested that if the mailing list server has an associated 
website, the policy be made available on that website (method B).  I 
never said this should be required, or suggested that it should be the 
only method of finding the policy.  It does so happen, though, that for 
mailing lists with websites, making the policy available through the 
website would make it easier to find for PEOPLE who want to know what 
the policy is.  Most people are much more adept at going to the list 
server's webpage and clicking on the various links than they are at 
sending email queries to the mail server.
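
For method B, the equivalent sketch is an ordinary HTTP fetch of a policy 
page, assuming the list's website publishes one at some advertised URL 
(the path below is hypothetical, not a proposed standard):

    # Hypothetical method B: read the policy from the list's website.
    import urllib.request

    def fetch_policy(list_site):
        url = "http://%s/list-policy.txt" % list_site   # illustrative path
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8", "replace")

    # e.g. print(fetch_policy("www.greatcircle.com"))

An archiver could try this first and fall back to the SMTP query when the 
list has no website.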

 >violates the above models.  Adding robots.txt as a
 >pre-subscription check violates the above models for both "spider" and
 >list server as well.  Adding a flag header to the subscribe handshake
 >requires no structural changes to the "spider" and damned close to no
 >changes to the list server (which already sends those messages with
 >custom content etc anyway).
 >
 >> Here is the meta tag format that tells all caches on the Internet not
 >> to cache your webpage:
 >
 >> <META NAME="ROBOTS" CONTENT="NOARCHIVE">
 >
 >> Here is the format if you want to just specify no archive by google
 >> alone:
 >
 >> <META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">
 >
 >> This is not hard.  This WORKS.
 >
 >Actually, it works rather poorly

The googlebot extension was apparently created by Google to allow webpages 
to exclude this one specific cache while permitting other caches to cache 
the page (and vice versa), and it works PERFECTLY.  IMHO it's a much more 
elegant solution than Deja's x-no-archive hack.
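
For instance, a page that wants to stay out of Google's cache while 
remaining cacheable everywhere else would carry only the GOOGLEBOT tag 
(an illustrative fragment, not taken from any real page):

    <head>
      <title>Example list archive page</title>
      <meta name="GOOGLEBOT" content="NOARCHIVE">
    </head>

Swapping in the ROBOTS form instead tells every compliant cache to skip 
the page.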

 >and is *very* non-standardised.

The cacheability directives that these meta tags express are covered in 
depth in section 14.9 (Cache-Control) of RFC 2616 (HTTP/1.1), which makes 
them standardized in my book.  My experience has been that most caches 
comply with these tags quite well.
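
The header form of those directives, which is what section 14.9 actually 
defines, looks like this in an HTTP response, with the meta http-equiv 
line being the common in-page shorthand (honored only by caches that 
parse meta tags at all):

    Cache-Control: no-cache, no-store

    <meta http-equiv="Cache-Control" content="no-cache, no-store">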

 >Yes, it might work with one cache, but your odds across the wider range
 >of caches out there are damned low.

Yes, some web caches ignore the standard meta tag (also documented at the 
URL I cited and you snipped) saying that no cache should ever cache the 
page.  This doesn't mean the system is broken; it means those caches are 
broken.  You don't fix that by changing the system (using some other 
method of determining whether caching is OK), but by fixing the broken 
caches that ignore the meta tag that says not to cache.

jc



