On 06:28 PM 8/16/02, Nick Simicich wrote:
>At 10:04 AM 2002-08-16 -0700, JC Dill wrote:
>>So, my question to list-managers: Do you support the suggestion that list
>>managers/owners have some responsibility (ala robots.txt) for telling
>>those who wish to mirror or archive the list what the list rules are?
>The reality is that the "robots.txt" file should be a list of directories
>that robots are allowed into, to spider. The concept of needing to post a
>keep out sign to preserve existing intellectual property rights is
It works very well, yet you call it broken. Meanwhile we have NO such
system for mailing lists and you propose one that has far less chance of
actually working because it isn't fully automated.
> I agree that the original intent was that automated reading did
>not violate the intent or scope of the publishing, but if you note, google
>has a copy of everything - you can pull up their copy of any spidered web
>page simply by using the google toolbar and asking for the cached copy - I
>did that tonight when someone had taken a site down and I wanted to see
>what they used to advertise. (Chuq, under my model of the world, google
>would look for a robots.txt file and if they did not find one they would
>not index your site -- since I think that many sites want to be indexed, I
>think that they would create these files).
Perhaps you should read the Google FAQ on robots.txt and caching:
How hard is it to expect webmasters to take simple steps to announce their
policy? They need a *single* file indicating if spidering is allowed or
not, and they need a *single* header in the webpage to indicate if it can
be cached or not.
The reason it works is A) it is simple, and B) the burden is shared by
those with the content (who have to properly post their policy in the
robots.txt file and in their webpage headers) and those who seek to search,
mirror, or archive it (who must follow the content provider's policy as it
is conveyed in the file and header).
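
To make the spider's half of that shared burden concrete, here is a
minimal sketch in Python of a robot checking a site's robots.txt before
fetching a page. The robots.txt content, the "ExampleBot" name, and the
example.com URLs are made up for illustration; only the standard library's
urllib.robotparser is used.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, as a webmaster might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def may_spider(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    for path in ("/public/index.html", "/private/notes.html"):
        url = "http://example.com" + path
        print(path, may_spider(ROBOTS_TXT, "ExampleBot", url))
```

A real spider would download the file from the site first (the same class
has set_url() and read() for that) and then make the same can_fetch() call.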
>Thus, if we *must* have an automated procedure of any sort, I think that
>the procedure should be:
Yes, we *must*, or else you will continue to find lists being mirrored and
archived without honoring the desires of the list owner. This genie isn't
going back into the bottle, so let's make some policies that control it.
>The response message to the subscribe may contain the specific phrase, "Off
>site archives allowed". If this phrase is not present, no off site archive
>should confirm their subscription without manually checking with the
>listowner. An alternative would be "obfuscated off site archives allowed"
>which would allow off site archives only if they obfuscate anything that
>looked like an e-mail address (any string of characters that can be parsed
>into a local part and a host part according to the typical rules used to
>parse an e-mail address) into something that can't be converted back into
>an e-mail address.
This is way too convoluted. We are talking about automated systems. We
need something with the simplicity of the robots.txt format, or of the meta
tag format: Data follows: Field, variable.
Here is the meta tag format that tells all caches on the Internet not to
cache your webpage:
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
Here is the format if you want to just specify no archive by google alone:
<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">
This is not hard. This WORKS.
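
For what the cache's side of this might look like, here is a rough sketch
in Python that scans a page for a NOARCHIVE directive addressed either to
all robots or to one named robot. The class and function names are
invented for this example, and a real cache surely handles more cases than
this does.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Records whether a page carries NOARCHIVE for a given robot."""

    def __init__(self, robot_name):
        super().__init__()
        # Honor both the generic "ROBOTS" tag and the robot-specific one.
        self.names = {"robots", robot_name.lower()}
        self.noarchive = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() not in self.names:
            return
        directives = {d.strip().lower() for d in attr.get("content", "").split(",")}
        if "noarchive" in directives:
            self.noarchive = True

def may_cache(html, robot_name):
    """Return False if the page asks robot_name (or all robots) not to cache."""
    parser = RobotsMetaParser(robot_name)
    parser.feed(html)
    return not parser.noarchive

if __name__ == "__main__":
    page = '<html><head><META NAME="GOOGLEBOT" CONTENT="NOARCHIVE"></head></html>'
    print(may_cache(page, "googlebot"), may_cache(page, "otherbot"))
```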
A comparable system for mailing lists and gmane can just as simply be done
by creating policy header values for email from the list. Sending a query
to the mailing list server asking for the policy file for the list would
return a file containing all the policy headers. Options could include:
X-list-archive: allow all
X-list-archive: deny all
X-list-archive: deny gmane, allow all
X-list-mirror: allow all
X-list-mirror: deny all
X-list-policy: <archive> deny gmane, allow all
X-list-policy: <mirror> deny all
X-list-policy: <notify> mirror; <listowner-email-address>
Or some other similar header scheme.
A mirror-notify value would tell the mirror to send email to the listowner
address when subscribing, to notify the owner that a mirror has subscribed.
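
To show how little an archive would need to do to follow such a policy,
here is a sketch in Python of reading the X-list-archive and X-list-mirror
forms above. These headers are only the proposal from this message, not an
existing standard, so the rule semantics are my assumptions: rules are read
left to right, the first rule naming the agent (or "all") wins, and a list
that publishes no policy is treated as allowing the action (opt-out).

```python
from email import message_from_string

# Hypothetical policy file, as a list server might return it.
POLICY = (
    "X-list-archive: deny gmane, allow all\n"
    "X-list-mirror: deny all\n"
    "\n"
)

def is_allowed(policy_text, action, agent):
    """Return True if agent may perform action ("archive" or "mirror")."""
    msg = message_from_string(policy_text)
    # Header lookup in the email package is case-insensitive.
    for value in msg.get_all("X-list-" + action, []):
        for rule in value.split(","):
            parts = rule.split()
            if len(parts) != 2 or parts[0].lower() not in ("allow", "deny"):
                continue  # skip anything that is not "<allow|deny> <who>"
            verdict, who = parts[0].lower(), parts[1].lower()
            if who in (agent.lower(), "all"):
                return verdict == "allow"
    # Assumption: no matching rule means the list has not opted out.
    return True

if __name__ == "__main__":
    print(is_allowed(POLICY, "archive", "gmane"))
    print(is_allowed(POLICY, "archive", "otherarchive"))
    print(is_allowed(POLICY, "mirror", "gmane"))
```

Flipping the final return to False would turn this into the opt-in
behavior Nick is arguing for; the parsing stays the same either way.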
>Good manners requires that the putative off-site archive ask permission
>anyway. For one thing, it lets the list owner know that the archive will
*Good* manners are gone from the Internet. What we need is not some Miss
Manners, suitable-for-dining-with-Heads-of-State type of solution, but
simply a workable one for the masses.
>The lack of such a permission phrase should indicate that the list may not
>automatically be added to an off-site archive. Someone desiring to archive
>such a list publicly must check with the list owner.
You can wish this all you want, but it's never going to happen. Just as
you could wish that Canter and Siegel would stop their green card spam, but
wishing it didn't make it stop, or make spam go away. We have to deal with
the reality of what is, not what we wish it would be. We NEED to create an
automated system for those who wish to offer offsite mirrors and archives
because such sites WILL exist, and they WILL mirror and archive unless we
make it EASY for them to learn the list policy and follow it.
>The format of robots.txt should never have been accepted: what it does is
>to force people to opt-out of having their intellectual property pirated.
What it does is make search sites like Google exist, and work. Without it,
we would not have the very useful searching and indexing of content on the
Internet. The huge success of the Internet as a place to find valuable
information is almost entirely due to the success of the various search sites.
Take DejaGoogle as an example. Before dejanews, before DejaGoogle, there
was no easy way to search large quantities of past usenet posts. Now it's
very easy to search these past posts, and there are many websites that now
point to usenet posts. This makes this data *accessible* to the net at
large and more and more are accessing it every day.
gmane (and other sites like it) will do the same for mailing lists. IMHO,
this is a good thing. There should be automated systems designed so that
those who want to keep their lists out of gmane (and others like it,
present and future) can easily do so. But complaining that you don't like
the premise isn't going to make it go away, and declaring that all of the
onus should be on the other party simply isn't going to work. The
Internet-using public, as a group, will want this service, they will want
mailing lists archived and easily searchable, just as they love search
>Because what we are considering doing is adding a whole level of opt-out
>permissioning, perhaps with multiple headers in every mailing list posting
>that is archived anywhere, simply to control rude behavior -- people who
>need to be able to look left and right and see a "posted" sign in each
>direction so that they don't come onto your property and trample your truck
>garden and kill Bambi -- because of one or two rude people per five years.
We have it with spiders and with web caches, and it WORKS. It works
because, by and large, this is a service that the Internet users WANT, and
for those who don't want it, there are systems to indicate this.
Spam needs to be opt-in because, by and large, users (people with email
boxes) don't want spam. Staying out of archives and caches needs to be
opt-out because A) this can easily be done with automated systems and B)
most users (webmasters, web visitors, list owners, list subscribers) will
want to opt-in. Anything else simply won't work, no matter how ideal the
solution might seem in principle.