Great Circle Associates List-Managers
(June 2003)
 


Subject: Re: standards for iso encoding subject lines?
From: Nick Simicich <njs @ scifi . squawk . com>
Date: Tue, 10 Jun 2003 12:19:54 -0400
To: List Managers <list-managers @ greatcircle . com>
In-reply-to: <v0421010fbb06888a9c64@[66.92.48.201]>
References: <877k7zayxc.fsf@windlord.stanford.edu><5.2.1.1.0.20030606035859.07cf7d88@199.74.151.1><877k7zayxc.fsf@windlord.stanford.edu>

At 11:09 AM 2003-06-06 -0700, Brent Chapman wrote:

>At 10:38 AM -0700 6/6/03, Russ Allbery wrote:
>>Nick Simicich <njs@scifi.squawk.com> writes:
>>
>> > Just as a point:  This is a really poorly thought out RFC.  You might
>> > want to decode those in your MTA or mailing list manager before
>> > forwarding them to your subscribers.  You *can't* safely do so.
>>
>>Why do you want to do that?  Certainly you're not allowed to do that
>>within that RFC because doing so would break the e-mail protocol.  The
>>whole reason why RFC 2047 exists is because 8-bit characters are not
>>allowed in RFC 822 and RFC 2822 headers.

Yes, but that does not mean that you have to put the encoding of them 
there.  Were I writing the standard, I would have required that the old 
headers contain the best possible encoding in seven bit characters.  If no 
such encoding existed, leave the header out or leave the clause out.  The 
stuff that was encoded should have been elsewhere.  Have a special header 
for it, or (probably better) put it in a new body section, or 
something.  Mail headers have traditionally been human readable.  This RFC 
violated that basic principle, and that, in and of itself, made it a bad idea.

>Let's say I have an archive of email messages to a list, and want to 
>create an index by subject, or to enable searches by subject.  How am I 
>supposed to do this if the Subject header is encoded, and I'm not supposed 
>to decode it except for display?

I'm not sure.  Consider that all handling of the data must be done in a
binary-safe manner.  The encoder could encode a binary zero byte, for
example, which will break most typical string handling.  Also, the data
could well be in a double-byte character set, with no representation in
7-bit characters at all.
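
To illustrate the binary-safety point, here is a minimal sketch. It assumes
Python's standard email.header module, which is not something used in this
thread, and the helper name decode_parts is hypothetical. It shows that an
encoded word can decode to bytes containing a binary zero.

```python
from email.header import decode_header

def decode_parts(raw_header):
    """Return a list of (bytes, charset) pairs for a raw header value."""
    parts = []
    for payload, charset in decode_header(raw_header):
        # decode_header returns str (not bytes) when nothing was encoded.
        if isinstance(payload, str):
            payload = payload.encode("utf-8", "surrogateescape")
        parts.append((payload, charset))
    return parts

# "=00" inside a Q-encoded word decodes to a NUL byte.
parts = decode_parts("=?iso-8859-1?q?abc=00def?=")
```

Anything that hands the decoded bytes to NUL-terminated string routines
would see only "abc" here, so the decoded value has to be carried around as
a length-counted byte string together with its charset label.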

A parser that assumes every character fits in a normal (say) C string
representation is probably already broken.  A parser that tokenizes into
words is also likely broken.  A space is not a space, punctuation is not
punctuation, and a word character is not a word character, except in the
context of the character set... which is not constant for the document, if
I remember right.
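
A rough sketch of what charset-aware tokenizing involves, again assuming
Python's standard library. The helpers encoded_word and header_words are
hypothetical names. The sample header switches charset between encoded
words, and the second word contains an ideographic space (U+3000), which a
byte-oriented tokenizer would not recognize as a space.

```python
import base64
from email.header import decode_header

def encoded_word(text, charset):
    """Build one RFC 2047 B-encoded word (illustration helper only)."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?b?{payload}?="

def header_words(raw_header):
    """Decode each part with its own charset, then split on Unicode whitespace."""
    words = []
    for payload, charset in decode_header(raw_header):
        if isinstance(payload, bytes):
            try:
                text = payload.decode(charset or "ascii", errors="replace")
            except LookupError:
                # Charset unknown to Python: keep the bytes visible as Latin-1.
                text = payload.decode("latin-1")
        else:
            text = payload  # header had no encoded words at all
        words.extend(text.split())
    return words

# Two encoded words in two different charsets on one header line.
raw = (encoded_word("caf\u00e9", "iso-8859-1") + " "
       + encoded_word("\u65e5\u672c\u3000\u8a9e", "shift_jis"))
words = header_words(raw)
```

Splitting only after conversion to Unicode is the design choice that matters
here: the question of what counts as a space is answered once, in one
character model, rather than per charset.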

I have considered the simple scheme of looking for this type of encoding in
the headers and bouncing such messages back to their senders, for the lists
I run which are supposed to be English.  The thing that stops me is that,
just as in the case of HTML body sections, the people writing the e-mail
often did not know what they had done to cause the MIME encoding to be
generated.

This was the original straw that caused me to write demime, and this is why 
I would want to do something similar for encoded headers.  But the standard 
makes it impossible.

If your parser can handle all of these things (including, if I remember 
correctly, character changes within the line) then it can index things 
encoded in this manner.
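
One possible shape for such an index, sketched under the assumption that
decoding to Unicode for indexing purposes is acceptable. It uses Python's
email.header.decode_header and make_header. The names subject_key,
build_index and the sample archive are hypothetical.

```python
from collections import defaultdict
from email.header import decode_header, make_header

def subject_key(raw_subject):
    """Decode a raw Subject value to a case-folded Unicode search key."""
    try:
        text = str(make_header(decode_header(raw_subject)))
    except (LookupError, UnicodeError):
        # Unknown charset or undecodable bytes: index the raw form instead.
        text = raw_subject
    return " ".join(text.casefold().split())

def build_index(messages):
    """Map each word of each decoded subject to the set of message ids."""
    index = defaultdict(set)
    for msg_id, raw_subject in messages:
        for word in subject_key(raw_subject).split():
            index[word].add(msg_id)
    return index

# The same word arrives in two different charsets and encodings.
archive = [
    (1, "Re: =?iso-8859-1?q?Caf=E9_menu?="),
    (2, "=?utf-8?b?Y2Fmw6k=?= hours"),
]
index = build_index(archive)
```

The stored messages are left untouched; only the index holds decoded text,
so nothing 8-bit is ever written back into a header.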

>Similar problem with Base64-encoded message bodies (jeez, I hate Outlook).

There you have a different solution, in that you can decode them and
re-encode them as quoted-printable (QP).  Outlook is not the only guilty
party there, though.
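
A minimal sketch of that conversion for a single text body part, assuming
Python's standard base64 and quopri modules. The function name
base64_body_to_qp is hypothetical.

```python
import base64
import quopri

def base64_body_to_qp(b64_body: bytes) -> bytes:
    """Decode a Base64 body and re-encode the same bytes as quoted-printable."""
    raw = base64.b64decode(b64_body)
    return quopri.encodestring(raw)

# Build a sample Base64 body from Latin-1 text, then convert it.
original = "Caf\u00e9 menu for Tuesday\n".encode("iso-8859-1")
b64_body = base64.encodebytes(original)
qp_body = base64_body_to_qp(b64_body)
```

A real filter would also have to rewrite the part's
Content-Transfer-Encoding header to match, and should leave non-text parts
alone.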

>I agree with Nick here: ISO-encoded subject lines are a "solution" to a 
>non-problem, where the people putting forth the solution apparently didn't 
>think through most of the consequences of what they were proposing.

I believe that the problem was real enough, especially for the people who
use multi-byte character sets.  However, I do not correspond with anyone
in those character sets, because I don't read any language that requires
them, so it is not a problem for me personally.

What I am extremely affronted by is the encoding of ordinary subject lines
simply because (I've seen this) someone selected 8859-1 and then didn't use
any characters outside ASCII.  More frequently, someone simply picks the
wrong tick mark, and this throws the whole header into encoding.
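
Detecting the first case is mechanical. A hedged sketch, assuming Python's
email.header module, with the hypothetical helper name needlessly_encoded:
it reports headers that contain encoded words but decode to nothing except
printable ASCII.

```python
from email.header import decode_header

def needlessly_encoded(raw_header):
    """True if the header uses encoded words yet decodes to printable ASCII."""
    parts = decode_header(raw_header)
    if all(charset is None for _, charset in parts):
        return False  # nothing was encoded in the first place
    for payload, charset in parts:
        # Once any encoded word is present, decode_header returns bytes.
        try:
            text = payload.decode(charset or "ascii")
        except (LookupError, UnicodeError):
            return False  # cannot tell, so do not flag it
        if not (text.isascii() and text.isprintable()):
            return False
    return True

plain_but_encoded = needlessly_encoded("=?iso-8859-1?q?Meeting_notes?=")
wrong_tick_mark = needlessly_encoded("=?windows-1252?q?Nick=92s_notes?=")
```

The second sample is the tick mark case: byte 0x92 in windows-1252 is a
typographic apostrophe, so the header really does need encoding as written,
even though an ASCII apostrophe would have avoided it.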

I feel that the standard should have forbidden the encoding of
non-printable characters.  There is no good reason to encode a newline, for
example, or a control character, or a binary zero, nor are there good
reasons to have same in subject lines -- in ASCII, these characters may not
be in the text of the subject for a simple reason:  They are used to
delimit the line by the parsers.  If the character does not resolve to a
printable code point in the character set used, the standard should permit
the message to be bounced/trashed/non-delivered, at the least.  And if
someone has a handler that locally can deal with eight-bit characters in
the headers, the decoded header should not break header parsing that is
based on CR/LF or LF ending a "line", a space at the beginning of the line
showing a continuation, and a blank line ending the headers.  It should be
possible to store a decoded header in an eight-bit safe message store
without breaking reparsing of the header.  (In other words, there were good
reasons for forbidding these things, and the standard should not have made
this a free-for-all.)
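
A list manager can at least check for this locally. The sketch below
assumes Python's email.header and unicodedata modules. The name
unsafe_characters and the choice of Unicode categories (control characters
plus line and paragraph separators) are assumptions for illustration, not
anything the standard defines.

```python
import unicodedata
from email.header import decode_header

# Cc = control characters (NUL, CR, LF, ...), Zl/Zp = line/paragraph separators.
UNSAFE_CATEGORIES = {"Cc", "Zl", "Zp"}

def unsafe_characters(raw_header):
    """Return decoded characters that could break line-based header parsing."""
    found = []
    for payload, charset in decode_header(raw_header):
        if isinstance(payload, bytes):
            try:
                text = payload.decode(charset or "ascii")
            except (LookupError, UnicodeError):
                # Assumption: Latin-1 fallback keeps every byte inspectable.
                text = payload.decode("latin-1")
        else:
            text = payload  # header had no encoded words
        found.extend(ch for ch in text
                     if unicodedata.category(ch) in UNSAFE_CATEGORIES)
    return found

# An encoded CR/LF that would start a new header line once decoded.
smuggled = unsafe_characters("=?iso-8859-1?q?Hi=0D=0ABcc:_x?=")
harmless = unsafe_characters("=?utf-8?q?caf=C3=A9?=")
```

A non-empty result is the signal to bounce or discard the message rather
than store the decoded header.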

--
He said: "There are people from Baath here reporting everything that
goes on. There are cameras here recording our faces. If the Americans
were to withdraw and everything were to return to the way it was before,
we want to make sure that we survive the massacre that would follow
as Baath go house to house killing anyone who voiced opposition to
Saddam. In public, we always pledge our allegiance to Saddam, but in
our hearts we feel something else."
Nick Simicich - njs@scifi.squawk.com 

