At 11:09 AM 2003-06-06 -0700, Brent Chapman wrote:
>At 10:38 AM -0700 6/6/03, Russ Allbery wrote:
>>Nick Simicich <njs@scifi.squawk.com> writes:
>>
>> > Just as a point: This is a really poorly thought out RFC. You might
>> > want to decode those in your MTA or mailing list manager before
>> > forwarding them to your subscribers. You *can't* safely do so.
>>
>>Why do you want to do that? Certainly you're not allowed to do that
>>within that RFC because doing so would break the e-mail protocol. The
>>whole reason why RFC 2047 exists is because 8-bit characters are not
>>allowed in RFC 822 and RFC 2822 headers.
Yes, but that does not mean the encoded form has to go in those headers.
Were I writing the standard, I would have required that the existing
headers carry the best possible rendering in seven-bit characters. If no
such rendering existed, leave the header out or leave the clause out. The
encoded material should have gone elsewhere: in a special header, or
(probably better) in a new body section. Mail headers have traditionally
been human readable. This RFC violated that basic principle, and that, in
and of itself, made it a bad idea.
>Let's say I have an archive of email messages to a list, and want to
>create an index by subject, or to enable searches by subject. How am I
>supposed to do this if the Subject header is encoded, and I'm not supposed
>to decode it except for display?
I'm not sure. Consider that all handling of the data must be done in a
binary-safe manner. The encoder could encode a binary zero byte, for
example, which will screw up most typical string handling. Also, the data
could well be in a double-byte character set. It might well not be
representable in 7-bit characters at all.
A parser that assumes every character fits in an ordinary (say) C string
is probably already broken. A parser that tokenizes into words is likely
broken too. A space is not a space, punctuation is not punctuation, and a
word character is not a word character, except in the context of the
character set... which is not constant for the document, if I remember
right.
I have considered the simple scheme of looking for this type of encoding in
the headers and bouncing such messages back to the sender, for the lists I
run which are supposed to be in English. What stops me is that, just as
with HTML body sections, the people writing the e-mail do not always know
what they did that caused the MIME parts to be generated.
That was the original straw that caused me to write demime, and it is why
I would want to do something similar for encoded headers. But the standard
makes it impossible.
If your parser can handle all of these things (including, if I remember
correctly, character set changes within the line) then it can index things
encoded in this manner.
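
As a rough sketch of what such an indexer has to do, the following uses
Python's standard email.header.decode_header to decode each encoded-word
with its own character set before tokenizing. The helper names
subject_text and subject_tokens and the sample Subject value are made up
for illustration. Splitting on \w+ is an assumption that suits
space-delimited languages; it will not segment scripts such as Chinese or
Japanese.

```python
import re
import unicodedata
from email.header import decode_header


def subject_text(raw):
    """Decode an RFC 2047 header value to Unicode text (hypothetical helper)."""
    chunks = []
    for data, charset in decode_header(raw):
        if isinstance(data, bytes):
            try:
                data = data.decode(charset or "ascii", errors="replace")
            except LookupError:
                # Unknown charset label: fall back rather than crash.
                data = data.decode("ascii", errors="replace")
        chunks.append(data)
    return "".join(chunks)


def subject_tokens(raw):
    """Return case-folded word tokens suitable for a simple subject index."""
    text = unicodedata.normalize("NFKC", subject_text(raw))
    return re.findall(r"\w+", text.casefold())


# Illustrative header: the charset changes from ISO-8859-1 to UTF-8 mid-line.
raw = "=?iso-8859-1?Q?Caf=E9?= in =?utf-8?B?TcO8bmNoZW4=?="
print(subject_tokens(raw))
```

The index is built from the decoded text only; the stored message keeps its
original encoded header.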
>Similar problem with Base64-encoded message bodies (jeez, I hate Outlook).
There you have a different solution, in that you can decode them and
re-encode in QP. Outlook is not the only guilty party there, though.
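
A minimal sketch of that re-encoding, using Python's standard base64 and
quopri modules. The function name base64_body_to_qp and the sample body are
assumptions for illustration. A real filter would also have to rewrite the
Content-Transfer-Encoding header and leave genuinely binary parts alone.

```python
import base64
import quopri


def base64_body_to_qp(b64_body):
    """Decode a Base64 body and re-encode it as quoted-printable (sketch)."""
    raw = base64.b64decode(b64_body)
    return quopri.encodestring(raw)


# Illustrative body: mostly ASCII text with one ISO-8859-1 byte (0xE9).
original = b"Caf\xe9 au lait\n"
qp_body = base64_body_to_qp(base64.b64encode(original))
print(qp_body)
```

The ASCII portion of the text stays readable in the QP form, which is the
point of preferring it over Base64 for mostly-text bodies.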
>I agree with Nick here: ISO-encoded subject lines are a "solution" to a
>non-problem, where the people putting forth the solution apparently didn't
>think through most of the consequences of what they were proposing.
I believe that the problem was real enough, especially for the people who
use multi-byte character sets. However, I do not correspond with anyone in
those character sets, because I don't read any language that requires
them, so it is not a problem I run into myself.
What I am extremely affronted by is the encoding of ordinary subject lines
simply because (I've seen this) someone selected ISO-8859-1 and then didn't
use any characters outside ASCII. More frequently, someone simply picks the
wrong tick mark, and this throws the whole header into encoding.
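
Both cases can be spotted mechanically. The sketch below, again built on
Python's email.header.decode_header, flags a header that was encoded even
though it decodes to plain printable ASCII, and lists the non-ASCII
characters that forced the encoding otherwise. The function names and the
two sample subjects are hypothetical.

```python
from email.header import decode_header


def decoded_subject(raw):
    """Decode an RFC 2047 header value to Unicode text (hypothetical helper)."""
    chunks = []
    for data, charset in decode_header(raw):
        if isinstance(data, bytes):
            try:
                data = data.decode(charset or "ascii", errors="replace")
            except LookupError:
                data = data.decode("ascii", errors="replace")
        chunks.append(data)
    return "".join(chunks)


def needlessly_encoded(raw):
    """True if encoded-words are present but the text is printable ASCII."""
    if all(charset is None for _, charset in decode_header(raw)):
        return False  # nothing was encoded in the first place
    return all(32 <= ord(c) < 127 for c in decoded_subject(raw))


def non_ascii_chars(raw):
    """Characters outside ASCII that forced the header into encoding."""
    return [c for c in decoded_subject(raw) if ord(c) > 126]


print(needlessly_encoded("=?iso-8859-1?Q?Meeting_notes?="))
print(non_ascii_chars("=?windows-1252?Q?Don=92t_panic?="))
```

In the second sample the only offending character is the curly apostrophe
(U+2019), which is the "wrong tick mark" case.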
I feel that the standard should have forbidden the encoding of
non-printable characters. There is no good reason to encode a newline, for
example, or a control character, or a binary zero, nor are there good
reasons to have them in subject lines -- in ASCII, these characters may not
appear in the text of the subject for a simple reason: parsers use them to
delimit the line. If a character does not resolve to a printable code point
in the character set used, the standard should at the least permit the
message to be bounced, trashed, or left undelivered. And if someone
has a handler that locally can deal with eight bit characters in the
headers, the decoded header should not break header parsing that is based
on CR/LF or LF ending a "line", a space at the beginning of the line
showing a continuation, and a blank line ending the headers. It should be
possible to store a decoded header in an eight-bit safe message store
without breaking reparsing of the header. (In other words, there were good
reasons for forbidding these things, and the standard should not have made
this a free-for-all.)
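
The kind of check argued for here might look like the following sketch. It
decodes the header with Python's email.header.decode_header and reports any
control characters (Unicode category Cc, which covers NUL, CR, LF and tab)
that the encoding was hiding. The function name control_chars_in and the
sample header are assumptions for illustration; what to do with a flagged
message is left as local policy.

```python
import unicodedata
from email.header import decode_header


def control_chars_in(raw):
    """List control characters hidden inside an encoded header value."""
    found = []
    for data, charset in decode_header(raw):
        if isinstance(data, bytes):
            try:
                data = data.decode(charset or "ascii", errors="replace")
            except LookupError:
                data = data.decode("ascii", errors="replace")
        found.extend(c for c in data if unicodedata.category(c) == "Cc")
    return found


# Illustrative header smuggling a NUL byte and a CR/LF pair.
hostile = "=?iso-8859-1?Q?Hi=00there=0D=0AX-Injected:_1?="
print(control_chars_in(hostile))
```

A decoded header that yields an empty list here can be written to an
eight-bit safe store without introducing a stray line break or NUL.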
--
He said: "There are people from Baath here reporting everything that
goes on. There are cameras here recording our faces. If the Americans
were to withdraw and everything were to return to the way it was before,
we want to make sure that we survive the massacre that would follow
as Baath go house to house killing anyone who voiced opposition to
Saddam. In public, we always pledge our allegiance to Saddam, but in
our hearts we feel something else."
Nick Simicich - njs@scifi.squawk.com