From: Wietse Venema on 14 Apr 2010 12:54

Victor Duchovni:
> On Sat, Mar 27, 2010 at 08:53:03PM -0400, Wietse Venema wrote:
>
> > Currently, sites that send valid UTF-8 in MAIL/RCPT commands can
> > make meaningful LDAP queries in Postfix. Lots of MTAs are 8-bit
> > clean internally, so this can actually work today.
> >
> > Do we want to remove this ability from Postfix, or should we add
> > a valid_utf_8() routine in anticipation of a future standardization
> > of UTF8SMTP?
>
> I am a bit reluctant at this time to assume that untyped data coming in
> that looks like UTF-8, really is UTF-8. Even if the LDAP lookup returns
> plausibly useful results, will the UTF-8 envelope survive related
> processing in Postfix?
>
> - PCRE lookups don't currently request UTF-8 support

Meaning it will blow up, or what?

> - Logs don't support non-destructive recording of UTF-8
>   envelopes.

I expect that in the long term, UTF-8 will be the canonical
representation of text in *NIX files, and that we should plan for
that future.

	Wietse
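[A valid_utf_8() routine of the kind Wietse proposes can be sketched minimally. This is an illustrative Python sketch, not Postfix's actual C implementation; it just checks that a byte sequence is well-formed UTF-8.]

```python
def valid_utf_8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Well-formed two-byte sequence ("café"):
print(valid_utf_8(b"caf\xc3\xa9"))  # True
# Lead byte 0xC3 followed by a non-continuation byte:
print(valid_utf_8(b"caf\xc3("))     # False
```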
From: Victor Duchovni on 14 Apr 2010 13:13

On Wed, Apr 14, 2010 at 12:54:47PM -0400, Wietse Venema wrote:

> > I am a bit reluctant at this time to assume that untyped data coming in
> > that looks like UTF-8, really is UTF-8. Even if the LDAP lookup returns
> > plausibly useful results, will the UTF-8 envelope survive related
> > processing in Postfix?
> >
> > - PCRE lookups don't currently request UTF-8 support
>
> Meaning it will blow up, or what?

When passing UTF-8 data to a regexp engine, we need to tell the engine
that it is handling UTF-8 data, or it may produce match sub-expressions
that consist of pieces of characters. Should "a.b" match a Unicode string
where there is a multibyte character between "a" and "b"? What should ${1}
be for "(a*.)" when "a" is followed by a multi-byte character?

More generally, the issue is that we need a larger design in which we
have a canonical data representation inside all the pieces of Postfix,
and conversion logic at all system boundaries. This is much bigger than
LDAP lookups.

> > - Logs don't support non-destructive recording of UTF-8
> >   envelopes.
>
> I expect that in the long term, UTF-8 will be the canonical
> representation of text in *NIX files, and that we should plan
> for that future.

Yes, of course. The LDAP IS_ASCII check will be easy to remove, and
LDAP supports Unicode, so that will be the easy part. But first we
need a "contract" that all inputs to the dictionary layer are UTF-8,
and the "dict_<your-type-here>" clients will need to ensure that
this is so.

After that, we can just let the UTF-8 data flow into the database
engine if supported, or try to translate to the database charset if
not. Probably each table's charset is declared as part of the table
configuration, and the generic dictionary layer handles translation
of inputs and outputs...

Anyway, I am still reluctant to make use of UTF-8 without a larger
context in which this makes sense.

--
	Viktor.

P.S.
Morgan Stanley is looking for a New York City based, Senior Unix system/email administrator to architect and sustain our perimeter email environment. If you are interested, please drop me a note.
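[The character-splitting hazard Victor describes can be demonstrated with Python's re module standing in for PCRE; this is an illustrative sketch only (PCRE itself would be told about UTF-8 via its PCRE_UTF8 compile option). Matching the same pattern in character mode and in raw byte mode shows both failure modes: a non-match where a match is expected, and a capture that ends in the middle of a character.]

```python
import re

s = "a\u00e9b"          # "aéb": é is one character...
b = s.encode("utf-8")   # ...but two bytes, b"a\xc3\xa9b"

# Character-aware matching: "." matches the whole character é.
print(bool(re.fullmatch(r"a.b", s)))   # True

# Byte-oriented matching: "." matches exactly one byte, so the
# two-byte é cannot fit between "a" and "b".
print(bool(re.fullmatch(rb"a.b", b)))  # False

# Worse: a sub-expression capture can stop mid-character,
# returning half of é -- a byte string that is not valid UTF-8.
m = re.match(rb"(a.)", b)
print(m.group(1))                      # b'a\xc3'
```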
From: Wietse Venema on 14 Apr 2010 20:28

Victor Duchovni:
> On Wed, Apr 14, 2010 at 12:54:47PM -0400, Wietse Venema wrote:
>
> > > I am a bit reluctant at this time to assume that untyped data coming in
> > > that looks like UTF-8, really is UTF-8. Even if the LDAP lookup returns
> > > plausibly useful results, will the UTF-8 envelope survive related
> > > processing in Postfix?
> > >
> > > - PCRE lookups don't currently request UTF-8 support
> >
> > Meaning it will blow up, or what?
>
> When passing UTF-8 data to a regexp engine, we need to tell the engine
> that it is handling UTF-8 data, or it may produce match sub-expressions
> that consist of pieces of characters. Should "a.b" match a Unicode string
> where there is a multibyte character between "a" and "b"? What should ${1}
> be for "(a*.)" when "a" is followed by a multi-byte character?
>
> More generally, the issue is that we need a larger design in which we
> have a canonical data representation inside all the pieces of Postfix,
> and conversion logic at all system boundaries. This is much bigger than
> LDAP lookups.

Speaking of canonical representation, Postfix by design strips off
the encapsulation on input (CRLF in SMTP, newline in local submission,
and length+value in QMQP) and adds the encapsulation back upon
delivery. This is sufficient for 7BIT or 8BITMIME content as we know
it today. Note that by doing this, Postfix normalizes only the
end-of-line convention, not the payload of the message.

This means that with well-formed mail, the SMTP input is guaranteed
to be identical to the SMTP output (ignoring the extra Received:
header), and so on.

I don't think it is necessarily a good idea to "normalize" message
and envelope content into a canonical format (UTF-8 or otherwise),
do all processing in the canonical domain, and then do another
transformation on delivery.
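[The end-of-line round trip Wietse describes can be sketched in a few lines. This is illustrative Python, not Postfix's actual record-layer code: the payload bytes pass through untouched, so for well-formed mail the wire input and wire output are byte-identical.]

```python
def smtp_to_internal(wire: bytes) -> list:
    """Strip SMTP CRLF encapsulation: wire format -> bare lines."""
    return wire.split(b"\r\n")[:-1]  # trailing CRLF ends the last line

def internal_to_smtp(lines: list) -> bytes:
    """Re-add CRLF encapsulation upon delivery."""
    return b"".join(line + b"\r\n" for line in lines)

wire = b"Subject: test\r\n\r\nhello\r\n"
lines = smtp_to_internal(wire)
# Only the end-of-line convention was normalized, not the payload,
# so the round trip reproduces the input exactly.
assert internal_to_smtp(lines) == wire
```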
More likely, one would transform a non-ASCII lookup string into the
character set of the lookup table mechanism and back, whatever that
character set might be, and return "not found" when the transformation
is not possible or when it is not implemented.

Although gateway MTAs have a choice to either downgrade 8BITMIME to
7BIT or return mail as undeliverable, there is no equivalent choice
for envelope addresses with non-ASCII localparts. A gateway into
today's SMTP world would have to return envelopes with non-ASCII
localparts as undeliverable.

I would not be surprised if someone came up with the equivalent of
RFC 2047 for SMTP envelope localparts, so that mail could be tunneled
through a legacy SMTP infrastructure, between systems that support
8-bit usernames.

	Wietse
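[The transform-or-not-found strategy above can be sketched as follows. This is a hypothetical illustration in Python; the table layout, charset names, and the charset_lookup helper are all invented for the example, not Postfix code.]

```python
def charset_lookup(key, table, charset):
    """Transcode a Unicode lookup key into the table's own charset;
    treat keys that cannot be represented as simply not found."""
    try:
        encoded = key.encode(charset)
    except (UnicodeEncodeError, LookupError):
        return None  # transformation impossible -> "not found"
    value = table.get(encoded)
    if value is None:
        return None
    return value.decode(charset)  # transform the result back

# A Latin-1 table: keys and values stored in the table's charset.
table = {"andré".encode("latin-1"): "ok".encode("latin-1")}

print(charset_lookup("andré", table, "latin-1"))  # "ok"
print(charset_lookup("中文", table, "latin-1"))    # None: unrepresentable
```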