mailing lists and "unknown mail transport error" [Postfix]

From: Dominik Storck on 4 Aug 2010 04:39

Hello,

I have a problem here with mailing lists.

I've configured these as entries in the /etc/aliases like
group-x: :include:/etc/group-x.list

and corresponding files /etc/group-x.list with one line per (local)
recipient

This has been working perfectly for years. Now the number of recipients
for some of these lists has increased to more than 200.

When a mail is sent to one of these lists, delivery starts as expected
but stops shortly before the end of the list. The exact count varies,
probably depending on the state of other concurrent mail queue entries.

The error message is "unknown mail transport error". The mail stays
in the queue and delivery starts over from the beginning until I remove
the mail from the queue.

I suspect there is some limit of around 200 recipients, queue entries
or the like.

Playing with parameters like
default_destination_concurrency_limit = 300
default_destination_recipient_limit = 300
smtp_recipient_limit = 500
did not produce any visible effect.

Any ideas would be appreciated.

Thanks
Dominik

here is an extract from the log
==================================================================
Aug 2 14:50:49 postfix postfix/pickup[27603]: 720441D34008: uid=0
from=<root>
Aug 2 14:50:49 postfix postfix/cleanup[28372]: 720441D34008:
message-id=<20100802125049.720441D34008(a)postfix.meinedomain.com>
Aug 2 14:50:49 postfix postfix/qmgr[904]: 720441D34008:
from=<root(a)postfix.meinedomain.com>, size=577, nrcpt=1 (queue active)

Aug 2 14:50:49 postfix postfix/local[29097]: 720441D34008:
to=<aba(a)postfix.meinedomain.com>, orig_to=<all(a)meinedomain.com>,
relay=local, delay=0.34, delays=0.23/0/0/0.11, dsn=2.0.0, status=sent
(delivered to maildir)

....
[211 lines removed, differing only in recipient address and increasing
delay]
....

Aug 2 14:52:06 postfix postfix/local[29097]: 720441D34008:
to=<wie(a)postfix.meinedomain.com>, orig_to=<all(a)meinedomain.com>,
relay=local, delay=77, delays=0.23/0/0/77, dsn=2.0.0, status=sent
(delivered to command: /usr/bin/vacation -j wie)
Aug 2 14:52:06 postfix postfix/local[29097]: 720441D34008:
to=<wol(a)postfix.meinedomain.com>, orig_to=<all(a)meinedomain.com>,
relay=local, delay=77, delays=0.23/0/0/77, dsn=2.0.0, status=sent
(delivered to maildir)
Aug 2 14:52:06 postfix postfix/local[29097]: 720441D34008:
to=<wre(a)postfix.meinedomain.com>, orig_to=<all(a)meinedomain.com>,
relay=local, delay=77, delays=0.23/0/0/77, dsn=2.0.0, status=sent
(delivered to maildir)

Aug 2 14:52:08 postfix postfix/error[32270]: 720441D34008:
to=<all(a)meinedomain.com>, relay=none, delay=79, delays=0.23/79/0/0.67,
dsn=4.3.0, status=deferred (unknown mail transport error)

==================================================================

postconf -n:
==================================================================
alias_maps = hash:/etc/aliases
biff = no
canonical_maps = hash:/etc/postfix/canonical
command_directory = /usr/sbin
config_directory = /etc/postfix
daemon_directory = /usr/lib/postfix
data_directory = /var/lib/postfix
debug_peer_level = 2
defer_transports =
disable_dns_lookups = no
disable_mime_output_conversion = no
header_checks = regexp:/etc/postfix/header_checks
home_mailbox = Maildir/
html_directory = /usr/share/doc/packages/postfix/html
inet_interfaces = all
inet_protocols = all
local_destination_concurrency_limit = 2
mail_owner = postfix
mail_spool_directory = /var/mail
mailbox_command =
mailbox_size_limit = 0
mailbox_transport =
mailq_path = /usr/bin/mailq
manpage_directory = /usr/share/man
masquerade_classes = envelope_sender, header_sender, header_recipient
masquerade_domains =
masquerade_exceptions = root
message_size_limit = 1024000000
mydestination = $myhostname localhost.$mydomain my-domain.com
myhostname = postfix.my-domain.com
mynetworks_style = subnet
newaliases_path = /usr/bin/newaliases
queue_directory = /var/spool/postfix
readme_directory = /usr/share/doc/packages/postfix/README_FILES
relayhost =
relocated_maps = hash:/etc/postfix/relocated
sample_directory = /usr/share/doc/packages/postfix/samples
sender_canonical_maps = hash:/etc/postfix/sender_canonical
sendmail_path = /usr/sbin/sendmail
setgid_group = maildrop
smtp_sasl_auth_enable = no
smtp_use_tls = yes
smtpd_client_connection_count_limit = 0
smtpd_client_restrictions =
smtpd_helo_required = yes
smtpd_helo_restrictions = permit_mynetworks, check_sender_access
hash:/etc/postfix/access, check_client_access
hash:/etc/postfix/access,reject_non_fqdn_hostname, permit
smtpd_recipient_restrictions =
permit_sasl_authenticated,check_sender_access
hash:/etc/postfix/access,check_client_access
hash:/etc/postfix/access,reject_unauth_destination,reject_non_fqdn_sender,reject_non_fqdn_recipient,
reject_unverified_recipient,reject_unauth_destination,permit
smtpd_sasl_auth_enable = no
smtpd_sender_restrictions = permit_mynetworks, check_sender_access
hash:/etc/postfix/access, permit
smtpd_use_tls = no
strict_8bitmime = no
strict_rfc821_envelopes = no
transport_maps = hash:/etc/postfix/transport
unknown_local_recipient_reject_code = 550
virtual_alias_domains = hash:/etc/postfix/virtual
virtual_alias_maps = hash:/etc/postfix/virtual
==================================================================

From: Wietse Venema on 4 Aug 2010 12:02

See http://www.postfix.org/DEBUG_README.html#logging

When Postfix does not receive or deliver mail, the first order of business is
to look for errors that prevent Postfix from working properly:

% egrep '(warning|error|fatal|panic):' /some/log/file | more

Note: the most important message is near the BEGINNING of the output. Error
messages that come later are less useful.

The nature of each problem is indicated as follows:

* "panic" indicates a problem in the software itself that only a programmer
can fix. Postfix cannot proceed until this is fixed.

* "fatal" is the result of missing files, incorrect permissions, incorrect
configuration file settings that you can fix. Postfix cannot proceed until
this is fixed.

* "error" reports an error condition. For safety reasons, a Postfix process
will terminate when more than 13 of these happen.

* "warning" indicates a non-fatal error. These are problems that you may not
be able to fix (such as a broken DNS server elsewhere on the network) but
may also indicate local configuration errors that could become a problem
later.
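
For logs that are easier to post-process in a script, the egrep filter above can be sketched in Python. This is only an illustration of the same pattern; the function name first_problems and the limit parameter are made up for this example and are not part of Postfix. It keeps the matches in log order, since the earliest message is the most useful one.

```python
import re
from typing import Iterable, List

# Same alternation as the egrep pattern quoted above.
PATTERN = re.compile(r"(warning|error|fatal|panic):")


def first_problems(lines: Iterable[str], limit: int = 20) -> List[str]:
    """Return the first `limit` matching lines, in the order they were logged."""
    hits: List[str] = []
    for line in lines:
        if PATTERN.search(line):
            hits.append(line.rstrip("\n"))
            if len(hits) >= limit:
                break
    return hits


# Usage sketch (the path is a placeholder, as in the egrep example):
#     with open("/some/log/file", errors="replace") as fh:
#         for hit in first_problems(fh):
#             print(hit)
```

Note that a line such as "status=deferred (unknown mail transport error)" does not match, because the pattern requires a colon directly after the keyword. That is why the filter surfaces the underlying warning or fatal message rather than the delivery status line.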

From: Andrzej Kukuła on 5 Aug 2010 06:58

On Wed, Aug 4, 2010 at 10:39, Dominik Storck <dominik(a)storck.net> wrote:
>
> This has been working perfectly for years. Now the number of recipients
> for some of these lists has increased to more than 200.
>
> When a mail is sent to one of these lists, delivery starts as expected
> but stops shortly before the end of the list. The exact count varies,
> probably depending on the state of other concurrent mail queue entries.
>
> The error message is "unknown mail transport error". The mail stays
> in the queue and delivery starts over from the beginning until I remove
> the mail from the queue.
>
> I suspect there is some limit of around 200 recipients, queue entries
> or the like.

I'd guess it's a low open-file limit in the operating system. I ran into
this once when my 'everyone' alias grew to several hundred users. Check
ulimit -n
Increase it in your Postfix startup script to, say, 100000, and
see whether that makes a difference.

Regards,
Andrzej
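
A rough way to look at the limit from a script rather than the shell is sketched below. Keep in mind that ulimit -n and resource.getrlimit only report the limit of the process that asks, which is not necessarily what an already running daemon has. On Linux (an assumption, as is the helper name limits_for_pid) the limits of a specific process can be read from /proc/<pid>/limits instead.

```python
import resource
from pathlib import Path
from typing import Optional, Tuple


def nofile_limits() -> Tuple[int, int]:
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def parse_max_open_files(limits_text: str) -> Optional[Tuple[str, str]]:
    """Extract (soft, hard) from the text of a Linux /proc/<pid>/limits file.

    Values are returned as strings because they can be "unlimited".
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            return fields[3], fields[4]
    return None


def limits_for_pid(pid: int) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: read the limits of a running process (Linux only)."""
    return parse_max_open_files(Path(f"/proc/{pid}/limits").read_text())


soft, hard = nofile_limits()
print(f"this process: soft={soft} hard={hard}")
```

Checking the pid of the running local(8) process this way shows whether a raised limit in the startup script actually reached the daemon.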

From: Dominik Storck on 9 Aug 2010 14:25

Hello Andrzej,

thanks for your suggestion. However, increasing ulimit -n did not help
at first.

Adding -v in master.cf produced lots of debugging output, but nothing
that led to a solution.

After attaching strace to the local daemon, I observed what looks like a
loop when it evaluates the .forward files in the users' home
directories. Several of them have the form /home/username/.forward =
"\username". These files are produced by a vacation plugin for our
SquirrelMail web access to the mailboxes when users disable
forwarding and/or vacation messages.

After mail had been delivered to almost all of the recipients (many of
them have these .forward files, which were evaluated without any
problem), strace showed repeated opens of the .forward file of one and
the same user, about 4300 (!) times, probably until the above ulimit is
reached and the process segfaults.

As I haven't figured out yet what causes the loop, I'd appreciate any
other ideas.

Here is a snippet of the loop's output, with lots of similar blocks
before and after this one. The only value that changes is the 4304,
which I believe is the file descriptor. It is incremented with each
loop iteration.

=======================================================
Aug 9 12:06:05 postfix logger: lstat64("/home/wre/.forward",{st_mode=S_IFREG|0600, st_size=6, ...}) = 0
Aug 9 12:06:05 postfix logger: geteuid32() = 1193
Aug 9 12:06:05 postfix logger: setresuid32(-1, 0, -1) = 0
Aug 9 12:06:05 postfix logger: setresgid32(-1, 51, -1) = 0
Aug 9 12:06:05 postfix logger: setgroups32(1, [51]) = 0
Aug 9 12:06:05 postfix logger: setresuid32(-1, 51, -1) = 0
Aug 9 12:06:05 postfix logger: geteuid32() = 51
Aug 9 12:06:05 postfix logger: getegid32() = 51
Aug 9 12:06:05 postfix logger: geteuid32() = 51
Aug 9 12:06:05 postfix logger: setresuid32(-1, 0, -1) = 0
Aug 9 12:06:05 postfix logger: setresgid32(-1, 100, -1) = 0
Aug 9 12:06:05 postfix logger: setgroups32(1, [100]) = 0
Aug 9 12:06:05 postfix logger: setresuid32(-1, 1193, -1) = 0
Aug 9 12:06:05 postfix logger: open("/home/wre/.forward", O_RDONLY) = 4304
Aug 9 12:06:05 postfix logger: geteuid32() = 1193
Aug 9 12:06:05 postfix logger: setresuid32(-1, 0, -1) = 0
Aug 9 12:06:05 postfix logger: setresgid32(-1, 51, -1) = 0
Aug 9 12:06:05 postfix logger: setgroups32(1, [51]) = 0
Aug 9 12:06:05 postfix logger: setresuid32(-1, 51, -1) = 0
Aug 9 12:06:05 postfix logger: fcntl64(4304, F_GETFD) = 0
Aug 9 12:06:05 postfix logger: fcntl64(4304, F_SETFD, FD_CLOEXEC) = 0
Aug 9 12:06:05 postfix logger: read(4304, "\\wre\r\n", 4096) = 6
Aug 9 12:06:05 postfix logger: time(NULL) = 1281348364
Aug 9 12:06:05 postfix logger: time(NULL) = 1281348364
Aug 9 12:06:05 postfix logger: geteuid32() = 51
Aug 9 12:06:05 postfix logger: getegid32() = 51
Aug 9 12:06:05 postfix logger: geteuid32() = 51
Aug 9 12:06:05 postfix logger: setresuid32(-1, 0, -1) = 0
Aug 9 12:06:05 postfix logger: setresgid32(-1, 100, -1) = 0
Aug 9 12:06:05 postfix logger: setgroups32(1, [100]) = 0
Aug 9 12:06:05 postfix logger: setresuid32(-1, 1193, -1) = 0
============================================================================
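
With thousands of such blocks, counting how often each path is opened makes a loop like this stand out quickly. The following is a minimal sketch that assumes the strace output has been saved as text in the format shown above; the name count_opens is invented for this example.

```python
import re
from collections import Counter
from typing import Iterable

# Matches open("...") calls and captures the path argument.
OPEN_CALL = re.compile(r'\bopen\("([^"]+)"')


def count_opens(lines: Iterable[str]) -> Counter:
    """Count open() calls per path in strace output lines."""
    counts: Counter = Counter()
    for line in lines:
        match = OPEN_CALL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two open() lines in the style of the trace above, plus unrelated calls.
sample = [
    'Aug 9 12:06:05 postfix logger: lstat64("/home/wre/.forward",{st_mode=S_IFREG|0600, st_size=6, ...}) = 0',
    'Aug 9 12:06:05 postfix logger: open("/home/wre/.forward", O_RDONLY)',
    'Aug 9 12:06:05 postfix logger: setresuid32(-1, 0, -1) = 0',
    'Aug 9 12:06:05 postfix logger: open("/home/wre/.forward", O_RDONLY)',
]
for path, n in count_opens(sample).most_common(5):
    print(n, path)
```

A path that appears thousands of times while all others appear once or twice points at the recipient whose .forward processing never terminates.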

For now, deleting all the "quasi-empty" .forward files seems to solve
the problem, but I fear running into the same thing again when the
number of recipients increases.

Dominik
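
To list the "quasi-empty" files before deleting them, something along these lines could be used. It is a sketch under two assumptions that may not hold on every system: each directory directly below the given root is one user's home, and the directory name equals the user name. The function name self_forward_files is made up here. Surrounding whitespace is stripped before comparing, since the file in the trace above ends in "\r\n".

```python
from pathlib import Path
from typing import List


def self_forward_files(home_root: str) -> List[Path]:
    """Find .forward files whose only content is "\\username".

    Assumes home_root/<username>/.forward as the layout (an assumption).
    """
    found: List[Path] = []
    for home in sorted(Path(home_root).iterdir()):
        forward = home / ".forward"
        try:
            if not forward.is_file():
                continue
            content = forward.read_text(errors="replace").strip()
        except OSError:
            # Unreadable home directory or file: skip it.
            continue
        if content == "\\" + home.name:
            found.append(forward)
    return found


# Usage sketch:
#     for path in self_forward_files("/home"):
#         print(path)
```

Printing the candidates first and removing them in a separate step keeps genuine forwards and vacation entries untouched.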

From: Wietse Venema on 9 Aug 2010 16:17

In case you wonder, I wrote Postfix.

Perhaps you can follow instructions in

http://www.postfix.org/DEBUG_README.html#logging

TURN OFF -v logging before you do this.

Wietse
