From: Florin Andrei on 6 Jul 2010 15:10

On 07/06/2010 11:30 AM, Victor Duchovni wrote:
>
> No, disabling the cache will still leave a skewed distribution. Connection
> creation is uniform across the servers, but connection lifetime is much
> longer on the slow server, so its connection concurrency is much higher
> (potentially equal to the destination concurrency limit under suitable
> conditions, thus keeping the fast servers essentially idle).
>
> A time-based cache is the fairness mechanism that keeps connection
> lifetimes uniform across the servers, which ensures non-starvation
> of fast servers, and avoids further overload of (congested) slow servers.

I see.

I realize that email delivery is not a trivial problem, but it seems
baffling that a seemingly simple task ("fair" volume-based load balancing
between transports) is so hard to achieve.

A very dumb algorithm should accomplish it: single-threaded delivery (no
concurrency), a "voluntary" (sender-side) limit of N messages delivered
per connection, then reconnect. DNS randomization should then do the
trick. If the network and the servers are fast (and they are, in my case),
this shouldn't slow down delivery too much (in fact, a small speed
decrease might even be beneficial).

I think I know how to eliminate concurrency, but I'm lacking a
volume-based limit for the connections. I'll keep looking for a solution.

--
Florin Andrei
http://florin.myip.org/
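As a rough point of reference, the "dumb algorithm" above corresponds
approximately to the following main.cf sketch, with "out" as a hypothetical
transport name defined in master.cf. Parameter availability should be
checked with postconf against the Postfix version in use; in particular, a
volume-based per-connection cap (smtp_connection_reuse_count_limit) only
appeared in later Postfix releases and did not exist at the time of this
thread.

    # main.cf sketch - "out" is an assumed transport name
    # eliminate per-destination concurrency for this transport
    out_destination_concurrency_limit = 1

    # cap how long a cached connection may be reused (default 300s)
    smtp_connection_reuse_time_limit = 60s

    # volume-based cap on deliveries per connection; later Postfix
    # releases only - verify with "postconf smtp_connection_reuse_count_limit"
    smtp_connection_reuse_count_limit = 10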
From: Victor Duchovni on 6 Jul 2010 15:27

On Tue, Jul 06, 2010 at 12:10:41PM -0700, Florin Andrei wrote:

> I realize that email delivery is not a trivial problem, but it seems
> baffling that a seemingly simple task ("fair" volume-based load balancing
> between transports) is so hard to achieve.

If you want to deliver the same number of messages to each server,
regardless of server performance (message-count fairness, rather than
concurrency fairness), and suffer high latency when a slow server starts
to impede message flow, then turning off the cache will indeed give you
roughly uniform message distribution:

    - *New* connections are distributed uniformly
    - There is at most one delivery per connection
    - Hence messages are distributed uniformly

However, concurrency will not be distributed uniformly, and a slow
server will account for most or all of the concurrency, ensuring a
high average latency even when alternative servers are sitting idle.

> I'll keep looking for a solution.

What negative symptoms are your systems exhibiting?
What *real* problem are you trying to solve?

--
    Viktor.
From: Florin Andrei on 6 Jul 2010 16:00

On 07/06/2010 12:27 PM, Victor Duchovni wrote:
>
> If you want to deliver the same number of messages to each server,
> regardless of server performance (message-count fairness, rather than
> concurrency fairness), and suffer high latency when a slow server starts
> to impede message flow, then turning off the cache will indeed give you
> roughly uniform message distribution:
>
>     - *New* connections are distributed uniformly
>     - There is at most one delivery per connection
>     - Hence messages are distributed uniformly
>
> However, concurrency will not be distributed uniformly, and a slow
> server will account for most or all of the concurrency, ensuring a
> high average latency even when alternative servers are sitting idle.

That's fine. One transport is on the local network; the other is across
a data link that would have been considered "as fast as local" not too
long ago. Both servers are modern, fast hardware, and both are highly
available from the point of view of the machines generating the emails.
Even if one of them disappears, the other will just magically take over,
and at worst we're no worse off than before.

The "slow" server, therefore, is not really "slow". It's just different
enough (mostly in latency) to tip over the sensitive delivery algorithm,
which seems to be tuned for Internet conditions rather than local or
near-local networks.

From what you're saying, it appears that single-threaded delivery is
unnecessary - the email "generators" will simply hit the upper connection
limit and stay near it, with newly released slots being occupied by
either one relay or the other at random. That should ensure a "fair"
distribution, I think.

> What negative symptoms are your systems exhibiting?
> What *real* problem are you trying to solve?

The real problem was described in the other big thread I started
recently: delivery to a certain big, popular email provider is
exceedingly slow. We have a pretty small delivery window between the
moment the messages are created and the moment they should be available
to the users, and that's not a problem with any of the other providers
(heck, Gmail for instance seems to absorb email faster than we can send
it - and this while its anti-spam filters seem at once fairer and more
effective than the other providers').

We already did some of the stuff you indicated a long time ago (the spam
feedback loop, etc.) and started working on the rest (whitelisting, etc.)
a while ago, which is supposed to get us out of the red zone. But
*meanwhile* I have to make the best of a tricky set of mutually exclusive
constraints.

Having multiple exit points seems to improve the overall delivery speed -
this is true even right now, when distribution is skewed toward the
faster server 4:1. My estimate is that a near-1:1 distribution would
actually fix our time-constraint problem even before whitelisting, so you
can see why there's a big incentive to get this done.

--
Florin Andrei
http://florin.myip.org/
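For reference, the "two exit points, chosen at random" arrangement is
often expressed with a transport_maps entry that routes the slow
destination through a dedicated transport whose nexthop is a DNS name
carrying one A record per exit relay. The transport name, hostname, and
file path below are illustrative only, and the assumption that the
Postfix SMTP client randomizes across the nexthop's multiple A records
(smtp_randomize_addresses) should be verified for this kind of setup.

    # main.cf (illustrative)
    transport_maps = hash:/etc/postfix/transport
    # default is yes; randomizes the order of equal-preference addresses
    smtp_randomize_addresses = yes

    # master.cf - dedicated transport for the slow destination
    yahoo-out  unix  -       -       n       -       -       smtp

    # /etc/postfix/transport (rebuild with "postmap /etc/postfix/transport")
    # relays.example.internal is an assumed name with one A record per relay
    yahoo.com       yahoo-out:[relays.example.internal]
    .yahoo.com      yahoo-out:[relays.example.internal]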
From: Victor Duchovni on 6 Jul 2010 16:10

On Tue, Jul 06, 2010 at 01:00:14PM -0700, Florin Andrei wrote:

> Having multiple exit points seems to improve the overall delivery speed -
> this is true even right now, when distribution is skewed toward the
> faster server 4:1. My estimate is that a near-1:1 distribution would
> actually fix our time-constraint problem even before whitelisting, so you
> can see why there's a big incentive to get this done.

So you have multiple exit points with non-uniform latency, but the more
severe congestion is downstream, so you want to load the exit points
uniformly. Yes, the solution is to disable the connection cache, and
set reasonably low connection and helo timeouts in the transport feeding
the two exit points, so that when one is down and non-responsive (no TCP
reset), you don't suffer excessive hand-off latency for 50% of deliveries.

master.cf:

    transp    unix ... smtp
        -o smtp_connect_timeout=$<transp>_connect_timeout
        -o smtp_helo_timeout=$<transp>_helo_timeout

main.cf:

    # default is 30s
    transp_connect_timeout = 2s
    # default is 300s
    transp_helo_timeout = 30s

--
    Viktor.
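The "disable the connection cache" part is not shown in the fragment
above; a minimal way to express it, assuming the same placeholder
transport name, is a per-transport override in master.cf, or the global
main.cf equivalent:

    # master.cf - per-transport override (sketch)
    transp    unix ... smtp
        -o smtp_connection_cache_on_demand=no

    # main.cf - global alternative (default is yes); with caching off,
    # there is at most one delivery per connection
    smtp_connection_cache_on_demand = no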
From: Florin Andrei on 8 Jul 2010 16:37

On 07/06/2010 01:10 PM, Victor Duchovni wrote:
>
> So you have multiple exit points with non-uniform latency, but the more
> severe congestion is downstream, so you want to load the exit points
> uniformly. Yes, the solution is to disable the connection cache, and
> set reasonably low connection and helo timeouts in the transport feeding
> the two exit points, so that when one is down and non-responsive (no TCP
> reset), you don't suffer excessive hand-off latency for 50% of deliveries.

I did that. You know what? It's amazingly accurate. After tens of
thousands of messages, the logs on the two exit points showed almost
exactly the same number of messages relayed - within 1.2% or so. That
was a very nice result to contemplate.

After disabling the connection cache for internal delivery, it looks
like we took a 2x performance hit internally, which is exactly what I
expected. But that's okay; the internal rate is orders of magnitude above
the Yahoo rate anyway. From an external perspective, things are actually
much better now.

Case closed. Thanks for all the help.

--
Florin Andrei
http://florin.myip.org/