From: bab on 9 May 2010 09:24

Hello,

I'm running a number of FreeBSD VMs on an ESX cluster using an MD3000i (Dell iSCSI SAN) for storage. Most of the time these systems all work fine, however periodically our automated monitoring system reports one or two of the hosts inaccessible for a brief period (5-10 minutes). This usually occurs in the early morning so it doesn't affect our operations, but I'm concerned about the underlying cause.

In the logs of the affected machines are errors like those below. I would assume this means that for some reason the connection to the SAN is timing out, but I can't seem to find the definition of the mpt_cam_event code "0x60" anywhere. This occurs both on VMs where I've extended the disk timeout with "kern.cam.da.retry_count=100" and on VMs where I have not done this.

Any ideas?

kernel: mpt0: attempting to abort req 0xc4298800:11 function 0
kernel: mpt0: mpt_wait_req(1) timed out
kernel: mpt0: mpt_recover_commands: abort timed-out. Resetting controller
kernel: mpt0: mpt_cam_event: 0x60
kernel: mpt0: completing timedout/aborted req 0xc4298800:11
kernel: mpt0: request 0xc4298b70:32 timed out for ccb 0xc47f7800 (req->ccb 0xc47f7800)
kernel: mpt0: request 0xc42965a0:33 timed out for ccb 0xc4283000 (req->ccb 0xc4283000)
kernel: mpt0: request 0xc4297c70:34 timed out for ccb 0xc43e3000 (req->ccb 0xc43e3000)
kernel: mpt0: request 0xc4293f80:35 timed out for ccb 0xc4806000 (req->ccb 0xc4806000)
kernel: mpt0: request 0xc4292c70:36 timed out for ccb 0xc480c800 (req->ccb 0xc480c800)
kernel: mpt0: request 0xc4297220:37 timed out for ccb 0xc43db000 (req->ccb 0xc43db000)
kernel: mpt0: request 0xc429a650:38 timed out for ccb 0xc4810000 (req->ccb 0xc4810000)
kernel: mpt0: request 0xc42923b0:39 timed out for ccb 0xc480e000 (req->ccb 0xc480e000)
kernel: mpt0: request 0xc4299110:40 timed out for ccb 0xc480f000 (req->ccb 0xc480f000)
kernel: mpt0: request 0xc4297590:41 timed out for ccb 0xc468f000 (req->ccb 0xc468f000)
kernel: mpt0: attempting to abort req 0xc4298b70:32 function 0
kernel: mpt0: completing timedout/aborted req 0xc4298b70:32
kernel: mpt0: abort of req 0xc4298b70:0 completed
kernel: mpt0: attempting to abort req 0xc42965a0:33 function 0
kernel: mpt0: completing timedout/aborted req 0xc42965a0:33
kernel: mpt0: abort of req 0xc42965a0:0 completed
kernel: mpt0: attempting to abort req 0xc4297c70:34 function 0
kernel: mpt0: completing timedout/aborted req 0xc4297c70:34
kernel: mpt0: abort of req 0xc4297c70:0 completed
kernel: mpt0: attempting to abort req 0xc4293f80:35 function 0
kernel: mpt0: completing timedout/aborted req 0xc4293f80:35
kernel: mpt0: abort of req 0xc4293f80:0 completed
kernel: mpt0: attempting to abort req 0xc4292c70:36 function 0
kernel: mpt0: completing timedout/aborted req 0xc4292c70:36
kernel: mpt0: abort of req 0xc4292c70:0 completed
kernel: mpt0: attempting to abort req 0xc4297220:37 function 0
kernel: mpt0: completing timedout/aborted req 0xc4297220:37
kernel: mpt0: abort of req 0xc4297220:0 completed
kernel: mpt0: attempting to abort req 0xc429a650:38 function 0
kernel: mpt0: mpt_wait_req(1) timed out
kernel: mpt0: mpt_recover_commands: abort timed-out. Resetting controller
kernel: mpt0: mpt_cam_event: 0x60
kernel: mpt0: completing timedout/aborted req 0xc429a650:38
kernel: mpt0: completing timedout/aborted req 0xc42923b0:39
kernel: mpt0: completing timedout/aborted req 0xc4299110:40
kernel: mpt0: completing timedout/aborted req 0xc4297590:41
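A log like the one above can be tallied per controller to see how many requests timed out, how many controller resets followed, and which mpt_cam_event codes appeared. The following is only a minimal sketch in Python and is not part of the original post. The function name summarize, the SAMPLE list, and the regular expressions are assumptions modelled solely on the message formats quoted in this log; any other driver messages are ignored.

```python
import re
from collections import Counter

# Patterns modelled on the messages quoted in this thread (an assumption,
# not an exhaustive list of what the mpt driver can print).
REQ_TIMEOUT = re.compile(r"(\w+): request (0x[0-9a-f]+):(\d+) timed out")
RESET = re.compile(
    r"(\w+): mpt_recover_commands: abort timed-out\. Resetting controller"
)
EVENT = re.compile(r"(\w+): mpt_cam_event: (0x[0-9a-f]+)")


def summarize(lines):
    """Count timed-out requests, controller resets and event codes per device."""
    timed_out = Counter()
    resets = Counter()
    events = Counter()
    for line in lines:
        m = REQ_TIMEOUT.search(line)
        if m:
            timed_out[m.group(1)] += 1
            continue
        m = RESET.search(line)
        if m:
            resets[m.group(1)] += 1
            continue
        m = EVENT.search(line)
        if m:
            events[(m.group(1), m.group(2))] += 1
    return {
        "timed_out": dict(timed_out),
        "resets": dict(resets),
        "events": dict(events),
    }


# A few lines copied from the log in this post.
SAMPLE = [
    "kernel: mpt0: mpt_wait_req(1) timed out",
    "kernel: mpt0: mpt_recover_commands: abort timed-out. Resetting controller",
    "kernel: mpt0: mpt_cam_event: 0x60",
    "kernel: mpt0: request 0xc4298b70:32 timed out for ccb 0xc47f7800 (req->ccb 0xc47f7800)",
    "kernel: mpt0: request 0xc42965a0:33 timed out for ccb 0xc4283000 (req->ccb 0xc4283000)",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Running the same function over the full system log of each affected VM (with the syslog timestamps kept alongside) would make it easier to check whether the resets on different hosts line up in time, which is what one would expect if the cause sits on the SAN or network side rather than in a single guest.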
From: Dominic Fandrey on 11 May 2010 05:05

On 09/05/2010 15:24, bab wrote:
> Most of the time these systems all work
> fine, however periodically our automated monitoring system reports one
> or two of the hosts inaccessible for a brief period (5-10 minutes).
> This usually occurs in the early morning so it doesn't affect our
> operations, but I'm concerned about the underlying cause.

Did you consider physical causes? E.g. the cleaning crew pulls the plug of the storage system for their vacuum cleaners.

Or maybe the fibres are bent too strongly to transmit the signals.

-- 
A: Because it fouls the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
From: Torfinn Ingolfsen on 11 May 2010 17:39

On 05/11/2010 11:05, Dominic Fandrey wrote:
> Did you consider physical causes? E.g. the cleaning crew pulls the
> plug of the storage system for their vacuum cleaners.
>
> Or maybe the fibres are bent too strongly to transmit the signals.

Or could it be other traffic on the SAN switches consuming all the bandwidth? Probably not a backup job if it is only 5-10 minutes.

-- 
Torfinn Ingolfsen, Norway