From: Joel Becker on 29 Mar 2010 16:00

On Mon, Mar 29, 2010 at 11:44:51AM -0700, john stultz wrote:
> > But if timer interrupt is delayed by more than acpi_pm wrap-around
> > time, then the update_wall_time() is also screwed. Since it is not, we
> > can rely on getrawmonotonic().
>
> Right, if the box hangs for longer than the clocksource can count for,
> the timekeeping subsystem will be off by some multiple of that length.
>
> And that's exactly why I'm advising against using
> gettimeofday/getrawmonotonic or any other software managed sense of time
> for the hangcheck timer, as you won't be able to correctly detect hangs.
>
> I'm also suggesting that using something like read_persistent_clock() is
> better, because there is no OS/software management involved (other than
> the minor syncing issue I mentioned before) so if the system hangs for a
> long period of time, then returns, you'll still be able to detect the
> hang.
>
> But maybe what folks are using the hangcheck timer for is shifting, so
> it's possible that I'm not quite understanding what you're trying to do
> here.

The people who use hangcheck-timer for the reasons I originally
wrote it absolutely want any hang, including long ones, detected.

Joel

--

"For every complex problem there exists a solution that is brief,
 concise, and totally wrong."
        -Unknown

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker(a)oracle.com
Phone: (650) 506-8127
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Yury Polyanskiy on 29 Mar 2010 17:10

>> > What I'm saying is that if you're using getrawmonotonic() to detect
>> > hangs, you might miss them, as getrawmonotonic may wrap (and thus stop
>> > continually increasing) if the timer interrupt is delayed. This does not
>> > apply to systems using the TSC clocksource, but does apply to systems
>> > using the acpi_pm.
>>
>> But if timer interrupt is delayed by more than acpi_pm wrap-around
>> time, then the update_wall_time() is also screwed. Since it is not, we
>> can rely on getrawmonotonic().
>
> Right, if the box hangs for longer than the clocksource can count for,
> the timekeeping subsystem will be off by some multiple of that length.
>

Oh, I see. You mean that getrawmonotonic() wouldn't work under
abnormal conditions. I understand now, sorry for the confusion. You
are correct, of course.

I personally don't like the idea of relying on read_persistent_clock(),
and not only because of hwclock and ntp. In fact, my core interest in
hangcheck-timer is to set a very low margin (1 to 3 jiffies, for
example) so that I would get a log message upon any kernel slowdown
or a tick miss (as a hardware integrity check). I don't think
read_persistent_clock() is precise enough for this purpose, is it?

Also, hooking into the ntp update code complicates an otherwise simple
driver. I propose to simply check on non-S390 if the clock source
resolves to something other than TSC and dump a warning message on
driver load (something like "Hangcheck: kernel using clocksource %s,
which is not reliable for hang detection").

What do you think about it?

Thanks,
Yury
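
The wrap-around point conceded above can be illustrated with a small user-space model. This is only a sketch, not kernel code: it assumes the ACPI PM timer's nominal 3.579545 MHz rate and a 24-bit counter width (some chipsets provide 32 bits), and the helper names `counter_at` and `measured_elapsed` are hypothetical. It shows that software which only reads the counter before and after a hang under-reports the hang by a whole number of wrap periods.

```python
# User-space model of a wrapping hardware counter; not kernel code.
ACPI_PM_HZ = 3_579_545          # nominal ACPI PM timer frequency
ACPI_PM_BITS = 24               # assumed width; some chipsets provide 32 bits
MASK = (1 << ACPI_PM_BITS) - 1


def counter_at(seconds):
    """Raw counter value after `seconds` of real time (wraps at 2**24)."""
    return int(seconds * ACPI_PM_HZ) & MASK


def measured_elapsed(start_s, end_s):
    """Elapsed seconds as seen by software that reads the counter only twice."""
    delta_cycles = (counter_at(end_s) - counter_at(start_s)) & MASK
    return delta_cycles / ACPI_PM_HZ


# Time the counter can measure before it wraps (about 4.69 s here).
WRAP_SECONDS = (MASK + 1) / ACPI_PM_HZ

# A 2 s stall is shorter than one wrap period, so it is measured correctly.
short_hang = measured_elapsed(0.0, 2.0)

# A 10 s stall spans two full wraps, which are silently lost.
long_hang = measured_elapsed(0.0, 10.0)
```

In this model `long_hang` comes out as 10 s minus two wrap periods, which is the "off by some multiple of that length" behaviour described in the quoted text.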
From: john stultz on 29 Mar 2010 17:50

On Mon, 2010-03-29 at 17:08 -0400, Yury Polyanskiy wrote:
> >> > What I'm saying is that if you're using getrawmonotonic() to detect
> >> > hangs, you might miss them, as getrawmonotonic may wrap (and thus stop
> >> > continually increasing) if the timer interrupt is delayed. This does not
> >> > apply to systems using the TSC clocksource, but does apply to systems
> >> > using the acpi_pm.
> >>
> >> But if timer interrupt is delayed by more than acpi_pm wrap-around
> >> time, then the update_wall_time() is also screwed. Since it is not, we
> >> can rely on getrawmonotonic().
> >
> > Right, if the box hangs for longer than the clocksource can count for,
> > the timekeeping subsystem will be off by some multiple of that length.
> >
>
> Oh, I see. You mean that getrawmonotonic() wouldn't work under
> abnormal conditions. I understand now, sorry for the confusion. You
> are correct, of course.

And something else I thought of: while the TSC won't wrap, the
multiplication done to convert to nanoseconds will overflow when you hit
a large enough cycle delta. So even TSC systems are not guaranteed to
have timekeeping (and thus getrawmonotonic) work over infinite time
without accumulation.

We try to establish this length via timekeeping_max_deferment(), so that
we make sure we don't go into tickless mode for longer than the
clocksource can handle.

> I personally don't like the idea of relying on read_persistent_clock()
> not only because of hwclock and ntp. In fact, my core interest in
> hangcheck-timer is to set a very low margin (1 to 3 jiffies for
> example) so that I would get a log message upon any kernel slow down
> or a tick-miss (as a hardware integrity check). I don't think
> read_persistent_clock() is precise enough for this purpose, is it?

read_persistent_clock is a bit coarse, so for small intervals it would
not do.
However, the current timeout range for the hangcheck timer is in
seconds, which should be fine for read_persistent_clock().

You might also have some trouble with small intervals, since things like
tickless systems or other advanced power-saving systems might try to
collate or push timers together to save battery. So ticks may be delayed
a small amount (timers are only guaranteed to fire AFTER the time
specified; there really is no promised bound on how late they may be).

Additionally, on -rt systems, you might have higher priority FIFO tasks
blocking the hangcheck timer from executing for a smallish amount of
time.

> Also, hooking to ntp update code complicates an otherwise simple
> driver. I propose to simply check on non-S390 if the clock source
> resolves to something other than TSC and dump a warning message on
> driver load (something like "Hangcheck: kernel using clocksource %s,
> which is not reliable for hang detection").

That requires the hangcheck code to parse the current clocksource, which
might change as the system runs, so it also has to track the clocksource
over time. So I'm not sure it's that much easier of a solution.

Something to also consider might be to look at the softlockup
watchdog, which is fairly similar but somewhat more deeply integrated
into the kernel. Maybe some of this could be merged?

thanks
-john
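
The multiplication overflow mentioned in this message can be sketched in user space. The model below assumes a cycles-to-nanoseconds conversion of the general form `(cycles * mult) >> shift` with the product held in 64 bits; the 3 GHz frequency, the two shift values, and the helper names are illustrative assumptions, and `max_safe_cycles` is a hypothetical stand-in for the idea of a bound, not the implementation of timekeeping_max_deferment().

```python
# User-space model of mult/shift cycle-to-nanosecond conversion; not kernel code.
U64 = (1 << 64) - 1


def make_mult(freq_hz, shift):
    """Scale factor so that (cycles * mult) >> shift approximates nanoseconds."""
    return (1_000_000_000 << shift) // freq_hz


def cyc2ns_u64(cycles, mult, shift):
    """Convert cycles to ns with the product truncated to 64 bits,
    as unsigned 64-bit C arithmetic would do."""
    return ((cycles * mult) & U64) >> shift


def max_safe_cycles(mult):
    """Largest cycle delta whose product with mult still fits in 64 bits."""
    return U64 // mult


FREQ_HZ = 3_000_000_000          # hypothetical 3 GHz counter
SHIFT = 22                       # illustrative shift value
mult = make_mult(FREQ_HZ, SHIFT)

# One second's worth of cycles converts to roughly 1e9 ns.
one_second_ns = cyc2ns_u64(FREQ_HZ, mult, SHIFT)

# Just past the safe bound the 64-bit product wraps and the result collapses.
safe = max_safe_cycles(mult)
wrapped_ns = cyc2ns_u64(safe + 1, mult, SHIFT)
exact_ns = ((safe + 1) * mult) >> SHIFT   # what unbounded arithmetic would give

# A larger shift means a larger mult, hence a smaller safe cycle delta.
safe_large_shift = max_safe_cycles(make_mult(FREQ_HZ, 32))
```

Under these assumptions the counter itself never wraps, yet any interval longer than `safe` cycles between accumulations yields a wrong nanosecond value, which is why the time spent without a tick has to be bounded.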
From: Yury Polyanskiy on 29 Mar 2010 18:40

On Mon, 29 Mar 2010 14:43:44 -0700
john stultz <johnstul(a)us.ibm.com> wrote:

> On Mon, 2010-03-29 at 17:08 -0400, Yury Polyanskiy wrote:
> > >> > What I'm saying is that if you're using getrawmonotonic() to detect
> > >> > hangs, you might miss them, as getrawmonotonic may wrap (and thus stop
> > >> > continually increasing) if the timer interrupt is delayed. This does not
> > >> > apply to systems using the TSC clocksource, but does apply to systems
> > >> > using the acpi_pm.

> And something else I thought of, while the TSC won't wrap, the
> multiplication done to convert to nanoseconds will overflow when you hit
> a large enough cycle delta. So even TSC systems are not guaranteed to
> have timekeeping (and thus getrawmonotonic) work over infinite time
> without accumulation.

Agreed (large clock->shift, right?), but for hangcheck-timer this
would hardly be a problem, since such a large overflow is very unlikely
to land inside the allowed interval around the pre-planned timer fire
instant.

>
> You might also have some trouble with small intervals. Since things like
> tickless systems or other advanced power-savings systems might try to
> collate or push timers together to save battery. So ticks may be delayed
> a small amount (timers are only guaranteed to fire AFTER the time
> specified, there really is no promised bound on how late they may be).
>
> Additionally, on -rt systems, you might have higher priority FIFO tasks
> blocking the hangcheck timer from executing for a smallish amount of
> time.

Yes, these are the events I want to see logged. Essentially I use
the hangcheck timer to check the stability of the kernel's heartbeat.

> > Also, hooking to ntp update code complicates an otherwise simple
> > driver.
> > I propose to simply check on non-S390 if the clock source
> > resolves to something other than TSC and dump a warning message on
> > driver load (something like "Hangcheck: kernel using clocksource %s,
> > which is not reliable for hang detection").
>
> That requires the hangcheck code to parse the current clocksource, which
> might change as the system runs, so it also has to track the clocksource
> over time. So I'm not sure it's that much easier of a solution.

Oh, shoot, you are right. So if compiled-in it would always complain.

> Something to also consider might be to look at the softlockup
> watchdog, which is fairly similar but somewhat more deeply integrated
> into the kernel. Maybe some of this could be merged?

Yeah, for softlockup detection, I don't understand why one would
prefer hangcheck-timer to watchdog. I am sure Joel has some reasons
though. For me read_persistent_clock() is not a solution, and others
perhaps would indeed be using the softlockup watchdog, which leaves the
decision to Joel.

Best,
Y
From: Joel Becker on 7 Apr 2010 21:00

On Mon, Mar 29, 2010 at 06:34:14PM -0400, Yury Polyanskiy wrote:
> On Mon, 29 Mar 2010 14:43:44 -0700
> john stultz <johnstul(a)us.ibm.com> wrote:
> > On Mon, 2010-03-29 at 17:08 -0400, Yury Polyanskiy wrote:
> > > >> > What I'm saying is that if you're using getrawmonotonic() to detect
> > > >> > hangs, you might miss them, as getrawmonotonic may wrap (and thus stop
> > > >> > continually increasing) if the timer interrupt is delayed. This does not
> > > >> > apply to systems using the TSC clocksource, but does apply to systems
> > > >> > using the acpi_pm.
> > And something else I thought of, while the TSC won't wrap, the
> > multiplication done to convert to nanoseconds will overflow when you hit
> > a large enough cycle delta. So even TSC systems are not guaranteed to
> > have timekeeping (and thus getrawmonotonic) work over infinite time
> > without accumulation.

Ugh.

> Agreed (large clock->shift, right?), but for hangcheck-timer this
> would hardly be a problem, since such a large overflow is very unlikely to
> land inside allowed interval around the pre-planned timer fire instant.

But if you go beyond that interval...

> > You might also have some trouble with small intervals. Since things like
> > tickless systems or other advanced power-savings systems might try to
> > collate or push timers together to save battery. So ticks may be delayed
> > a small amount (timers are only guaranteed to fire AFTER the time
> > specified, there really is no promised bound on how late they may be).
> >
> > Additionally, on -rt systems, you might have higher priority FIFO tasks
> > blocking the hangcheck timer from executing for a smallish amount of
> > time.
>
> Yes, these are the events I want to see logged. Essentially I use
> hangcheck timer to check stability of kernel's heartbeat.

Which is neat, but not the original reason for hangcheck.
> > Something to also consider might be to look at the softlockup
> > watchdog, which is fairly similar but somewhat more deeply integrated
> > into the kernel. Maybe some of this could be merged?
>
> Yeah, for softlockup detection, I don't understand why one would
> prefer hangcheck-timer to watchdog. I am sure Joel has some reasons
> though. For me read_persistent_clock() is not a solution, and others
> perhaps would indeed be using softlockup watchdog, which leaves the
> decision to Joel.

hangcheck originally was designed to kill a box as fast as
possible. It comes out of the cluster environment.

Imagine you have two machines, node1 and node2, working against
a shared data store. They coordinate their access via a lock manager.
Then node2 goes out to lunch. Maybe qla2xxx decides to udelay() while
waiting for an FC device. Something like that.

After a time period, node1 decides that node2 must have crashed.
It recovers any intermediate state, then proceeds as if node2 is gone.

Now the udelay() finally finishes and node2 starts working
again. node2 does not know that node1 has continued without it. It
will write old data to the shared storage, corrupting it.

hangcheck-timer reduces this exposure significantly, because
the timer interrupt will fire reliably and quickly. hangcheck-timer - if
using the right clock source - will notice the time discrepancy and
immediately trigger the reset. Note that the reset is the only valid
solution here. We can't wait for node2 to try to figure anything out;
old data might be already queued in the I/O layer.

This is why hangcheck-timer must rely on wallclock time.
softdog was originally tried, but after a true hang (udelay(), PCI,
something with timer interrupts off) the system clock doesn't actually
notice the time change. So the system might have been hung for 30
seconds, but the system clock thinks it has only been gone for 10.
Softdog won't fire, but hangcheck-timer will.
This is also why suspend/resume has to be treated as a hang.

Joel

--

"The lawgiver, of all beings, most owes the law allegiance. He of
 all men should behave as though the law compelled him. But it is
 the universal weakness of mankind that what we are given to
 administer we presently imagine we own."
        - H.G. Wells

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker(a)oracle.com
Phone: (650) 506-8127
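
The 30-second versus 10-second scenario in this message can be modelled as one lateness check fed from two different clocks. This is a deliberately simplified sketch: the `HangCheck` class, the 5 s tick, and the 15 s margin are hypothetical values chosen for illustration, and the code does not reproduce how hangcheck-timer or softdog are actually implemented.

```python
# Simplified user-space model of a lateness check; not the driver's code.
from dataclasses import dataclass


@dataclass
class HangCheck:
    tick_s: float            # interval the timer is armed for
    margin_s: float          # lateness tolerated before declaring a hang
    armed_at: float = 0.0    # clock reading taken when the timer was armed

    def arm(self, now_s):
        self.armed_at = now_s

    def fired(self, now_s):
        """Return True if the callback ran late enough to warrant a reset."""
        late_by = (now_s - self.armed_at) - self.tick_s
        return late_by > self.margin_s


check = HangCheck(tick_s=5.0, margin_s=15.0)   # hypothetical settings
check.arm(0.0)

# The box hangs for 30 s, so the callback really runs at t = 35 s.
independent_clock_now = 35.0
# A software-maintained clock that missed ticks believes only 10 s were lost.
system_clock_now = 15.0

independent_clock_verdict = check.fired(independent_clock_now)
system_clock_verdict = check.fired(system_clock_now)
```

With these numbers only the independently counted time crosses the margin, which mirrors the argument that a clock maintained by the hung software cannot be trusted to report the length of the hang.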