From: Peter Zijlstra on 2 Jul 2010 06:00

On Fri, 2010-07-02 at 11:57 +0900, Paul Mundt wrote:
> At the moment it's not an issue since we have big enough counters that
> overflows don't really happen, especially if we're primarily using them
> for one-shot measuring.
>
> SH-4A style counters behave in such a fashion that we have 2 general
> purpose counters, and 2 counters for measuring bus transactions. These
> bus counters can optionally be disabled and used in a chained mode to
> provide the general purpose counters a 64-bit counter (the actual
> validity in the upper half of the chained counter varies depending on
> the CPU, but all of them can do at least 48 bits when chained).

Right, so I was reading some of that code and I couldn't actually find where you keep consistency between the hardware counter value and the stored prev_count value.

That is, suppose I'm counting: the hardware starts at 0, hwc->prev_count = 0 and event->count = 0.

At some point x, we context switch this task away, so we ->disable(), which disables the counter and updates the values, so at that time hwc->prev_count = x and event->count = x, right?

Now suppose we schedule the task back in, so we do ->enable(). Then what happens? sh_pmu_enable() finds an unused index (and disables it for some reason; it should already be cleared if it's not used, but I guess a few extra hardware writes don't hurt) and calls sh4a_pmu_enable() on it.

sh4a_pmu_enable() does 3 writes:

  PPC_PMCAT -- does this clear the counter value?
  PPC_CCBR  -- writes the ->config bits
  PPC_CCBR  -- adds CCBR_DUC; couldn't this be done in the previous
               write to this register?

Now, assuming that enable does indeed clear the hardware counter value, shouldn't you also set hwc->prev_count to 0 again? Otherwise the next update will see a massive jump. Alternatively, you could write the hwc->prev_count value back to the register.

If you eventually want to drop the chained counter support, I guess it would make sense to have sh_perf_event_update() read and clear the counter so that you're always 0-based, and then enforce an update from the arch tick handler so you never overflow.
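For reference, the consistency requirement described above reduces to the counter-update pattern most PMU backends of that era shared. A minimal sketch, assuming a hypothetical read_hw_counter() register accessor; this is not the actual sh_pmu code:

  #include <linux/perf_event.h>

  /*
   * Sketch of the usual counter-update pattern: event->count only
   * stays correct if hwc->prev_count mirrors what the hardware
   * counter last held.
   */
  static void sketch_event_update(struct perf_event *event,
                                  struct hw_perf_event *hwc, int idx)
  {
          u64 prev_raw, new_raw;

          do {
                  prev_raw = atomic64_read(&hwc->prev_count);
                  new_raw = read_hw_counter(idx);   /* hypothetical */
          } while (atomic64_cmpxchg(&hwc->prev_count,
                                    prev_raw, new_raw) != prev_raw);

          /*
           * If ->enable() cleared the hardware counter but left
           * prev_count at x, this delta is (0 - x) modulo 2^64:
           * the "massive jump" described above.
           */
          atomic64_add(new_raw - prev_raw, &event->count);
  }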
From: Will Deacon on 2 Jul 2010 09:00

Hi Peter,

On Thu, 2010-07-01 at 15:36 +0100, Peter Zijlstra wrote:
> On Fri, 2010-06-25 at 16:50 +0200, Peter Zijlstra wrote:
>
> > Not exactly sure how I could have messed up the ARM architecture code
> > to make this happen though... will have a peek.
>
> I did find a bug in there, not sure it could have been responsible for
> this but who knows...
>
> Pushed out a new git tree with the below delta folded in.

I had a look at this yesterday and discovered a bug in the ARM backend code, which I've posted a patch for to ALKML:

http://lists.infradead.org/pipermail/linux-arm-kernel/2010-July/019461.html

Unfortunately, with this applied and your latest changes I still get 0 from pinned hardware counters:

  # perf stat -r 5 -e cycles -e instructions -e cs -e faults -e branches -a -- git status

   Performance counter stats for 'git status' (5 runs):

              0  cycles                               ( +-     nan% )
              0  instructions             # 0.000 IPC ( +-     nan% )
          88447  context-switches                     ( +-  12.624% )
          13647  page-faults                          ( +-   0.015% )
              0  branches                             ( +-     nan% )

The changes you've made to arch/arm/kernel/perf_event.c look sane. If I get some time I'll try and dig deeper.

Will
From: Paul Mundt on 5 Jul 2010 07:20

On Fri, Jul 02, 2010 at 11:52:03AM +0200, Peter Zijlstra wrote:
> Right, so I was reading some of that code and I couldn't actually find
> where you keep consistency between the hardware counter value and the
> stored prev_count value.
>
> That is, suppose I'm counting: the hardware starts at 0, hwc->prev_count
> = 0 and event->count = 0.
>
> At some point x, we context switch this task away, so we ->disable(),
> which disables the counter and updates the values, so at that time
> hwc->prev_count = x and event->count = x, right?
>
> Now suppose we schedule the task back in, so we do ->enable(). Then what
> happens? sh_pmu_enable() finds an unused index (and disables it for some
> reason; it should already be cleared if it's not used, but I guess a few
> extra hardware writes don't hurt) and calls sh4a_pmu_enable() on it.

I don't quite remember where the ->disable() came from; I vaguely recall copying it from one of the other architectures, but it could have just been a remnant of something I had for debug code. In any event, you're correct: we don't seem to need it anymore.

> sh4a_pmu_enable() does 3 writes:
>
>   PPC_PMCAT -- does this clear the counter value?

Yes, the counters themselves are read-only, so clearing is done through the PMCAT control register.

>   PPC_CCBR  -- writes the ->config bits
>   PPC_CCBR  -- adds CCBR_DUC; couldn't this be done in the previous
>                write to this register?

No, the DUC bit needs to be set by itself or the write is discarded on some CPUs. Clearing it together with other bits is fine, however. This is what starts the counter running.

> Now, assuming that enable does indeed clear the hardware counter value,
> shouldn't you also set hwc->prev_count to 0 again? Otherwise the next
> update will see a massive jump.

I think that's a correct observation, but I'm having difficulty verifying it on my current board, since it seems someone moved the PMCAT register: the counters aren't being cleared on this particular CPU. I'll test tomorrow on the board I originally wrote this code for and see how that goes. It used to work fine, at least.

> Alternatively, you could write the hwc->prev_count value back to the
> register.

That would be an option if the counters weren't read-only, yes.

> If you eventually want to drop the chained counter support, I guess it
> would make sense to have sh_perf_event_update() read and clear the
> counter so that you're always 0-based, and then enforce an update from
> the arch tick handler so you never overflow.

Yes, I'd thought about that too. I'll give it a go once I find out where the other half of my registers disappeared to. As it is, it seems my bat and I have an appointment to make.
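Putting Paul's answers together with Peter's observation, the enable path would look roughly as follows. This is a sketch only: the write_pmcat()/write_ccbr() helpers and the PMCAT_CLEAR value are hypothetical stand-ins for the real register accesses, and the prev_count reset is the proposed fix, not existing code:

  static void sh4a_pmu_enable_sketch(struct hw_perf_event *hwc, int idx)
  {
          /* The counters are read-only; clearing goes through PMCAT. */
          write_pmcat(idx, PMCAT_CLEAR);

          /* The hardware is now at zero, so resync the soft copy. */
          atomic64_set(&hwc->prev_count, 0);

          /* Program the event selection bits; counter still stopped. */
          write_ccbr(idx, hwc->config);

          /*
           * DUC gets a write of its own: combined with other bit
           * changes, the write is discarded on some CPUs. This write
           * starts the counter.
           */
          write_ccbr(idx, hwc->config | CCBR_DUC);
  }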
From: Peter Zijlstra on 8 Jul 2010 07:20

On Thu, 2010-07-01 at 17:39 +0200, Peter Zijlstra wrote:
>
> Ah, for sampling for sure: simply group a software perf event and a
> hardware perf event together and use PERF_SAMPLE_READ.

So the idea is to sample using a software event (a periodic timer of sorts, maybe randomized) and weight its samples by the hardware event deltas.

Suppose you have a workload consisting of two main parts:

  my_important_work()
  {
          load_my_data();
          compute_me_silly();
  }

Now, let's assume that both these functions take the same time to complete for each part of work. In that case a periodic timer generates samples that are about 50/50 distributed between these two functions.

Let us further assume that load_my_data() is slow because it's missing all the caches, and compute_me_silly() is slow because it's defeating the branch predictor.

What we want to end up with is that when we sample for cache misses we get load_my_data() as the predominant function, not a nice 50/50 relation. Idem for branch misses and compute_me_silly().

By weighting the samples by the hardware counter delta we get this: if we assume that the sampling frequency is not a harmonic of the runtime of these functions, then statistics will do the right thing.

It basically generates a massive skid on the sample, but as long as most of the samples end up hitting the right function we're good. For a periodic workload like:

  while (lots) {
          my_important_work();
  }

that is true even for period > function_runtime, with the exception of that harmonic thing. For less neat workloads like:

  while (lots) {
          my_important_work();
          other_random_things();
  }

this needn't work unless period < function_runtime.

Clearly we cannot attribute anything to the actual instruction hit, due to the massive skid, but we can (possibly) say something about the function based on these statistical rules.
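To make the grouping concrete, here is a minimal userspace sketch of the setup Peter describes: a software clock event as the sampling group leader, with a counted hardware sibling whose value rides along in every sample via PERF_SAMPLE_READ. The period is an arbitrary placeholder and most error handling is omitted:

  #include <linux/perf_event.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                             int cpu, int group_fd, unsigned long flags)
  {
          return syscall(__NR_perf_event_open, attr, pid, cpu,
                         group_fd, flags);
  }

  int main(void)
  {
          struct perf_event_attr sw, hw;
          int leader, sibling;

          memset(&sw, 0, sizeof(sw));
          memset(&hw, 0, sizeof(hw));

          /* Software leader: periodic (ideally randomized) samples. */
          sw.type = PERF_TYPE_SOFTWARE;
          sw.config = PERF_COUNT_SW_CPU_CLOCK;
          sw.size = sizeof(sw);
          sw.sample_period = 4000000;        /* ~4ms, placeholder */
          sw.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_READ;
          sw.read_format = PERF_FORMAT_GROUP;
          sw.disabled = 1;

          /* Hardware sibling: counted only, never sampled itself. */
          hw.type = PERF_TYPE_HARDWARE;
          hw.config = PERF_COUNT_HW_CACHE_MISSES;
          hw.size = sizeof(hw);

          leader = perf_event_open(&sw, 0, -1, -1, 0);
          sibling = perf_event_open(&hw, 0, -1, leader, 0);
          if (leader < 0 || sibling < 0)
                  return 1;

          /* ... mmap the leader, enable, consume samples ... */
          return 0;
  }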
From: Ingo Molnar on 8 Jul 2010 07:30

* Peter Zijlstra <peterz(a)infradead.org> wrote:

> On Thu, 2010-07-01 at 17:39 +0200, Peter Zijlstra wrote:
> >
> > Ah, for sampling for sure: simply group a software perf event and a
> > hardware perf event together and use PERF_SAMPLE_READ.
>
> So the idea is to sample using a software event (a periodic timer of
> sorts, maybe randomized) and weight its samples by the hardware event
> deltas.
>
> Suppose you have a workload consisting of two main parts:
>
>   my_important_work()
>   {
>           load_my_data();
>           compute_me_silly();
>   }
>
> Now, let's assume that both these functions take the same time to
> complete for each part of work. In that case a periodic timer generates
> samples that are about 50/50 distributed between these two functions.
>
> Let us further assume that load_my_data() is slow because it's missing
> all the caches, and compute_me_silly() is slow because it's defeating
> the branch predictor.
>
> What we want to end up with is that when we sample for cache misses we
> get load_my_data() as the predominant function, not a nice 50/50
> relation. Idem for branch misses and compute_me_silly().
>
> By weighting the samples by the hardware counter delta we get this: if
> we assume that the sampling frequency is not a harmonic of the runtime
> of these functions, then statistics will do the right thing.

Yes. And if the platform code implements this, the tooling side already takes care of it, even if the CPU itself cannot generate interrupts based on, say, cache misses or branches (but can measure them via counts).

The only situation where statistics will not do the right thing is when the likelihood of the sample tick correlates significantly with the likelihood of the workload itself executing. Timer-dominated workloads would be an example. Real hrtimers are sufficiently tick-less to avoid most of these artifacts in practice.

	Ingo
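On the consuming side, the weighting Ingo confirms amounts to crediting each sample with the hardware-event delta instead of a count of one. A toy sketch, with illustrative IP bucketing standing in for real symbol resolution:

  #include <stdint.h>

  #define NBUCKETS 4096

  static uint64_t hist[NBUCKETS];
  static uint64_t last_count;

  /* One call per sample: the sampled IP plus the running hardware
   * count delivered via PERF_SAMPLE_READ. */
  static void account_sample(uint64_t ip, uint64_t hw_count)
  {
          uint64_t delta = hw_count - last_count;

          last_count = hw_count;

          /*
           * Credit the delta, not 1: code that generates the hardware
           * events dominates the histogram even though the sampling
           * tick itself is time-based.
           */
          hist[(ip >> 2) % NBUCKETS] += delta;
  }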