perf_events: improve Intel event scheduling [Kernel]

Prev: BKL: Explicitly add BKL around get_sb/fill_super
Next: [PATCH 2/3] sysctl: Remove CTL_NONE and CTL_UNNUMBERED

From: Paul Mackerras on 21 Dec 2009 20:20

On Fri, Dec 11, 2009 at 12:59:16PM +0100, stephane eranian wrote:

> There is a major difference between PPC and X86 here. PPC has a centralized
> register to control start/stop. This register uses bitmask to enable
> or disable counters. Thus, in hw_perf_enable(), if n_added=0, then you
> just need to
> use the pre-computed bitmask. Otherwise, you need to recompute the bitmask to
> include the new registers. The assignment of events and validation is done in
> hw_group_sched_in().

That's not entirely accurate. Yes there is a global start/stop bit,
but there isn't a bitmask to enable or disable counters. There is a
selector bitfield for each counter (except the limited-function
counters) and you can set the selector to the 'count nothing' value if
you don't want a particular counter to count.

Validation is done in hw_group_sched_in() but not the assignment of
events to counters. That's done in hw_perf_enable(), via the
model-specific ppmu->compute_mmcr() call.

> In X86, assignment and validation is done in hw_group_sched_in(). Activation is
> done individually for each counter. There is no centralized register
> used here, thus
> no bitmask to update.
>
> Disabling a counter does not trigger a complete reschedule of events.
> This happens
> only when hw_group_sched_in() is called.
>
> The n_events = 0 in hw_perf_disable() is used to signal that something
> is changing.
> It should not be here but here.

The meaning of "It should not be here but here" is quite unclear to me.

> The problem is that
> hw_group_sched_in() needs a way
> to know that it is called for a completely new series of group
> scheduling so it can
> discard any previous assignment. This goes back to the issue I raised
> in my previous
> email. You could add a parameter to hw_group_sched_in() that would
> indicate this is
> the first group. that would cause n_events =0 and the function would
> start accumulating
> events for the new scheduling period.

I don't think hw_group_sched_in is ever called for a completely new
series of group scheduling. If you have per-cpu counters active, they
don't get scheduled out and in again with each task switch. So you
will tend to get a hw_pmu_disable call, then a series of disable calls
for the per-task events for the old task, then a series of
hw_group_sched_in calls for the per-task events for the new task, then
a hw_pmu_enable call.

On powerpc we maintain an array with pointers to all the currently
active events. That makes it easy to know at hw_pmu_enable() time
what events need to be put on the PMU. Also it means that at
hw_group_sched_in time you can look at the whole set of events,
including the ones just added, to see if it's feasible to put them all
on. At that point we just check feasibility, which is quite quick and
easy using the bit-vector encoding of constraints. The bit-vector
encoding lets us represent multiple constraints of various forms in
one pair of 64-bit values per event. We can express constraints such
as "you can have at most N events in a class X" or "you can't have
events in all of classes A, B, C and D" or "control register bitfield
X must be set to Y", and then check that a set of events satisfies all
the constraints with some simple integer arithmetic. I don't know
exactly what constraints you have on x86 but I would be surprised if
you couldn't handle them the same way.

Paul.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Stephane Eranian on 29 Dec 2009 09:50

Paul,

So if I understand what both of you are saying, it seems that
event scheduling has to take place in the pmu->enable() callback
which is per-event.

In the case of X86, you can chose to do a best-effort scheduling,
i.e., only assign
the new event if there is a compatible free counter. That would be incremental.

But the better solution would be to re-examine the whole situation and
potentially
move existing enabled events around to free a counter if the new event is more
constrained. That would require stopping the PMU, rewriting config and
data registers
and re-enabling the PMU. This latter solution is the only possibility
to avoid ordering
side effects, i.e., the assignment of events to counters depends on
the order in which
events are created (or enabled).

The register can be considered freed by pmu->disable() if scheduling takes place
in pmu->enable().

From what Paul was saying about hw_perf_group_sched_in(), it seems like this
function can be used to check if a new group is compatible with the existing
enabled events. Compatible means that there is a possible assignment of
events to counters.

As for the n_added logic, it seems like perf_disable() resets n_added to zero.
N_added is incremented in pmu->enable(), i.e., add one event, or the
hw_perf_group_sched_in(), i.e., add a whole group. Scheduling is based on
n_events. The point of n_added is to verify whether something needs to be
done, i.e., event scheduling, if an event or group was added between
perf_disable()
and perf_enable(). In pmu->disable(), the list of enabled events is
compacted and
n_events is decremented.

Did I get this right?

All the enable and disable calls can be called from NMI interrupt context
and thus must be very careful with locks.

On Tue, Dec 22, 2009 at 2:02 AM, Paul Mackerras <paulus(a)samba.org> wrote:
> On Mon, Dec 21, 2009 at 04:40:40PM +0100, Peter Zijlstra wrote:
>
>> I'm not really seeing the problem here...
>>
>>
>> perf_disable() <-- shut down the full pmu
>>
>> pmu->disable() <-- hey someone got removed (easy free the reg)
>> pmu->enable() <-- hey someone got added (harder, check constraints)
>>
>> hw_perf_group_sched_in() <-- hey a full group got added
>> (better than multiple ->enable)
>>
>> perf_enable() <-- re-enable pmu
>>
>>
>> So ->disable() is used to track freeing, ->enable is used to add
>> individual counters, check constraints etc..
>>
>> hw_perf_group_sched_in() is used to optimize the full group enable.
>>
>> Afaict that is what power does (Paul?) and that should I think be
>> sufficient to track x86 as well.
>
> That sounds right to me.
>
>> Since sched_in() is balanced with sched_out(), the ->disable() calls
>> should provide the required information as to the occupation of the pmu.
>> I don't see the need for more hooks.
>>
>> Paul, could you comment, since you did all this for power?
>
> On powerpc we maintain a list of currently enabled events in the arch
> code. Does x86 do that as well?
>
> If you have the list (or array) of events easily accessible, it's
> relatively easy to check whether the whole set is feasible at any
> point, without worrying about which events were recently added. The
> perf_event structure has a spot where the arch code can store which
> PMU register is used for that event, so you can easily optimize the
> case where the event doesn't move.
>
> Like you, I'm not seeing where the difficulty lies. Perhaps Stephane
> could give us a detailed example if he still thinks there's a
> difficulty.
>
> Paul.
>

--
Stephane Eranian | EMEA Software Engineering
Google France | 38 avenue de l'Opéra | 75002 Paris
Tel : +33 (0) 1 42 68 53 00
This email may be confidential or privileged. If you received this
communication by mistake, please
don't forward it to anyone else, please erase all copies and
attachments, and please let me know that
it went to the wrong person. Thanks
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Paul Mackerras on 6 Jan 2010 23:20

Stephane,

> So if I understand what both of you are saying, it seems that
> event scheduling has to take place in the pmu->enable() callback
> which is per-event.

How much you have to do in the pmu->enable() function depends on
whether the PMU is already stopped (via a call hw_perf_disable()) or
not. If it is already stopped you don't have to do the full event
scheduling computation, you only have to do enough to work out if the
current set of events plus the event being enabled is feasible, and
record the event for later. In other words, you can defer the actual
scheduling until hw_perf_enable() time. Most calls to the
pmu->enable() function are with the PMU already stopped, so it's a
worthwhile optimization.

> In the case of X86, you can chose to do a best-effort scheduling,
> i.e., only assign
> the new event if there is a compatible free counter. That would be incremental.
>
> But the better solution would be to re-examine the whole situation and
> potentially
> move existing enabled events around to free a counter if the new event is more
> constrained. That would require stopping the PMU, rewriting config and
> data registers
> and re-enabling the PMU. This latter solution is the only possibility
> to avoid ordering
> side effects, i.e., the assignment of events to counters depends on
> the order in which
> events are created (or enabled).

On powerpc, the pmu->enable() function always stops the PMU if it
wasn't already stopped. That simplifies the code a lot because it
means that I can do all the actual event scheduling (i.e. deciding on
which event goes on which counter and working out PMU control register
values) in hw_perf_enable().

> The register can be considered freed by pmu->disable() if scheduling takes place
> in pmu->enable().
>
> >From what Paul was saying about hw_perf_group_sched_in(), it seems like this
> function can be used to check if a new group is compatible with the existing
> enabled events. Compatible means that there is a possible assignment of
> events to counters.

The hw_perf_group_sched_in() function is an optimization so I can add
all the counters in the group to the set under consideration and then
do just one feasibility check, rather than adding each counter in
the group in turn and doing the feasibility check for each one.

> As for the n_added logic, it seems like perf_disable() resets n_added to zero.
> N_added is incremented in pmu->enable(), i.e., add one event, or the
> hw_perf_group_sched_in(), i.e., add a whole group. Scheduling is based on
> n_events. The point of n_added is to verify whether something needs to be
> done, i.e., event scheduling, if an event or group was added between
> perf_disable()
> and perf_enable(). In pmu->disable(), the list of enabled events is
> compacted and
> n_events is decremented.
>
> Did I get this right?

The main point of n_added was so that hw_perf_enable() could know
whether the current set of events is a subset of the last set. If it
is a subset, the scheduling decisions are much simpler.

> All the enable and disable calls can be called from NMI interrupt context
> and thus must be very careful with locks.

I didn't think the pmu->enable() and pmu->disable() functions could be
called from NMI context. Also, I would expect that if hw_perf_enable
and hw_perf_disable are called from NMI context, that the calls would
be balanced. That simplifies things a lot.

Paul.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Peter Zijlstra on 7 Jan 2010 04:00

On Thu, 2010-01-07 at 15:13 +1100, Paul Mackerras wrote:
>
> On powerpc, the pmu->enable() function always stops the PMU if it
> wasn't already stopped.

I just looked at the generic code and it looks like kernel/perf_event.c
always calls perf_disable() before calling ->enable().

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Peter Zijlstra on 7 Jan 2010 04:10

On Thu, 2010-01-07 at 15:13 +1100, Paul Mackerras wrote:
>
> > All the enable and disable calls can be called from NMI interrupt context
> > and thus must be very careful with locks.
>
> I didn't think the pmu->enable() and pmu->disable() functions could be
> called from NMI context.

I don't think they're called from NMI context either, most certainly not
from the generic code.

The x86 calls the raw disable from nmi to throttle the counter, but all
that (should) do is disable that counter, which is limited to a single
msr write. After that it schedules a full disable by sending a self-ipi.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 1 2 3 4
Prev: BKL: Explicitly add BKL around get_sb/fill_super
Next: [PATCH 2/3] sysctl: Remove CTL_NONE and CTL_UNNUMBERED