From: Paul E. McKenney on 23 Aug 2008 22:50

On Sat, Aug 23, 2008 at 06:07:35PM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <paulmck(a)linux.vnet.ibm.com> wrote:
>
> > Is this a sufficient improvement?
>
> yeah - looks much better. This was the block that meets the eye for the
> first time in the patch so it stuck out.
>
> just one more small pet peeve of mine: please use vertical alignment too
> to improve readability. Instead of:
>
> > #define MAX_RCU_LEVELS 3
> > #define RCU_FANOUT (CONFIG_RCU_FANOUT)
> > #define RCU_FANOUT_SQ (RCU_FANOUT * RCU_FANOUT)
> > #define RCU_FANOUT_CUBE (RCU_FANOUT_SQ * RCU_FANOUT)
>
> this looks a bit more structured IMO:
>
> > #define MAX_RCU_LEVELS  3
> > #define RCU_FANOUT      (CONFIG_RCU_FANOUT)
> > #define RCU_FANOUT_SQ   (RCU_FANOUT * RCU_FANOUT)
> > #define RCU_FANOUT_CUBE (RCU_FANOUT_SQ * RCU_FANOUT)

Good point, fixed.

> maybe even this:
>
> > #if (NR_CPUS) <= RCU_FANOUT
> > # define NUM_RCU_LVLS 1
> > # define NUM_RCU_LVL_0 1
> > # define NUM_RCU_LVL_1 (NR_CPUS)
> > # define NUM_RCU_LVL_2 0
> > # define NUM_RCU_LVL_3 0
> > #elif (NR_CPUS) <= RCU_FANOUT_SQ
> > # define NUM_RCU_LVLS 2
> > # define NUM_RCU_LVL_0 1
> > # define NUM_RCU_LVL_1 (((NR_CPUS) + RCU_FANOUT - 1) / RCU_FANOUT)
> > # define NUM_RCU_LVL_2 (NR_CPUS)
> > # define NUM_RCU_LVL_3 0
> > #elif (NR_CPUS) <= RCU_FANOUT_CUBE
> > # define NUM_RCU_LVLS 3
> > # define NUM_RCU_LVL_0 1
> > # define NUM_RCU_LVL_1 (((NR_CPUS) + RCU_FANOUT_SQ - 1) / RCU_FANOUT_SQ)
> > # define NUM_RCU_LVL_2 (((NR_CPUS) + (RCU_FANOUT) - 1) / (RCU_FANOUT))
> > # define NUM_RCU_LVL_3 NR_CPUS
> > #else
> > # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
> > #endif /* #if (NR_CPUS) <= RCU_FANOUT */
>
> but no strong feelings on that one. (maybe inserting a space at the
> right places helps too, no need for a full tab)

Yep, just like you, spaced it just enough to keep the longest one from
running over one line. ;-)

I left the definitions for RCU_SUM and NUM_RCU_NODES compact, though:

#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
#define NUM_RCU_NODES (RCU_SUM - NR_CPUS)

The other alternative would be to stack RCU_SUM as follows:

#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + \
		 NUM_RCU_LVL_2 + NUM_RCU_LVL_3)

which seemed to me to add more ugly than enlightenment.

Testing is going well. I occasionally have to restrain myself from
going full-bore for 4096-CPU optimality -- but I need to keep it simple
until/unless someone with that large a machine shows where improvements
are needed.

							Thanx, Paul
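As a quick sanity check of the level-sizing cascade quoted above, here
is a minimal user-space sketch that runs sample values through the same
macros. The NR_CPUS and CONFIG_RCU_FANOUT values are assumptions picked
for illustration, not taken from any real kernel config:

	#include <stdio.h>

	#define NR_CPUS           4096	/* assumed for illustration */
	#define CONFIG_RCU_FANOUT 64	/* assumed for illustration */

	#define MAX_RCU_LEVELS  3
	#define RCU_FANOUT      (CONFIG_RCU_FANOUT)
	#define RCU_FANOUT_SQ   (RCU_FANOUT * RCU_FANOUT)
	#define RCU_FANOUT_CUBE (RCU_FANOUT_SQ * RCU_FANOUT)

	#if (NR_CPUS) <= RCU_FANOUT
	# define NUM_RCU_LVLS  1
	# define NUM_RCU_LVL_0 1
	# define NUM_RCU_LVL_1 (NR_CPUS)
	# define NUM_RCU_LVL_2 0
	# define NUM_RCU_LVL_3 0
	#elif (NR_CPUS) <= RCU_FANOUT_SQ
	# define NUM_RCU_LVLS  2
	# define NUM_RCU_LVL_0 1
	# define NUM_RCU_LVL_1 (((NR_CPUS) + RCU_FANOUT - 1) / RCU_FANOUT)
	# define NUM_RCU_LVL_2 (NR_CPUS)
	# define NUM_RCU_LVL_3 0
	#elif (NR_CPUS) <= RCU_FANOUT_CUBE
	# define NUM_RCU_LVLS  3
	# define NUM_RCU_LVL_0 1
	# define NUM_RCU_LVL_1 (((NR_CPUS) + RCU_FANOUT_SQ - 1) / RCU_FANOUT_SQ)
	# define NUM_RCU_LVL_2 (((NR_CPUS) + (RCU_FANOUT) - 1) / (RCU_FANOUT))
	# define NUM_RCU_LVL_3 NR_CPUS
	#else
	# error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
	#endif

	#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + \
			 NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
	#define NUM_RCU_NODES (RCU_SUM - NR_CPUS)

	int main(void)
	{
		/* 4096 CPUs with fanout 64 take the two-level branch. */
		printf("levels=%d sizes=%d/%d/%d/%d rcu_node count=%d\n",
		       NUM_RCU_LVLS, NUM_RCU_LVL_0, NUM_RCU_LVL_1,
		       NUM_RCU_LVL_2, NUM_RCU_LVL_3, NUM_RCU_NODES);
		return 0;	/* prints: levels=2 sizes=1/64/4096/0 rcu_node count=65 */
	}

With these numbers the two-level branch is taken: one root plus 64
interior nodes fan out to 4096 CPUs, and NUM_RCU_NODES comes to 65
because the per-CPU leaf count in RCU_SUM is subtracted back out.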
From: Manfred Spraul on 24 Aug 2008 04:10

Paul E. McKenney wrote:
> +/*
> + * Definition for node within the RCU grace-period-detection hierarchy.
> + */
> +struct rcu_node {
> +	spinlock_t lock;
> +	unsigned long qsmask;	/* CPUs or groups that need to switch in */
> +				/* order for current grace period to proceed.*/
> +	unsigned long qsmaskinit;
> +				/* Per-GP initialization for qsmask. */

I'm not sure if a bitmap is the right storage. If I understand the code
correctly, it encodes two pieces of information:

1) If the bitmap is clear, then all cpus have completed whatever they
   need to do. A counter is more efficient than a bitmap. Especially:
   it would allow choosing the optimal fan-out, independent of 32/64
   bits.

2) Whether the current cpu must do something to complete the current
   period. This is local information; usually (always?) only the
   current cpu needs to know if it must do something. It doesn't need
   to be stored in a shared structure -- the information could live in
   a per-cpu structure.

> +	/*
> +	 * Extract the list of ready callbacks, disabling to prevent
> +	 * races with call_rcu() from interrupt handlers.
> +	 */
> +	local_irq_save(flags);
> +	list = rdp->nxtlist;
> +	rdp->nxtlist = *rdp->nxttail[RCU_DONE_TAIL];
> +	*rdp->nxttail[RCU_DONE_TAIL] = NULL;
> +	tail = rdp->nxttail[RCU_DONE_TAIL];
> +	for (count = RCU_NEXT_SIZE - 1; count >= 0; count--)
> +		if (rdp->nxttail[count] == rdp->nxttail[RCU_DONE_TAIL])
> +			rdp->nxttail[count] = &rdp->nxtlist;
> +	local_irq_restore(flags);

Do you have a description of the events between call_rcu() and the rcu
callback? Is the following description correct?

__call_rcu() queues in RCU_NEXT_TAIL.
In the middle of the current grace period: rcu_check_quiescent_state()
calls rcu_next_callbacks_are_ready(); the entry is now in
RCU_NEXT_READY_TAIL.
** 0.5 cycles: wait until all cpus have completed the current cycle.
rcu_process_gp_end() moves from NEXT_READY_TAIL to WAIT_TAIL.
** full grace period
rcu_process_gp_end() moves from WAIT_TAIL to DONE_TAIL.
rcu_do_batch() finds the entries in DONE_TAIL and calls the callback.

> +/*
> + * Do softirq processing for the current CPU.
> + */
>  static void rcu_process_callbacks(struct softirq_action *unused)
>  {
>  	/*
>  	 * Memory references from any prior RCU read-side critical sections
> -	 * executed by the interrupted code must be see before any RCU
> +	 * executed by the interrupted code must be seen before any RCU
>  	 * grace-period manupulations below.
>  	 */
>
>  	smp_mb(); /* See above block comment. */

Why this mb()? A grace period already separates us from the last
read-side critical section that might have accessed the pointer, and
the rcu internal code already does spin_lock()+spin_unlock(). Isn't
that sufficient?

> -	__rcu_process_callbacks(&rcu_ctrlblk, &__get_cpu_var(rcu_data));
> -	__rcu_process_callbacks(&rcu_bh_ctrlblk, &__get_cpu_var(rcu_bh_data));
> +	__rcu_process_callbacks(&rcu_state, &__get_cpu_var(rcu_data));
> +	__rcu_process_callbacks(&rcu_bh_state, &__get_cpu_var(rcu_bh_data));

Have you considered merging RCU_DONE_TAIL for rcu_data and rcu_bh_data?

> +
> +/**
> + * call_rcu - Queue an RCU callback for invocation after a grace period.
> + * @head: structure to be used for queueing the RCU updates.
> + * @func: actual update function to be invoked after the grace period
> + *
> + * The update function will be invoked some time after a full grace
> + * period elapses, in other words after all currently executing RCU
> + * read-side critical sections have completed. RCU read-side critical
> + * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
> + * and may be nested.
> + */

The docbook entry is duplicated: it appears both in
include/linux/rcupdate.h and here. What about removing one of them?

I would go one step further: even add call_rcu_sched() into rcupdate.h.
Add a Kconfig bool "RCU_NEEDS_SCHED" and automatically define either
the extern or the #define.

--
    Manfred
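To make the pipeline question above concrete, here is a small
user-space model of the segmented callback list that the quoted
extraction code manipulates. It is a sketch only: the nxtlist/nxttail
layout follows the patch, but the surrounding types are invented here,
and the empty-segment guard is an addition for safety rather than a
copy of the kernel, which checks for ready callbacks before reaching
this path:

	#include <stddef.h>

	struct cb {
		struct cb *next;
	};

	#define RCU_DONE_TAIL       0	/* ready to invoke */
	#define RCU_WAIT_TAIL       1	/* waiting on current grace period */
	#define RCU_NEXT_READY_TAIL 2	/* ready for next grace period */
	#define RCU_NEXT_TAIL       3	/* newly queued by call_rcu() */
	#define RCU_NEXT_SIZE       4

	struct rcu_data_model {
		struct cb *nxtlist;			/* one list, four segments */
		struct cb **nxttail[RCU_NEXT_SIZE];	/* end of each segment */
	};

	/* Detach the RCU_DONE_TAIL segment, following the quoted code. */
	struct cb *extract_done(struct rcu_data_model *rdp)
	{
		struct cb *list;
		int count;

		if (rdp->nxttail[RCU_DONE_TAIL] == &rdp->nxtlist)
			return NULL;	/* done segment empty (guard added here) */
		list = rdp->nxtlist;
		rdp->nxtlist = *rdp->nxttail[RCU_DONE_TAIL];
		*rdp->nxttail[RCU_DONE_TAIL] = NULL;	/* terminate detached list */
		/* Any tail pointer still equal to the old end of the done
		 * segment now heads an empty segment, so reset it to the
		 * list head. */
		for (count = RCU_NEXT_SIZE - 1; count >= 0; count--)
			if (rdp->nxttail[count] == rdp->nxttail[RCU_DONE_TAIL])
				rdp->nxttail[count] = &rdp->nxtlist;
		return list;
	}

The descending loop order is the subtle part: the RCU_DONE_TAIL entry
itself must be reset last, so that the other tail pointers are still
compared against the old end of the done segment.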
From: Paul E. McKenney on 24 Aug 2008 12:40

On Sun, Aug 24, 2008 at 10:08:44AM +0200, Manfred Spraul wrote:
> Paul E. McKenney wrote:

Thank you for looking this over!

>> + * Definition for node within the RCU grace-period-detection hierarchy.
>> + */
>> +struct rcu_node {
>> +	spinlock_t lock;
>> +	unsigned long qsmask;	/* CPUs or groups that need to switch in */
>> +				/* order for current grace period to proceed.*/
>> +	unsigned long qsmaskinit;
>> +				/* Per-GP initialization for qsmask. */
>
> I'm not sure if a bitmap is the right storage. If I understand the code
> correctly, it encodes two pieces of information:
> 1) If the bitmap is clear, then all cpus have completed whatever they
>    need to do. A counter is more efficient than a bitmap. Especially:
>    it would allow choosing the optimal fan-out, independent of 32/64
>    bits.
> 2) Whether the current cpu must do something to complete the current
>    period. This is local information; usually (always?) only the
>    current cpu needs to know if it must do something. It doesn't need
>    to be stored in a shared structure -- the information could live in
>    a per-cpu structure.

I am using the bitmap in force_quiescent_state() to work out who to
check dynticks and who to send reschedule IPIs to. I could scan all
of the per-CPU rcu_data structures, but am assuming that after a few
jiffies there would typically be relatively few CPUs still needing to
do a quiescent state. Given this assumption, on systems with large
numbers of CPUs, scanning the bitmask greatly reduces the number of
cache misses compared to scanning the rcu_data structures.

>> +	/*
>> +	 * Extract the list of ready callbacks, disabling to prevent
>> +	 * races with call_rcu() from interrupt handlers.
>> +	 */
>> +	local_irq_save(flags);
>> +	list = rdp->nxtlist;
>> +	rdp->nxtlist = *rdp->nxttail[RCU_DONE_TAIL];
>> +	*rdp->nxttail[RCU_DONE_TAIL] = NULL;
>> +	tail = rdp->nxttail[RCU_DONE_TAIL];
>> +	for (count = RCU_NEXT_SIZE - 1; count >= 0; count--)
>> +		if (rdp->nxttail[count] == rdp->nxttail[RCU_DONE_TAIL])
>> +			rdp->nxttail[count] = &rdp->nxtlist;
>> +	local_irq_restore(flags);
>
> Do you have a description of the events between call_rcu() and the rcu
> callback? Is the following description correct?
>
> __call_rcu() queues in RCU_NEXT_TAIL.
> In the middle of the current grace period: rcu_check_quiescent_state()
> calls rcu_next_callbacks_are_ready(); the entry is now in
> RCU_NEXT_READY_TAIL.
> ** 0.5 cycles: wait until all cpus have completed the current cycle.
> rcu_process_gp_end() moves from NEXT_READY_TAIL to WAIT_TAIL.
> ** full grace period
> rcu_process_gp_end() moves from WAIT_TAIL to DONE_TAIL.
> rcu_do_batch() finds the entries in DONE_TAIL and calls the callback.

Yes, that is the sequence of events if grace periods are happening back
to back and if the current CPU has not yet passed through a quiescent
state. If RCU is idle, then there is no need to wait for a previous
grace period to complete. If this CPU already passed through its
quiescent state for the first grace period, and is not the CPU that
starts the next grace period, then there will be an additional grace
period to move from RCU_NEXT_TAIL to RCU_NEXT_READY_TAIL. If this CPU
is the one starting the next grace period, then all of its callbacks
get advanced.

>> +/*
>> + * Do softirq processing for the current CPU.
>> + */
>>  static void rcu_process_callbacks(struct softirq_action *unused)
>>  {
>>  	/*
>>  	 * Memory references from any prior RCU read-side critical sections
>> -	 * executed by the interrupted code must be see before any RCU
>> +	 * executed by the interrupted code must be seen before any RCU
>>  	 * grace-period manupulations below.
>>  	 */
>>
>>  	smp_mb(); /* See above block comment. */
>
> Why this mb()? A grace period already separates us from the last
> read-side critical section that might have accessed the pointer, and
> the rcu internal code already does spin_lock()+spin_unlock(). Isn't
> that sufficient?

The combination of spin_lock()+spin_unlock()+spin_lock()+spin_unlock()
would suffice. But the pair of smp_mb()s suffices regardless of fast
paths through the rest of the mechanism. Because
rcu_process_callbacks() is on the slow path, the trivial proof of
ordering is more important than the slight reduction in overhead.

>> -	__rcu_process_callbacks(&rcu_ctrlblk, &__get_cpu_var(rcu_data));
>> -	__rcu_process_callbacks(&rcu_bh_ctrlblk, &__get_cpu_var(rcu_bh_data));
>> +	__rcu_process_callbacks(&rcu_state, &__get_cpu_var(rcu_data));
>> +	__rcu_process_callbacks(&rcu_bh_state, &__get_cpu_var(rcu_bh_data));
>
> Have you considered merging RCU_DONE_TAIL for rcu_data and rcu_bh_data?

I have (and am) considering this. It would require an additional
per-CPU data structure and an additional check in rcu_pending(), and it
would need a bit more manipulation when callbacks become "done".
However, it would simplify the requeuing code in rcu_do_batch() a bit.

One interesting side effect would be that blimit would apply globally
to both rcu and rcu_bh, so that a burst of (say) rcu callbacks could
delay rcu_bh callbacks. I am not yet sure whether this is good or bad.

>> +/**
>> + * call_rcu - Queue an RCU callback for invocation after a grace period.
>> + * @head: structure to be used for queueing the RCU updates.
>> + * @func: actual update function to be invoked after the grace period
>> + *
>> + * The update function will be invoked some time after a full grace
>> + * period elapses, in other words after all currently executing RCU
>> + * read-side critical sections have completed. RCU read-side critical
>> + * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
>> + * and may be nested.
>> + */
>
> The docbook entry is duplicated: it appears both in
> include/linux/rcupdate.h and here. What about removing one of them?

Good point, done!

> I would go one step further: even add call_rcu_sched() into
> rcupdate.h. Add a Kconfig bool "RCU_NEEDS_SCHED" and automatically
> define either the extern or the #define.

Another approach would be to have an rcupdate.h definition that was a
wrapper around __call_rcu_sched(), and put the docbook stuff in
rcupdate.h. It would appear that docbook was not created with the idea
of alternative implementations in mind. ;-)

							Thanx, Paul
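Paul's answer is easier to follow with a toy version of the
bit-clearing in front of you. The following stripped-down sketch, with
names invented here rather than taken from the patch, shows how a
quiescent-state report clears a bit and propagates up the hierarchy;
the real code holds node->lock and does considerably more bookkeeping:

	#include <stdio.h>

	struct toy_rcu_node {
		unsigned long qsmask;	/* one bit per child still pending */
		unsigned long grpmask;	/* this node's bit in parent's qsmask */
		struct toy_rcu_node *parent;
	};

	void toy_cpu_quiet(struct toy_rcu_node *node, unsigned long mask)
	{
		for (;;) {
			node->qsmask &= ~mask;
			if (node->qsmask != 0)
				return;		/* siblings still pending */
			if (node->parent == NULL)
				break;		/* cleared the root's last bit */
			mask = node->grpmask;	/* last child: go up one level */
			node = node->parent;
		}
		printf("grace period complete\n");
	}

	int main(void)
	{
		/* Degenerate one-node tree with two CPUs under the root. */
		struct toy_rcu_node root = { 0x3, 0, NULL };

		toy_cpu_quiet(&root, 0x1);	/* first CPU: root still waits */
		toy_cpu_quiet(&root, 0x2);	/* last CPU: prints completion */
		return 0;
	}

Because a subtree's bit is cleared only once all of its children have
reported, force_quiescent_state() can skip whole subtrees whose bits
are already zero and touch only the cache lines on still-pending paths.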
From: Manfred Spraul on 24 Aug 2008 14:30

Paul E. McKenney wrote:
>>> +struct rcu_node {
>>> +	spinlock_t lock;
>>> +	unsigned long qsmask;	/* CPUs or groups that need to switch in */
>>> +				/* order for current grace period to proceed.*/
>>> +	unsigned long qsmaskinit;
>>> +				/* Per-GP initialization for qsmask. */
>>
>> I'm not sure if a bitmap is the right storage. If I understand the
>> code correctly, it encodes two pieces of information:
>> 1) If the bitmap is clear, then all cpus have completed whatever they
>>    need to do. A counter is more efficient than a bitmap. Especially:
>>    it would allow choosing the optimal fan-out, independent of 32/64
>>    bits.
>> 2) Whether the current cpu must do something to complete the current
>>    period. This is local information; usually (always?) only the
>>    current cpu needs to know if it must do something. It doesn't need
>>    to be stored in a shared structure -- the information could live
>>    in a per-cpu structure.
>
> I am using the bitmap in force_quiescent_state() to work out who to
> check dynticks and who to send reschedule IPIs to. I could scan all
> of the per-CPU rcu_data structures, but am assuming that after a few
> jiffies there would typically be relatively few CPUs still needing to
> do a quiescent state. Given this assumption, on systems with large
> numbers of CPUs, scanning the bitmask greatly reduces the number of
> cache misses compared to scanning the rcu_data structures.

It's an optimization question: which is rarer, force_quiescent_state()
or "normal" cpu_quiet calls? You have optimized for
force_quiescent_state(); I have optimized for "normal" cpu_quiet calls.
[ok, I admit: force_quiescent_state() is still missing in my code.]

Do you have any statistics?

--
    Manfred
From: Paul E. McKenney on 24 Aug 2008 17:30

On Sun, Aug 24, 2008 at 08:25:02PM +0200, Manfred Spraul wrote:
> Paul E. McKenney wrote:
>>>> +struct rcu_node {
>>>> +	spinlock_t lock;
>>>> +	unsigned long qsmask;	/* CPUs or groups that need to switch in */
>>>> +				/* order for current grace period to proceed.*/
>>>> +	unsigned long qsmaskinit;
>>>> +				/* Per-GP initialization for qsmask. */
>>>
>>> I'm not sure if a bitmap is the right storage. [...]
>>
>> I am using the bitmap in force_quiescent_state() to work out who to
>> check dynticks and who to send reschedule IPIs to. [...] Given this
>> assumption, on systems with large numbers of CPUs, scanning the
>> bitmask greatly reduces the number of cache misses compared to
>> scanning the rcu_data structures.
>
> It's an optimization question: which is rarer, force_quiescent_state()
> or "normal" cpu_quiet calls? You have optimized for
> force_quiescent_state(); I have optimized for "normal" cpu_quiet
> calls. [ok, I admit: force_quiescent_state() is still missing in my
> code.]

;-)

> Do you have any statistics?

If the system is completely busy, then I would expect normal
cpu_quiet() calls to be more common. But if the system were sized for
peak workload, it would spend a fair amount of time with many of the
CPUs idle. Power-conservation measures would hopefully push the
idleness onto single cores/dies/whatever, which could then be powered
down. A large fraction of the systems I see have utilizations well
under 50%. And latency concerns would also focus on
force_quiescent_state().

That said, I haven't had much to do with systems having more than 128
CPUs.

							Thanx, Paul
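For contrast with the bitmap scheme discussed above, here is a sketch
of the counter-based bookkeeping Manfred favors. This is hypothetical
code illustrating the trade-off, taken from neither implementation:

	#include <stddef.h>

	struct counter_node {
		int qscount;	/* children still pending, set at GP start */
		struct counter_node *parent;
	};

	void counter_cpu_quiet(struct counter_node *node)
	{
		for (; node != NULL; node = node->parent) {
			if (--node->qscount != 0)
				return;	/* siblings still pending */
		}
		/* Root reached zero: the grace period is complete. */
	}

A count makes each report cheap and frees the fan-out from the
BITS_PER_LONG limit, but it cannot tell force_quiescent_state() which
CPUs are still holding up the grace period, so a laggard hunt has to
scan per-CPU state instead -- precisely the cache-miss cost Paul is
trying to avoid.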