ipc semaphores: reduce ipc_lock contention in semtimedop [Kernel]


From: Manfred Spraul on 13 Apr 2010 13:20

Hi Chris,

On 04/12/2010 08:49 PM, Chris Mason wrote:
> /*
> + * when a semaphore is modified, we want to retry the series of operations
> + * for anyone that was blocking on that semaphore. This breaks down into
> + * a few different common operations:
> + *
> + * 1) One modification releases one or more waiters for zero.
> + * 2) Many waiters are trying to get a single lock, only one will get it.
> + * 3) Many modifications to the count will succeed.
> + *
>
Have you thought about odd corner cases?
Last time, Nick noticed that it is possible to wait for arbitrary values
within a single semop:
- decrease semaphore 5 by 10
- wait until semaphore 5 is 0
- increase semaphore 5 by 10.
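
The three steps above can be written as one sembuf array. The following is a minimal sketch, not code from the patch; the helper name build_wait_for_value is hypothetical. Because semop() applies the whole array atomically and in array order, the call only succeeds once the semaphore holds exactly the given value, and it leaves the value unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/*
 * Hypothetical helper: fill ops[0..2] with "decrease by value, wait for
 * zero, increase by value", all on the same semaphore. Returns the number
 * of operations written.
 */
size_t build_wait_for_value(struct sembuf ops[3], unsigned short semnum,
                            short value)
{
    assert(value > 0);

    /* Step 1: needs semval >= value, otherwise the call sleeps. */
    ops[0] = (struct sembuf){ .sem_num = semnum, .sem_op = (short)-value,
                              .sem_flg = 0 };
    /* Step 2: needs the intermediate result to be exactly zero. */
    ops[1] = (struct sembuf){ .sem_num = semnum, .sem_op = 0, .sem_flg = 0 };
    /* Step 3: restore the original value. */
    ops[2] = (struct sembuf){ .sem_num = semnum, .sem_op = value,
                              .sem_flg = 0 };
    return 3;
}
```

A caller would pass the result to semop(semid, ops, 3). With semnum 5 and value 10 this reproduces the example from the list above.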

> SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
> unsigned, nsops, const struct timespec __user *, timeout)
> {
> @@ -1129,6 +1306,8 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
> struct sem_queue queue;
> unsigned long jiffies_left = 0;
> struct ipc_namespace *ns;
> + struct sem *blocker = NULL;
> + LIST_HEAD(pending);
>
> ns = current->nsproxy->ipc_ns;
>
> @@ -1168,6 +1347,14 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
> alter = 1;
> }
>
> + /*
> + * try_atomic_semop takes all the locks of all the semaphores in
> + * the sops array. We have to make sure we don't deadlock if userland
> + * happens to send them out of order, so we sort them by semnum.
> + */
> + if (nsops > 1)
> +         sort(sops, nsops, sizeof(*sops), sembuf_compare, NULL);
> +
>
Does sorting preserve the behavior?
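
One way to see why the question matters: if several operations in one call target the same semaphore, a comparison on semnum alone does not fix their relative order, and the kernel's sort() is, as far as I know, a heapsort and therefore not stable. The sketch below is a userspace model only, not the kernel's try_atomic_semop; struct op and would_succeed are hypothetical names, and the model ignores flags such as IPC_NOWAIT and SEM_UNDO.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified operation: delta == 0 means "wait for zero". */
struct op {
    unsigned short num;
    short delta;
};

#define MAX_SEMS 16

/*
 * Apply the operations in array order to a private copy of the values.
 * Returns 1 if the whole sequence could complete without blocking,
 * 0 if some step would have to sleep.
 */
int would_succeed(const int *vals, size_t nvals,
                  const struct op *ops, size_t nops)
{
    int tmp[MAX_SEMS];
    size_t i;

    assert(nvals <= MAX_SEMS);
    for (i = 0; i < nvals; i++)
        tmp[i] = vals[i];

    for (i = 0; i < nops; i++) {
        int *v;

        assert(ops[i].num < nvals);
        v = &tmp[ops[i].num];
        if (ops[i].delta == 0) {
            if (*v != 0)
                return 0;       /* wait-for-zero would block */
        } else {
            if (*v + ops[i].delta < 0)
                return 0;       /* decrement would block */
            *v += ops[i].delta;
        }
    }
    return 1;
}
```

With a single semaphore at value 10, the sequence (-10, wait-for-zero, +10) succeeds in its original order, while the reordering (wait-for-zero, -10, +10) would block on the first step. Both orders are valid outputs of a sort keyed on semnum alone.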
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Zach Brown on 13 Apr 2010 14:30

> What about Zach's simpler wakeup API?

It's making slow progress in the background as a longer-term experiment.

http://oss.oracle.com/~zab/wake-many/

That URL still has an API description, patches, and little test
utilities for the simple first draft.

- z

From: Manfred Spraul on 14 Apr 2010 12:20

On 04/13/2010 08:19 PM, Chris Mason wrote:
> On Wed, Apr 14, 2010 at 04:09:45AM +1000, Nick Piggin wrote:
>
>> On Tue, Apr 13, 2010 at 01:39:41PM -0400, Chris Mason wrote:
>>
>> The other thing I don't know if your patch gets right is requeueing
>> of the operations. When you requeue from one list to another, then you
>> seem to lose ordering with other pending operations, so that would
>> seem to break the API as well (can't remember if the API strictly
>> mandates FIFO, but anyway it can open up starvation cases).
>>
> I don't see anything in the docs about the FIFO order. I could add an
> extra sort on sequence number pretty easily, but is the starvation case
> really that bad?
>
>
How do you want to determine the sequence number?
Is atomic_inc_return() on a per-semaphore array counter sufficiently fast?
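
For illustration, a per-array sequence counter could look roughly like the sketch below. It uses C11 atomics in userspace as a stand-in for the kernel's atomic_inc_return(); all struct and function names are hypothetical and none of this is taken from the patch. Whether the shared counter is fast enough under contention is exactly the open question, and this sketch does not answer it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a semaphore array with one shared counter. */
struct sem_array_sketch {
    atomic_ulong next_seq;
};

/* Hypothetical stand-in for a queued (sleeping) operation. */
struct waiter_sketch {
    unsigned long seq;
};

void sem_array_sketch_init(struct sem_array_sketch *sma)
{
    assert(sma != NULL);
    atomic_init(&sma->next_seq, 0);
}

/*
 * Stamp a waiter with the next sequence number. atomic_fetch_add returns
 * the old value, so adding 1 yields the value after the increment, which
 * is what atomic_inc_return() would give in the kernel.
 */
void stamp_waiter(struct sem_array_sketch *sma, struct waiter_sketch *w)
{
    assert(sma != NULL && w != NULL);
    w->seq = atomic_fetch_add(&sma->next_seq, 1) + 1;
}

/* qsort-style comparison: the oldest waiter (smallest seq) sorts first. */
int waiter_compare(const void *a, const void *b)
{
    const struct waiter_sketch *wa = a;
    const struct waiter_sketch *wb = b;

    return (wa->seq > wb->seq) - (wa->seq < wb->seq);
}
```

Sorting requeued waiters with waiter_compare would restore arrival order across lists, at the cost of one atomic read-modify-write on a shared cache line per queued operation.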

>> I was looking at doing a sequence number to be able to sort these, but
>> it ended up getting overly complex (and SAP was only using simple ops so
>> it didn't seem to need much better).
>>
>> We want to be careful not to change semantics at all. And it gets
>> tricky quickly :( What about Zach's simpler wakeup API?
>>
> Yeah, that's why my patches include code to handle userland sending
> duplicate semids. Zach's simpler API is cooking too, but if I can get
> this done without insane complexity it helps with more than just the
> post/wait oracle workload.
>
>
What is the Oracle workload, which multi-sembuf operations does it use?
How many semaphores are in one array?

When the last optimizations were written, I searched a bit:
- postgres uses per-process semaphores, with small semaphore arrays.
[process sleeps on its own semaphore and is woken up by someone
else when it can make progress]
- with Google, I couldn't find anything relevant that uses multi-sembuf
semop() calls.

And I agree with Nick: We should be careful about changing the API.

--
Manfred

From: Manfred Spraul on 14 Apr 2010 15:20

On 04/14/2010 07:33 PM, Chris Mason wrote:
> On Wed, Apr 14, 2010 at 06:16:53PM +0200, Manfred Spraul wrote:
>
>> On 04/13/2010 08:19 PM, Chris Mason wrote:
>>
>>> On Wed, Apr 14, 2010 at 04:09:45AM +1000, Nick Piggin wrote:
>>>
>>>> On Tue, Apr 13, 2010 at 01:39:41PM -0400, Chris Mason wrote:
>>>> The other thing I don't know if your patch gets right is requeueing
>>>> of the operations. When you requeue from one list to another, then you
>>>> seem to lose ordering with other pending operations, so that would
>>>> seem to break the API as well (can't remember if the API strictly
>>>> mandates FIFO, but anyway it can open up starvation cases).
>>>>
>>> I don't see anything in the docs about the FIFO order. I could add an
>>> extra sort on sequence number pretty easily, but is the starvation case
>>> really that bad?
>>>
>>>
>> How do you want to determine the sequence number?
>> Is atomic_inc_return() on a per-semaphore array counter sufficiently fast?
>>
> I haven't tried yet, but hopefully it won't be a problem. A later patch
> does atomics on the reference count and it doesn't show up in the
> profiles.
>
>
>>
>>>> I was looking at doing a sequence number to be able to sort these, but
>>>> it ended up getting overly complex (and SAP was only using simple ops so
>>>> it didn't seem to need much better).
>>>>
>>>> We want to be careful not to change semantics at all. And it gets
>>>> tricky quickly :( What about Zach's simpler wakeup API?
>>>>
>>> Yeah, that's why my patches include code to handle userland sending
>>> duplicate semids. Zach's simpler API is cooking too, but if I can get
>>> this done without insane complexity it helps with more than just the
>>> post/wait oracle workload.
>>>
>>>
>> What is the Oracle workload, which multi-sembuf operations does it use?
>> How many semaphores are in one array?
>>
>> When the last optimizations were written, I searched a bit:
>> - postgres uses per-process semaphores, with small semaphore arrays.
>> [process sleeps on its own semaphore and is woken up by someone
>> else when it can make progress]
>>
> This is similar to Oracle (and the sembench program). Each process has
> a semaphore and when it is waiting for a commit it goes to sleep on it.
> They are woken up in bulk with semtimedop calls from a single process.
>
>
Hmm. Thus you have:
- single-sembuf decrease operations that frequently wait.
- multi-sembuf increase operations.

What about optimizing for that case?
Increase operations succeed immediately, so they never sleep and
complex_count stays 0.

After an update operation, we can scan only the simple_lists of the
semaphores that have seen an increase instead of checking the global
list, as long as no complex operations are waiting.
Right now, we give up if the update operation itself was a complex
operation, but that does not matter: all that matters are the sleeping
operations, not the operation that did the wakeup.
I've attached an untested idea.
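
The attached patch is not part of this archive. As a rough userspace model of the decision described above (not the actual patch), the choice of which lists to scan could be sketched as follows; plan_wakeup_scan and its parameters are hypothetical names, and only complex_count is taken from the discussion.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decide which wait lists to scan after an update operation.
 *
 * complex_count: number of sleeping complex (multi-sembuf) operations.
 * increased[i]:  nonzero if the update raised semaphore i.
 * to_scan:       output array with room for nsems indices.
 *
 * Returns -1 if the global pending list must be scanned, otherwise the
 * number of per-semaphore simple lists written to to_scan.
 */
long plan_wakeup_scan(int complex_count, const int *increased,
                      size_t nsems, size_t *to_scan)
{
    long n = 0;
    size_t i;

    assert(complex_count >= 0);

    /* A sleeping complex operation may depend on any semaphore. */
    if (complex_count > 0)
        return -1;

    /*
     * Only the sleepers matter, not the shape of the update: visit the
     * simple list of every semaphore that was increased.
     */
    for (i = 0; i < nsems; i++) {
        if (increased[i])
            to_scan[n++] = i;
    }
    return n;
}
```

In the post/wait workload from this thread, a single multi-sembuf increase would then visit one short list per woken process instead of walking every pending operation in the array.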

> But oracle also uses semaphores for locking in a traditional sense.
>
> Putting the waiters into a per-semaphore list is really only part of the
> speedup. The real boost comes from the patch to break up the locks into
> a per semaphore lock.
>
>
Ok. Then simple tricks won't help.
How many semaphores are in one array?

--
Manfred

From: Manfred Spraul on 15 Apr 2010 12:40

On 04/14/2010 09:50 PM, Chris Mason wrote:
> On a big system I saw about 4000 semaphores total. The database will
> just allocate as many as it can into a single array and keep creating
> arrays until it has all it needs.
>
>
What happens if SEMMSL is reduced (first entry in /proc/sys/kernel/sem)?
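
For reference, /proc/sys/kernel/sem holds four whitespace-separated values in the order SEMMSL, SEMMNS, SEMOPM, SEMMNI. A minimal sketch for reading them, with hypothetical struct and function names:

```c
#include <assert.h>
#include <stdio.h>

/* The four limits, in the order they appear in /proc/sys/kernel/sem. */
struct sem_limits {
    int semmsl;  /* max semaphores per array */
    int semmns;  /* max semaphores system-wide */
    int semopm;  /* max operations per semop() call */
    int semmni;  /* max number of arrays */
};

/* Parse one line of text; returns 0 on success, -1 on malformed input. */
int parse_sem_limits(const char *text, struct sem_limits *out)
{
    assert(text != NULL && out != NULL);
    if (sscanf(text, "%d %d %d %d", &out->semmsl, &out->semmns,
               &out->semopm, &out->semmni) != 4)
        return -1;
    return 0;
}

/* Read the live values; returns -1 if the file is missing or unreadable. */
int read_sem_limits(struct sem_limits *out)
{
    char buf[128];
    char *line;
    FILE *f = fopen("/proc/sys/kernel/sem", "r");

    if (f == NULL)
        return -1;
    line = fgets(buf, sizeof(buf), f);
    fclose(f);
    return line != NULL ? parse_sem_limits(buf, out) : -1;
}
```

Lowering the first value forces an application that fills each array to capacity to spread its semaphores over more, smaller arrays.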

--
Manfred
