[PATCH 0/8] Suspend block api (version 8) [Kernel]

Prev: [ANN] Linux Security Summit 2010 - Announcement and CFP
Next: [PATCH 4/8] PM: suspend_block: Add debugfs file

From: Rafael J. Wysocki on 1 Jun 2010 18:10

On Tuesday 01 June 2010, Alan Stern wrote:
> On Tue, 1 Jun 2010, Neil Brown wrote:
>
> > > As I said before, we generally can't prevent such things from happening,
> > > because even if we handle the particular race described above, it still is
> > > possible that the event will be "lost" if it arrives just a bit later (eg.
> > > during a suspend-toggle switch). So the PRE_SUSPEND thing won't really
> > > solve the entire problem while increasing complexity.
> > >
> > > > My freerunner has a single core so without CONFIG_PREEMPT it may be that
> > > > there is no actual race-window - maybe the PRE_SUSPENDs will all run before a
> > > > soft_irq thread has a chance to finish handling of the interrupt (my
> > > > knowledge of these details is limits). But on a muilti-core device I think
> > > > there would definitely be a race-window.
> > >
> > > Yes, there always will be a race window. The only thing we can do is to
> > > narrow it, but we cannot really close it (at least not on a PC, but I'm not
> > > really sure it can be closed at all).
> >
> > Well you are the expert so I assume you are right, but I would really like to
> > understand why this is.
> >
> > I guess that if the event was delivered as an edge-triggered interrupt which
> > the interrupt controller couldn't latch, then it might get lost. Is that
> > what happens?
>
> You're barking up the wrong tree. If I understand correctly, Rafael is
> saying that there's a race involving events which are not _wakeup_
> events. If a non-wakeup event arrives shortly before a suspend, it can
> have its normal effect. If it arrives while a suspend is in progress,
> its delivery may be delayed until the system resumes. And if it
> arrives after the system is fully suspended, it may never be noticed at
> all.
>
> With wakeup events the problem isn't so bad. Wakeup events are always
> noticed, and if the system is designed properly they will either abort
> a suspend-in-progress or else cause the system to resume as soon as the
> suspend is complete. (Linux is not yet properly designed in this
> sense.)
>
> Or maybe I'm misunderstanding also, and Rafael is saying that there's
> a race involving events whose meaning changes depending on whether or
> not the system is asleep. This is obviously true and clearly
> unavoidable.

In fact I was referring to both in different places. That's why what I said
wasn't too clear, sorry about that.

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Rafael J. Wysocki on 1 Jun 2010 18:30

On Tuesday 01 June 2010, James Bottomley wrote:
> On Tue, 2010-06-01 at 14:51 +0100, Matthew Garrett wrote:
> > On Mon, May 31, 2010 at 04:21:09PM -0500, James Bottomley wrote:
> >
> > > You're the one mentioning x86, not me. I already explained that some
> > > MSM hardware (the G1 for example) has lower power consumption in S3
> > > (which I'm using as an ACPI shorthand for suspend to ram) than any
> > > suspend from idle C state. The fact that current x86 hardware has the
> > > same problem may be true, but it's not entirely relevant.
> >
> > As long as you can set a wakeup timer, an S state is just a C state with
> > side effects. The significant one is that entering an S state stops the
> > process scheduler and any in-kernel timers. I don't think Google care at
> > all about whether suspend is entered through an explicit transition or
> > something hooked into cpuidle - the relevant issue is that they want to
> > be able to express a set of constraints that lets them control whether
> > or not the scheduler keeps on scheduling, and which doesn't let them
> > lose wakeup events in the process.
>
> Exactly, so my understanding of where we currently are is:

Thanks for the recap.

> 1. pm_qos will be updated to be able to express the android suspend
> blockers as interactivity constraints (exact name TBD, but
> probably /dev/cpu_interactivity)

I think that's not been decided yet precisely enough. I saw a few ideas
here and there in the thread, but which of them are we going to follow?

> 2. pm_qos will be updated to be callable from atomic context
> 3. pm_qos will be updated to export statistics initially closely
> matching what suspend blockers provides (simple update of the rw
> interface?)
>
> After this is done, the current android suspend block patch becomes a
> re-expression in kernel space in terms of pm_qos, with the current
> userspace wakelocks being adapted by the android framework into pm_qos
> requirements expressed to /dev/cpu_interactivity (or whatever name is
> chosen). Then opportunistic suspend is either a small add-on kernel
> patch they have in their tree to suspend when the interactivity
> constraint goes to NONE, or it could be done entirely by a userspace
> process. Long term this could migrate to the freezer and suspend from
> idle approach as the various problem timers get fixed.
>
> I think the big unresolved issue is the stats extension. For android,
> we need just a name written along with the value, so we have something
> to hang the stats off ... current pm_qos userspace users just write a
> value, so the name would be optional. From the kernel, we probably just
> need an additional API that takes a stats name or NULL if none
> (pm_qos_add_request_named()?). Then reading the stats could be done by
> implementing a fops read routine on the misc device.

Is the original idea of having that information in debugfs objectionable?

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: James Bottomley on 1 Jun 2010 18:40

On Wed, 2010-06-02 at 00:24 +0200, Rafael J. Wysocki wrote:
> On Tuesday 01 June 2010, James Bottomley wrote:
> > On Tue, 2010-06-01 at 14:51 +0100, Matthew Garrett wrote:
> > > On Mon, May 31, 2010 at 04:21:09PM -0500, James Bottomley wrote:
> > >
> > > > You're the one mentioning x86, not me. I already explained that some
> > > > MSM hardware (the G1 for example) has lower power consumption in S3
> > > > (which I'm using as an ACPI shorthand for suspend to ram) than any
> > > > suspend from idle C state. The fact that current x86 hardware has the
> > > > same problem may be true, but it's not entirely relevant.
> > >
> > > As long as you can set a wakeup timer, an S state is just a C state with
> > > side effects. The significant one is that entering an S state stops the
> > > process scheduler and any in-kernel timers. I don't think Google care at
> > > all about whether suspend is entered through an explicit transition or
> > > something hooked into cpuidle - the relevant issue is that they want to
> > > be able to express a set of constraints that lets them control whether
> > > or not the scheduler keeps on scheduling, and which doesn't let them
> > > lose wakeup events in the process.
> >
> > Exactly, so my understanding of where we currently are is:
>
> Thanks for the recap.
>
> > 1. pm_qos will be updated to be able to express the android suspend
> > blockers as interactivity constraints (exact name TBD, but
> > probably /dev/cpu_interactivity)
>
> I think that's not been decided yet precisely enough. I saw a few ideas
> here and there in the thread, but which of them are we going to follow?

Well, android only needs two states (block and don't block), so that
gets translated as 2 s32 values (say 0 and INT_MAX). I've seen defines
like QOS_INTERACTIVE and QOS_NONE (or QOS_DRECKLY or QOS_MANANA) to
describe these, but if all we're arguing over is the define name, that's
progress.

The other piece they need is the suspend block name, which comes with
the stats API, and finally we need to decide what the actual constraint
is called (which is how the dev node gets its name) ...

> > 2. pm_qos will be updated to be callable from atomic context
> > 3. pm_qos will be updated to export statistics initially closely
> > matching what suspend blockers provides (simple update of the rw
> > interface?)
> >
> > After this is done, the current android suspend block patch becomes a
> > re-expression in kernel space in terms of pm_qos, with the current
> > userspace wakelocks being adapted by the android framework into pm_qos
> > requirements expressed to /dev/cpu_interactivity (or whatever name is
> > chosen). Then opportunistic suspend is either a small add-on kernel
> > patch they have in their tree to suspend when the interactivity
> > constraint goes to NONE, or it could be done entirely by a userspace
> > process. Long term this could migrate to the freezer and suspend from
> > idle approach as the various problem timers get fixed.
> >
> > I think the big unresolved issue is the stats extension. For android,
> > we need just a name written along with the value, so we have something
> > to hang the stats off ... current pm_qos userspace users just write a
> > value, so the name would be optional. From the kernel, we probably just
> > need an additional API that takes a stats name or NULL if none
> > (pm_qos_add_request_named()?). Then reading the stats could be done by
> > implementing a fops read routine on the misc device.
>
> Is the original idea of having that information in debugfs objectionable?

Well ... debugfs is usually used to get around the sysfs rules. In this
case, pm_qos has a dev interface ... I don't specifically object to
using debugfs, but I don't see any reason to forbid it from being a
simple dev read interface either.

James

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Neil Brown on 1 Jun 2010 20:50

On Tue, 1 Jun 2010 10:47:49 -0400 (EDT)
Alan Stern <stern(a)rowland.harvard.edu> wrote:

> On Tue, 1 Jun 2010, Neil Brown wrote:
>
> > > With wakeup events the problem isn't so bad. Wakeup events are always
> > > noticed, and if the system is designed properly they will either abort
> > > a suspend-in-progress or else cause the system to resume as soon as the
> > > suspend is complete. (Linux is not yet properly designed in this
> > > sense.)
> >
> > This is exactly the issue. Linux doesn't get this right. A wakeup that is
> > processed by the driver just before suspend may not get all the way out to
> > it's final destination (possibly in meat-space) until after some other event
> > wakes the device. It is not possible to guarantee full delivery of a wakeup
> > event with Linux in its current state. This is a bug.
>
> That wasn't what I meant. An important requirement is that a wakeup
> event received after the device has been suspended, but before the
> system suspend is complete, should cause the system suspend to abort.
> This is one of the things suspend blockers were meant to help with. In
> theory it should already work correctly -- but it doesn't. That's a
> real bug.
>
> The other things you're talking about aren't really bugs. That is,
> they are things which the kernel _ought_ to do but it wasn't _intended_
> to do. They are misfeatures: failures of design, not failures of
> execution.

A fairly subtle distinction, but I can accept it. We are working with a
design failure.
But when a design failure results in a feature not being usable in a
race-free way, I would still call that failure of design a bug - a bug in the
design.

>
> > I don't think I'm missing that. Wakeup events are the only events of
> > interest. But wakeup events are more than just low-level hardware events.
> > They are also the virtual events that are a direct consequence of the
> > original hardware event.
>
> This is a matter of semantics and it is highly debatable.
>
> > A UART asserts 'data ready' which is routed to an IO interrupt line that
> > wakes the device. This is a wakeup event.
>
> Yes.
>
> > So are the bits that appear on the data line which represent the data
> > So are the characters that appear on the serial port
> > So is the "here is a message from the GSM processor" event
> > So is the "there is a new TXT message" event
> > So is the "'beep beep beep' out the speaker" event
>
> None of those things can cause the system to wake up after it is
> suspended. Indeed, they can't happen at all if the CPU isn't running
> (unless the GSM processor runs autonomously -- let's ignore such
> details). Hence it makes little sense to call them "wakeup" events.
>

I think we are seeing this very differently. In my mind these are in one
sense all the same event, just represented in different ways.

One of the reasons that we have an OS is to support levels of abstraction.
A disk drives stores sectors, but I want to store files
A monitor displays pixels, but I want to display text
A network sends packets, but I want to send Email.

The OS (including but not only the kernel) handles the conversion.
It is still my Email that is going out over the network even though the
network card has no concept of 'email'.

Similarly the event of importance above in the notification that a TXT has
arrived.
At the lowest level, it appears as 'data-ready' on a UART. At the highest
level is it sound-waves in the air. But it is the same event. And the role
of the OS is to allow me to treat it all as one event - one abstract concept.

Obviously the involvement of the kernel ends when the event reaches or
crosses the boundary to user-space, and user-space must be responsible for
taking it to the boundary to meat-space.
But the kernel needs to ensure that this holistic wake-up event can be
handled without racing with a suspend request.
If you just protect the lowest level representation of the event, but don't
allow the higher level representations to be protected, then it is a bit like
a filesystem that requires you to send SCSI commands to the disk drive if you
want to be sure your data is safe.

> > All these events are wakeup events, and should abort or instant-resume the
> > system. Currently only the first can be guaranteed to do this (... and maybe
> > the second, depending on the details of the implementation).
> > suspend-blocks effectively pass a lock from one event to the next to ensure
> > that all of these event block the suspend. There are various other ways to
> > achieve the same thing, and I think it can be done with no significant change
> > to the API. But some how we need to allow all of these events to be
> > linked wake-up events, not just the first one.
>
> That's not entirely clear. Yes, for Android's use case that's what you
> want. But in other situations maybe you don't. Consider the example
> of someone who closes the lid of their laptop and throws it in a
> backpack. The computer should suspend immediately; it shouldn't be
> delayed by these "virtual wakeup" events.

Completely agree. Old behaviour must remain unchanged, new behaviour must be
explicitly requested.

At each level, it must be explicitly stated that an event is to be treated
like a wake-up event. I believe that is already the case in drivers - they
need to explicitly configure the interrupt to cause a wake-from-suspend.

I've considered several possibilities for how user-space could explicitly say
that events are wake-event. My current favourite is to use fcntl(F_SETOWN)
to request a signal be sent to the process that is initiating the suspend
when the relevant events are ready to be consumed by user-space.
This would work for events that come through the input system (button press)
or through the network, or rtc alarm events. I suspect all potential
wake-events could be made to use this scheme, but I don't have an exhausting
list at present.

So a suspend request waits just long enough to close any race, checks if
there are signals pending, and proceeds with the suspend if there aren't.
The "just long enough" is a little tricky to measure, though the in-kernel
part of suspend-blockers presumably can be used to get that right.

If the suspending process hasn't requested the delivery of these signals, the
suspend will possibly pause to make sure that any events-in-transit in the
kernel have reached their in-kernel destination (milliseconds at most) and
then proceed with the suspend as normal.

So yes, there are different use cases and we should support all of them,
both "I shut the lid and my laptop really stays asleep" and "I never miss a
TXT because my phone went to sleep at a bad time". The process that
initiates the suspend has a role in choosing what can wake it up.

Thanks,
NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Arve Hjønnevåg on 1 Jun 2010 21:20

On Tue, Jun 1, 2010 at 3:36 PM, James Bottomley <James.Bottomley(a)suse.de> wrote:
> On Wed, 2010-06-02 at 00:24 +0200, Rafael J. Wysocki wrote:
>> On Tuesday 01 June 2010, James Bottomley wrote:
>> > On Tue, 2010-06-01 at 14:51 +0100, Matthew Garrett wrote:
>> > > On Mon, May 31, 2010 at 04:21:09PM -0500, James Bottomley wrote:
>> > >
>> > > > You're the one mentioning x86, not me. �I already explained that some
>> > > > MSM hardware (the G1 for example) has lower power consumption in S3
>> > > > (which I'm using as an ACPI shorthand for suspend to ram) than any
>> > > > suspend from idle C state. �The fact that current x86 hardware has the
>> > > > same problem may be true, but it's not entirely relevant.
>> > >
>> > > As long as you can set a wakeup timer, an S state is just a C state with
>> > > side effects. The significant one is that entering an S state stops the
>> > > process scheduler and any in-kernel timers. I don't think Google care at
>> > > all about whether suspend is entered through an explicit transition or
>> > > something hooked into cpuidle - the relevant issue is that they want to
>> > > be able to express a set of constraints that lets them control whether
>> > > or not the scheduler keeps on scheduling, and which doesn't let them
>> > > lose wakeup events in the process.
>> >
>> > Exactly, so my understanding of where we currently are is:
>>
>> Thanks for the recap.
>>
>> > � � �1. pm_qos will be updated to be able to express the android suspend
>> > � � � � blockers as interactivity constraints (exact name TBD, but
>> > � � � � probably /dev/cpu_interactivity)
>>
>> I think that's not been decided yet precisely enough. �I saw a few ideas
>> here and there in the thread, but which of them are we going to follow?
>
> Well, android only needs two states (block and don't block), so that
> gets translated as 2 s32 values (say 0 and INT_MAX). �I've seen defines
> like QOS_INTERACTIVE and QOS_NONE (or QOS_DRECKLY or QOS_MANANA) to
> describe these, but if all we're arguing over is the define name, that's
> progress.

I think we need separate state constraints for suspend and idle low
power modes. On the msm platform only a subset of the interrupts can
wake up from the low power mode, so we block the use if the low power
mode from idle while other interrupts are enabled. We do not block
suspend however if those interrupts are not marked as wakeup
interrupts. Most constraints that prevent suspend are not hardware
specific and should not prevent entering low power modes from idle. In
other words we may need to prevent low power idle modes while allowing
suspend, and we may need to prevent suspend while allowing low power
idle modes.

It would also be good to not have an implementation that gets slower
and slower the more clients you have. With binary constraints this is
trivial.

>
> The other piece they need is the suspend block name, which comes with
> the stats API, and finally we need to decide what the actual constraint
> is called (which is how the dev node gets its name) ...
>
>> > � � �2. pm_qos will be updated to be callable from atomic context
>> > � � �3. pm_qos will be updated to export statistics initially closely
>> > � � � � matching what suspend blockers provides (simple update of the rw
>> > � � � � interface?)

4. It would be useful to change pm_qos_add_request to not allocate
anything so can add constraints from init functions that currently
cannot fail.

>> >
>> > After this is done, the current android suspend block patch becomes a
>> > re-expression in kernel space in terms of pm_qos, with the current
>> > userspace wakelocks being adapted by the android framework into pm_qos
>> > requirements expressed to /dev/cpu_interactivity (or whatever name is
>> > chosen). �Then opportunistic suspend is either a small add-on kernel
>> > patch they have in their tree to suspend when the interactivity
>> > constraint goes to NONE, or it could be done entirely by a userspace
>> > process. �Long term this could migrate to the freezer and suspend from
>> > idle approach as the various problem timers get fixed.
>> >
>> > I think the big unresolved issue is the stats extension. �For android,
>> > we need just a name written along with the value, so we have something
>> > to hang the stats off ... current pm_qos userspace users just write a
>> > value, so the name would be optional. �From the kernel, we probably just
>> > need an additional API that takes a stats name or NULL if none
>> > (pm_qos_add_request_named()?). �Then reading the stats could be done by
>> > implementing a fops read routine on the misc device.
>>
>> Is the original idea of having that information in debugfs objectionable?
>
> Well ... debugfs is usually used to get around the sysfs rules. �In this
> case, pm_qos has a dev interface ... I don't specifically object to
> using debugfs, but I don't see any reason to forbid it from being a
> simple dev read interface either.
>

We don't currently have a dev interface for stats so this is not an
immediate requirement. The suspend blocker debugfs interface is just
as good as the proc interface we have for wakelocks.

--
Arve Hj�nnev�g
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
Prev: [ANN] Linux Security Summit 2010 - Announcement and CFP
Next: [PATCH 4/8] PM: suspend_block: Add debugfs file