From: Christoph Hellwig on
On Wed, Jun 23, 2010 at 09:44:20PM -0400, Vivek Goyal wrote:
> Let me explain the general idling logic and then see if it makes sense in case
> of WRITE_SYNC.
>
> Once a request has completed, if the cfq queue is empty, we have two choices.
> Either expire the cfq queue and move on to dispatch requests from a
> different queue or we idle on the queue hoping we will get more IO from
> same process/queue.

queues are basically processes in this context?

> Idling can help (on SATA disks with high seek cost), if
> our guess was right and soon we got another request from same process. We
> cut down on number of seeks hence increased throghput.

I don't really understand the logic behind this. If we lots of I/O
that actually is close to each other we should generally submit it in
one batch. That is true for pagecache writeback, that is true for
metadata (at least in XFS..), and it's true for any sane application
doing O_DIRECT / O_SYNC style I/O.

What workloads produde I/O that is local (not random) writes with small
delays between the I/O requests?

I see the point of this logic for reads where various workloads have
dependent reads that might be close to each other, but I don't really
see any point for writes.

> So looks like fsync path will do bunch of IO and then will wait for jbd thread
> to finish the work. In this case idling is waste of time.

Given that ->writepage already does WRITE_SYNC_PLUG I/O which includes
REQ_NODILE I'm still confused why we still have that issue.

> I guess same will
> be true for umount and sync() path. But same probably is not necessarily true
> for a O_DIRECT writer (database comes to mind), and for O_SYNC writer
> (virtual machines?).

For virtual machines idling seems like a waste of ressources. If we
have sequential I/O we dispatch in batches - in fact qemu even merges
sequential small block I/O it gets from the guest into one large request
we hand off to the host kernel. For reads the same caveat as above
applies as read requests as handed through 1:1 from the guest.

> O_SYNC writers will get little disk share in presence of heavy buffered
> WRITES. If we choose to not special case WRITE_SYNC and continue to
> idle on the queue then we probably are wasting time and reducing overall
> throughput. (The fsync() case Jeff is running into).

Remember that O_SYNC writes are implemented as normal buffered write +
fsync (a range fsync to be exact, but that doesn't change a thing).

And that's what they conceptually are anyway, so treating a normal
buffered write + fsync different from an O_SYNC write is not only wrong
conceptuall but also in implementation. You have the exact same issue
of handing off work to the journal commit thread in extN. Note that
the log write (or at least parts of it) will always use WRITE_BARRIER,
which completey bypasses the I/O scheduler.

> So one possible way could be that don't try to special case synchronous
> writes and continue to idle on the queue based on other parameters. If
> kernel/higher layers have knowledge that we are not going to issue more
> IO in same context, then they should explicitly call blk_yield(), to
> stop idling and give up slice.

We have no way to know what userspace will do if we are doing
O_SYNC/O_DIRECT style I/O or use fsync. We know that we will most
likely continue kicking things from the same queue when doing page
writeback. One thing that should help with this is Jens' explicit
per-process plugging stuff, which I noticed he recently updated to a
current kernel.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Vivek Goyal on
On Fri, Jun 25, 2010 at 01:03:20PM +0200, Christoph Hellwig wrote:
> On Wed, Jun 23, 2010 at 09:44:20PM -0400, Vivek Goyal wrote:
> > Let me explain the general idling logic and then see if it makes sense in case
> > of WRITE_SYNC.
> >
> > Once a request has completed, if the cfq queue is empty, we have two choices.
> > Either expire the cfq queue and move on to dispatch requests from a
> > different queue or we idle on the queue hoping we will get more IO from
> > same process/queue.
>
> queues are basically processes in this context?

Yes. Basically IO context. Generally one per process but multiple threads
can also share the IO context and requests from all the threads will go
into single cfq queue.

>
> > Idling can help (on SATA disks with high seek cost), if
> > our guess was right and soon we got another request from same process. We
> > cut down on number of seeks hence increased throghput.
>
> I don't really understand the logic behind this. If we lots of I/O
> that actually is close to each other we should generally submit it in
> one batch. That is true for pagecache writeback, that is true for
> metadata (at least in XFS..), and it's true for any sane application
> doing O_DIRECT / O_SYNC style I/O.
>
> What workloads produde I/O that is local (not random) writes with small
> delays between the I/O requests?
>

That's a good question. I don't have an answer for that. I have most of the
time done my testing using fio where one can create synthetic workload of
doing O_DIRECT/O_SYNC sequential/local writes after small delays.

I think I need to run blktrace on some of the real life workloads and monitor
for WRITE_SYNC pattern.

> I see the point of this logic for reads where various workloads have
> dependent reads that might be close to each other, but I don't really
> see any point for writes.
>
> > So looks like fsync path will do bunch of IO and then will wait for jbd thread
> > to finish the work. In this case idling is waste of time.
>
> Given that ->writepage already does WRITE_SYNC_PLUG I/O which includes
> REQ_NODILE I'm still confused why we still have that issue.

In current form, cfq honors REQ_NOIDLE conditionally and that's why we
still have the issue. If you look at cfq_completed_request(), we continue
to idle in following two cases.

- If we classifed the queue as SYNC_WORKLOAD.
- If there is another random read/write happening on sync-noidle service
tree.

SYNC_WORKLOAD means that cfq thinks this particular queue is doing sequential
IO. For random IO queues, we don't idle on each individual queue but a
group of queue.

In jeff's testing, fsync thread/queue sometimes is viewed as sequential
workload and goes on SYNC_WORKLOAD tree. In that case even if request is
REQ_NOIDLE, we will continue to idle hence fsync issue.

This logic was introduced by corrado to ensure WRITE_SYNC does not
lose fair share. Now we are back to the same question, what is the workload
which does that.

8e55063 cfq-iosched: fix corner cases in idling logic

Before this patch, we will simply not do any idling on WRITE_SYNC. That
means no idling on O_SYNC/fsync paths but still idle on direct IO
WRITE_SYNC. Which is a bit of discrepancy.

>
> > I guess same will
> > be true for umount and sync() path. But same probably is not necessarily true
> > for a O_DIRECT writer (database comes to mind), and for O_SYNC writer
> > (virtual machines?).
>
> For virtual machines idling seems like a waste of ressources. If we
> have sequential I/O we dispatch in batches - in fact qemu even merges
> sequential small block I/O it gets from the guest into one large request
> we hand off to the host kernel. For reads the same caveat as above
> applies as read requests as handed through 1:1 from the guest.
>

Ok, That's good to know.

> > O_SYNC writers will get little disk share in presence of heavy buffered
> > WRITES. If we choose to not special case WRITE_SYNC and continue to
> > idle on the queue then we probably are wasting time and reducing overall
> > throughput. (The fsync() case Jeff is running into).
>
> Remember that O_SYNC writes are implemented as normal buffered write +
> fsync (a range fsync to be exact, but that doesn't change a thing).
>
> And that's what they conceptually are anyway, so treating a normal
> buffered write + fsync different from an O_SYNC write is not only wrong
> conceptuall but also in implementation. You have the exact same issue
> of handing off work to the journal commit thread in extN.

Actually right now I think O_DIRECT is getting different treatment as
it does not mark queue IO as RQ_NOIDLE. As you said O_SYNC is just fsync
on a specific range, so both the paths will mark RQ_NOIDLE and get treated
same in IO scheduler.

Intially Jens treated O_DIRECT differently and then Jeff put a patch which
accidently got rid of that difference and finally I had reintroduced the
O_DIRECT flag.

d9449ce Fix regression in direct writes performance due to WRITE_ODIRECT flag

I have explained the effect of not idling on DIRECT writes in changelog. It
gets very little BW in presence of sequential reads.

With idling on WRITE_SYNC, distribution looks as follows.

direct writes: aggrb=2,968KB/s
readers : aggrb=101MB/s

Without idling on WRITE_SYNC, distribution of BW looks as follows.

direct write: aggrb=19KB/s
readers aggrb=137MB/s

So notice how sequential reads can starve a DIRECT sequential writer. Now
this is a synthetic workload created with the help of fio where sequential
writes are submitted with delay in between (i think so). We are back to
the question that the application should coalesce sequential writes and
submit bigger write requests. Otherwise you will lose share and get
starved out.

So going forward, I think there are few possible options.

- Stop idling on all the WRITE_SYNC IO. There is no reasonable way to
tell whether there will be more IO or not from applicatoin. This will
impact direct writes, O_SYNC writes and fsync().

If direct IO application is submitting writes with a delay in between
it can be starved out in presnce of competing workloads.

In this scheme we get rid of all rq_noidle logic, get rid of separate
flag for direct IO, WRITE_ODIRECT_PLUG, and also don't need any queue
yielding patches as we never idle on SYNC_WRITES.

The important thing to note here is that even direct random writes
will lose share here if we don't keep the queue busy continuously
and issue random writes after a delay. (database?).

- Do idling by default on WRITE_SYNC path. Implement Jeff's queue yielding
patches. That way O_SYNC/fsyn path will call blk_yield() and stop idling.
Direct IO write path will stil continue to idle (I think if file has already
been laid out?).

I have no idea in real world how SYNC_WRITE workloads look like and how
their traffic pattern is. So it is hard to say which scheme to go for.
Jens, what do you think of above?

> Note that
> the log write (or at least parts of it) will always use WRITE_BARRIER,
> which completey bypasses the I/O scheduler.
>
> > So one possible way could be that don't try to special case synchronous
> > writes and continue to idle on the queue based on other parameters. If
> > kernel/higher layers have knowledge that we are not going to issue more
> > IO in same context, then they should explicitly call blk_yield(), to
> > stop idling and give up slice.
>
> We have no way to know what userspace will do if we are doing
> O_SYNC/O_DIRECT style I/O or use fsync. We know that we will most
> likely continue kicking things from the same queue when doing page
> writeback. One thing that should help with this is Jens' explicit
> per-process plugging stuff, which I noticed he recently updated to a
> current kernel.

Ok. I will have a look at that.

Thanks
Vivek

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Christoph Hellwig on
On Sat, Jun 26, 2010 at 07:25:56PM +1000, Nick Piggin wrote:
> Biggest thing is multiple small files operations like on the same
> directory. Best case I measured back when doing AS io scheduler
> versus deadline was about 100x improvement on a uncached kernel
> grep workload when competing with a streaming writeout (the writeout
> probably ended up going somewhat slower naturally, but it is fairer).

As I mentioned below I absolutely see the case for reads. A normal
grep basically is a dependent read kind of workload. For for writes
it should either be O_SYNC-style workloads or batched I/O.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Nick Piggin on
On Fri, Jun 25, 2010 at 01:03:20PM +0200, Christoph Hellwig wrote:
> On Wed, Jun 23, 2010 at 09:44:20PM -0400, Vivek Goyal wrote:
> > Let me explain the general idling logic and then see if it makes sense in case
> > of WRITE_SYNC.
> >
> > Once a request has completed, if the cfq queue is empty, we have two choices.
> > Either expire the cfq queue and move on to dispatch requests from a
> > different queue or we idle on the queue hoping we will get more IO from
> > same process/queue.
>
> queues are basically processes in this context?
>
> > Idling can help (on SATA disks with high seek cost), if
> > our guess was right and soon we got another request from same process. We
> > cut down on number of seeks hence increased throghput.
>
> I don't really understand the logic behind this. If we lots of I/O
> that actually is close to each other we should generally submit it in
> one batch. That is true for pagecache writeback, that is true for
> metadata (at least in XFS..), and it's true for any sane application
> doing O_DIRECT / O_SYNC style I/O.
>
> What workloads produde I/O that is local (not random) writes with small
> delays between the I/O requests?

Biggest thing is multiple small files operations like on the same
directory. Best case I measured back when doing AS io scheduler
versus deadline was about 100x improvement on a uncached kernel
grep workload when competing with a streaming writeout (the writeout
probably ended up going somewhat slower naturally, but it is fairer).

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Christoph Hellwig on
On Fri, Jun 25, 2010 at 11:35:10PM -0400, Vivek Goyal wrote:
> This logic was introduced by corrado to ensure WRITE_SYNC does not
> lose fair share. Now we are back to the same question, what is the workload
> which does that.
>
> 8e55063 cfq-iosched: fix corner cases in idling logic
>
> Before this patch, we will simply not do any idling on WRITE_SYNC. That
> means no idling on O_SYNC/fsync paths but still idle on direct IO
> WRITE_SYNC. Which is a bit of discrepancy.

Corrado, can you explain what workloads you doing that commit for?
The commit message doesn't really given any useful information.


> - Stop idling on all the WRITE_SYNC IO. There is no reasonable way to
> tell whether there will be more IO or not from applicatoin. This will
> impact direct writes, O_SYNC writes and fsync().
>
> If direct IO application is submitting writes with a delay in between
> it can be starved out in presnce of competing workloads.

So what application does this?

> - Do idling by default on WRITE_SYNC path. Implement Jeff's queue yielding
> patches. That way O_SYNC/fsyn path will call blk_yield() and stop idling.
> Direct IO write path will stil continue to idle (I think if file has already
> been laid out?).

I don't think this makes much sense. And O_SYNC write / fsync are
defined to not return before the I/O makes it to disk, and for any
sane filesystem end with a WRITE_BARRIER request because of that.
So we end these with an explicit unplug anyway, no need for idling
logic.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/