From: Vivek Goyal
On Mon, Jul 19, 2010 at 11:19:21PM +0200, Corrado Zoccolo wrote:
> On Mon, Jul 19, 2010 at 10:44 PM, Vivek Goyal <vgoyal(a)redhat.com> wrote:
> > On Mon, Jul 19, 2010 at 01:32:24PM -0700, Divyesh Shah wrote:
> >> On Mon, Jul 19, 2010 at 11:58 AM, Vivek Goyal <vgoyal(a)redhat.com> wrote:
> >> > Yes it is mixed now for default CFQ case. Wherever we don't have the
> >> > capability to determine the slice_used, we charge IOPS.
> >> >
> >> > For slice_idle=0 case, we should charge IOPS almost all the time. Though
> >> > if there is a workload where single cfqq can keep the request queue
> >> > saturated, then current code will charge in terms of time.
> >> >
> >> > I agree that this is a little confusing. Maybe in the case of
> >> > slice_idle=0 we can always charge in terms of IOPS.
> >>
> >> I agree with Jeff that this is very confusing. Also there are
> >> absolutely no bets that one job may end up getting charged in IOPs for
> >> this behavior while other jobs continue getting charged in time for
> >> their IOs. Depending on the speed of the disk, this could be a huge
> >> advantage or disadvantage for the cgroup being charged in IOPs.
> >>
> >> It should be black or white, time or IOPs and also very clearly called
> >> out not just in code comments but in the Documentation too.
> >
> > Ok, how about always charging in IOPS when slice_idle=0?
> >
> > So on fast devices, admin/user space tool, can set slice_idle=0, and CFQ
> > starts doing accounting in IOPS instead of time. On slow devices we
> > continue to run with slice_idle=8 and nothing changes.
> >
> > Personally I feel that it is hard to sustain time based logic on high end
> > devices and still get good throughput. We could make CFQ a dual mode kind
> > of scheduler which is capable of doing accounting both in terms of time as
> > well as IOPS. When slice_idle !=0, we do accounting in terms of time and
> > it will be same CFQ as of today. When slice_idle=0, CFQ starts accounting
> > in terms of IOPS.
> There is another mode in which cfq can operate: for ncq ssds, it
> basically ignores slice_idle, and operates as if it was 0.
> This mode should also be handled as an IOPS counting mode.
> SSD mode, though, differs from rotational mode for the definition of
> "seekyness", and we should think if this mode is appropriate also for
> the other hardware where slice_idle=0 is beneficial.

I have always wondered what the practical difference is between
slice_idle=0 and rotational=0. I think the only difference is NCQ
detection: slice_idle=0 never idles, regardless of whether the device
does NCQ, while rotational=0 disables idling only if the device
supports NCQ.

If that's the case, then we can probably switch to slice_idle=0
internally once we have detected that an SSD supports NCQ, and get
rid of this confusion.

Well, looking more closely, there seems to be one more difference. With
an SSD and NCQ, we still idle on the sync-noidle tree. This seemingly
gives us protection from WRITES. I am not sure whether this is true for
good SSDs as well; I am assuming they give priority to reads and balance
things out. cfq_should_idle() is interesting though, in that we disable
idling for the sync-idle tree. So we idle on the sync-noidle tree but do
not provide any protection to sequential readers. Anyway, that's a minor
detail....

In fact we can switch to IOPS model for NCQ SSD also.
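
To make the proposal concrete, here is a rough Python sketch (not kernel
code) of the mode selection discussed above. The function names and
parameters are hypothetical. The sketch only assumes that IOPS accounting
applies when slice_idle is 0, or when the device is non-rotational and NCQ
has been detected, and that time accounting applies otherwise.

```python
def iops_mode(slice_idle, rotational, ncq_detected):
    """Return True when accounting would be done in IOPS.

    Assumed rule (illustrative only): slice_idle=0 always selects IOPS;
    a non-rotational device selects IOPS only once NCQ is detected.
    """
    if slice_idle == 0:
        return True
    return (not rotational) and ncq_detected


def charge(slice_idle, rotational, ncq_detected,
           time_used, requests_dispatched):
    """Pick the quantity a queue is charged for in the current mode."""
    if iops_mode(slice_idle, rotational, ncq_detected):
        return requests_dispatched  # IOPS mode: count requests
    return time_used                # time mode: measured slice usage
```

With this rule a rotational disk at slice_idle=8 is charged its time
slice, and the same disk at slice_idle=0 is charged per request, so the
accounting is never mixed within one configuration.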

> >
> > I think this change should bring us one step closer to our goal of one
> > IO scheduler for all devices.
>
> I think this is an interesting instance of a more general problem: cfq
> needs a cost function applicable to all requests on any hardware. The
> current function is a concrete one (measured time), but unfortunately
> it is not always applicable, because:
> - for fast hardware the resolution is too coarse (this can be fixed
> using higher resolution timers)

Yes this is fixable.

> - for hardware that allows parallel dispatching, we can't measure the
> cost of a single request (can we try something like average cost of
> the requests executed in parallel?).

This is the biggest problem: how to get the right estimate of time when
a request queue can have requests from multiple processes at the same
time.
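
One way to read the "average cost" suggestion quoted above is sketched
below in Python. It is only an illustration, not anything CFQ does: the
function name, the idea of a measurement window, and the per-queue
completion counts are all assumptions. The elapsed time of a window is
divided by the number of requests completed in it, and each queue is
charged that average for every request it completed.

```python
def split_window_time(elapsed, completed_by_queue):
    """Charge each queue the window's average per-request cost.

    elapsed: time covered by the window (any unit).
    completed_by_queue: dict mapping a queue id to the number of its
    requests completed in the window (hypothetical bookkeeping).
    """
    total = sum(completed_by_queue.values())
    if total == 0:
        return {q: 0.0 for q in completed_by_queue}
    avg_cost = elapsed / total
    return {q: avg_cost * n for q, n in completed_by_queue.items()}
```

Within a single window the charge is proportional to the request count,
so this behaves like IOPS accounting scaled by the measured speed of the
device. It still cannot tell a cheap request from an expensive one.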

> IOPS, instead, is a synthetic cost measure. It is a simplified model,
> that will approximate some devices (SSDs) better than others
> (multi-spindle rotational disks).

Agreed that IOPS is a simplified model.

> But if we want to go for the
> synthetic path, we can have more complex measures, that also take into
> account other parameters, as sequentiality of the requests,

Once we start dispatching requests from multiple cfq queues at a time,
the notion of sequentiality is lost (at least on the device).

> their size
> and so on, all parameters that may have still some impact on high-end
> devices.

Size is an interesting factor though. Again, we can only come up with
some kind of approximation, as this cost will vary from device to
device.

I think we can begin with something simple (IOPS) and if it works fine,
then we can take into account additional factors (especially size of
request) and factor that into the cost.
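
As a sketch of what "IOPS first, size later" could look like, the Python
snippet below charges one unit per request plus a term proportional to the
request size. The function name and both constants are made-up
placeholders; as noted above, the real weight of size would differ from
device to device.

```python
def request_cost(size_bytes, base_cost=1.0, bytes_per_unit=64 * 1024):
    """Synthetic cost of one request (illustrative constants only).

    base_cost is the pure IOPS part; the second term adds one extra
    unit for every bytes_per_unit bytes transferred.
    """
    return base_cost + size_bytes / bytes_per_unit
```

Setting bytes_per_unit very large makes the size term negligible, which
reduces this to plain IOPS accounting as the starting point.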

The only thing to keep in mind is that group scheduling will benefit
most from it. The notion of ioprio is fairly weak currently in CFQ
(especially on SSD and with slice_idle=0).

Thanks
Vivek

>
> Thanks,
> Corrado
> >
> > Jens, what do you think?
> >
> > Thanks
> > Vivek