What happened to computer architecture (and comp.arch?) [Computer Architecture]

Prev: VMWare tools killed my Mac OS X ?
Next: Software vs hardware floating-point [was Re: What happened ...]

From: Tim McCaffrey on 15 Sep 2009 11:05

In article <gL6dnWc2Y61AOzPXnZ2dnUVZ_j-dnZ2d(a)bestweb.net>, mayan(a)bestweb.net
says...
>
>Tim McCaffrey wrote:
>
>>
>> I have been working, of late, with device drivers that have multiple
threads
>> (either worker threads or running on user threads). In the case where it
was
>> talking to a real device, spinlock/sync issues became bad enough problem
that
>> I got negative scaling going from 2 to 4 CPUs. Analysis indicated that the
>> spin locks themselves were not the problem, it was any line of cache that
>> could be shared between CPUs (for instance, I implemented a "lock-less"
>> circular queue that had no spin locks. Still spent alot of time there due
to
>> cache thrashing).
>>
>> I think the trick is to 1) Don't share cache lines if you can help it, and
2)
>> push information instead of pulling it. The 1st will probably require deep
OS
>> changes, because this requires running on the "correct" CPU (this is
>> especially difficult for a device driver), and #2 will require hardware
(and
>> therefore OS) support.
>>
>> - Tim
>>
>
>Have you tried doing the following (assuming, of course, that your h/w
>has appropriate support):
>1. Allocate the line without reading (dcbz in PowerPC)
>2. Fill the line (normal stores)
>3. Force the line to memory (dcbf in PowerPC)
>You'll also have to sprinkle in some sync's no doubt.
>
>It looks like on the x86 architectures, you may be able to get the same
>effect by using write-combining (WC memory) and WBINVD (or maybe CLFLUSH).

That doesn't really get rid of the looong pause when a CPU has to read a cache
line into memory. If I could push it to another CPU, then there might some
situations that could be sped up.

Here is another case:

Device has a queue of requests in memory. Unfortunately, the queue entries
are not cache-line sized, so you get the following:

CPU #1:
Spin locks the queue (cache line miss if spin lock was grabbed by
another CPU).
Update pointers (ditto if another CPU wrote to the queue previously).
Write to the queue (ditto).
Write word to the device telling it has something in the queue.
Device:
Reads queue.
- CPU #1 is forced to write back any data that device wants to
read, including queue pointers and queue data.

CPU #2:
Spin locks the queue ....

I won't even get into the way the responses are handled (to protect the
guilty). I have already tried to explain to the vendor the problems in these
areas, but I'm not sure I got through as we abandoned working with them, at
least for now.

Some problems are inheirent with Windows/PC hardware, so it isn't just the
vendors fault.

As you can see, as you get more CPUs the likelyhood that the cache line gets
thrashed goes up. I got much better performance with a small benchmark
program with one CPU vs. 2 or more despite the fact the benchmark and the
device driver were both heavily multi-threaded. The downside was that I had
no idle time left with one CPU, whereas I had some (10-20%) with multiple
CPUs.

- Tim

From: Stephen Fuld on 15 Sep 2009 14:13

Terje Mathisen wrote:

snip

> To me, when I see an O(n*n) situation, I try to look for O(n*log(n)) or
> O(n):
>
> One obvious approach would be to have a hierarchy of locks:
>
> Just having two levels with sqrt(n) first-level barriers all protecting
> access to the single "real" lock would turn the problem into something
> that takes twice as long in the zero contention situation, but where the
> worst case becomes O(sqrt(n)*sqrt(n)) * 2 => O(n), right?
>
> Each thread could use a hash of its thread ID to determine which
> first-level lock to use.
>
> With log(n) levels, each lock would only have two users, making the time
> for each of them constant, and the total becomes O(log(n)).

I like this idea. You could even make it adaptive. Start with a
single, "primary" lock. If the contention for that lock gets too high
(the count on the atomic add instruction exceeds some threshold for some
time), you add a new layer of locks and have all new lock requesters use
that level. As the processes that are still contending for the primary
lock based on not using the new layer get their turn, complete their
work, and release the lock, the system slowly converts to the new two
layer scheme. Similarly, if contention decreases on the secondary
locks, you could eliminate them. Thus, in periods of low contention,
you avoid the overhead of the second lock request, but avoid the O N**2
problem as contention increases. This could even save some analysis up
front as to how high the contention on a particular lock is going to be.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From: Terje Mathisen on 15 Sep 2009 15:03

Stephen Fuld wrote:
> Terje Mathisen wrote:
>> Each thread could use a hash of its thread ID to determine which
>> first-level lock to use.
>>
>> With log(n) levels, each lock would only have two users, making the
>> time for each of them constant, and the total becomes O(log(n)).
>
> I like this idea. You could even make it adaptive. Start with a single,
> "primary" lock. If the contention for that lock gets too high (the count
> on the atomic add instruction exceeds some threshold for some time), you
> add a new layer of locks and have all new lock requesters use that
> level. As the processes that are still contending for the primary lock
> based on not using the new layer get their turn, complete their work,
> and release the lock, the system slowly converts to the new two layer

Yeah, this could work quite well. One idea would be to add an additional
level each time the simultaneous users reach a given number, and the
thread which picks this "lucky" number would be responsible for adding a
new layer:

prev = xadd(&lock, 1);
if (prev == 0) { // We won! Do the work
...
}
else {
if (prev == 16) { // Too many simultaneous users, add level
nested_lock = new_lock_array();

wait(prev);
}

> scheme. Similarly, if contention decreases on the secondary locks, you
> could eliminate them. Thus, in periods of low contention, you avoid the

This is much harder to do safely, since you need some form of janitor
process to wait until there are no more users of the secondary lock
array before you can free() it.

> overhead of the second lock request, but avoid the O N**2 problem as
> contention increases. This could even save some analysis up front as to
> how high the contention on a particular lock is going to be.

Just allowing a single additional level would be quite cheap, even if
you never free it up again. Allowing log(n) levels otoh is much more iffy!

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From: Eugene Miya on 15 Sep 2009 21:02

In article <YNmdnS6S7-70-DnXnZ2dnUVZ_hSdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>I've been reading comp.arch off and on for more than 20 years now. In
>the past few years the SNR has deteriorated considerably, and I was
>wondering why. Maybe people who used to post at comp.arch are on other
>forums? Maybe its that I've gotten a little harder to impress? Then I
>thought about the quality of most papers at ISCA and Micro, the fact
>that both EDF and MPF have gone away, and I think the rot is not
>confined to just comp.arch.
>
>So, whats going on? I'm sure part of it is that the latest generation of
>architects is talking at other sites.

net.arch and comp.arch got hit by Win-tel (market force).

Past arch ideas are museum pieces which cause Mash, myself, and others
look for them before they are lost.

Patterson has noted a sort of crisis at ISCA'90. Architectural flanks
and back waters exist, but they will get pulled in (assimilated) given time.

>However, equally important is that there are far fewer of them. The
>number of companies designing processors has gone down and there are
>fewer startups doing processors. So, less architects.
>
>Within those processors there is less architecture (or micro
>architecture) being done; instead, the imperative that clock cycle has
>to be driven down leaves less levels of logic per cycle, which in turn
>means that the "architecture" has to be simpler. So, less to talk about.

KISS.

>There is less low-hanging fruit around; most of the simpler and
>obviously beneficial ideas are known, and most other ideas are more
>complex and harder to explain/utilize.

Yep.

>A larger number of decisions are being driven by the details of the
>process, libraries and circuit families. This stuff is less accessible
>to a non-practitioner, and probably propietary to boot.

Yep. A fact.

>Basically, I think the field has gotten more complicated and less
>accessible to the casual reader (or even the gifted well read amateur).

A lot of people like the idea of computing as a cottage industry.
It goes back to Homebrew.

>The knowledge required of a computer architect have increased to the
>point that its probably impossible to acquire even a *basic* grounding
>in computer architecture outside of actually working in the field
>developing a processor or _possibly_ studying with one of a few PhD
>programs. The field has gotten to the point where it _may_ require
>architects to specialize in different application areas; a lot of the
>skills transfer, but it still requires retraining to move from, say,
>general-purpose processors to GPU design.
>
>I look around and see a handful of guys posting who've actually been
>doing computer architecture. But its a shrinking pool....

I'm not an architect. I've been offered architect jobs. My dad was an
architect of buildings. I represent customers. Some architects use my
database (I know Hennessy does).

>Ah, well - I guess I can always go hang out at alt.folklore.computers.

Just lurking gets what lurkers do.
It's an issue of activity.
Old architectural ideas there.

--

Looking for an H-912 (container).

From: Chris M. Thomasson on 16 Sep 2009 00:27

"Terje Mathisen" <Terje.Mathisen(a)tmsw.no> wrote in message
news:ZJ6dnTKJq8hXdzDXnZ2dnUVZ8sednZ2d(a)lyse.net...
> MitchAlsup wrote:
>> On Sep 10, 10:04 pm, Mayan Moudgill<ma...(a)bestweb.net> wrote:
[...]
>>> But think of the OS/runtime changes that are required to guarantee that
>>> we can use spin-locks efficiently.
>>
>> Don't want spin-locks--I want a means whereby everyone trying to
>> synchronize recognizes their place in line in BigO( ln(x) ) time and
>> then another means whereby they can use their place in line to access
>> the queue (or whatever) without (/with minimal) interfering with the
>> other threads doing likewise.
>
> This is indeed the proper solution.
>
> As I've stated previously, XADD is IMHO the best of the currently
> available primitives, as long as you use the return value as your index
> into the work table: A single access/update per thread to get forward
> progress for all of them, so at least we avoid looping by all those who
> didn't win the first exchange.
[...]

FWIW, I did a bounded multi-producer/consumer queue a while back using
nothing but XADD on the fast-path's. Here is the rough idea:

< pseudo-code typed in newsreader, please forgive typos! ;^) >
________________________________________________________________________
#define DEPTH 256 // power of 2
#define MASK (DEPTH - 1)

void* volatile slots[DEPTH] = { NULL };
signed slots_free = DEPTH; // slots free
signed slots_used = 0; // slots used
unsigned push_idx = 0; // push index
unsigned pop_idx = 0; // pop index
semaphore push_sem = 0; // push waitset
semaphore pop_sem = 0; // pop waitset

void push(void* ptr)
{
if (XADD(&slots_free, -1) < 1)
{
semaphore_wait(&push_sem);
}

unsigned idx = XADD(&push_idx, 1) & MASK;

assert(! slots[idx]);

slots[idx] = ptr;

if (XADD(&slots_used, 1) < 0)
{
semaphore_post(&pop_sem);
}
}

void* pop()
{
if (XADD(&slots_used, -1) < 1)
{
semaphore_wait(&pop_sem);
}

unsigned idx = XADD(&pop_idx, 1) & MASK;

void* ptr = slots[idx];

while (! ptr)
{
yield(); // SwitchToThread(); Sleep(1); asm("PAUSE"); whatever

ptr = slots[idx];
}

slots[idx] = NULL;

if (XADD(&slots_free, 1) < 0)
{
semaphore_post(&push_sem);
}

return ptr;
}
________________________________________________________________________

I hope I did not screw that up. Anyway, thanks to XADD, the algorithm has
100% wait-free fast-paths, which is pretty darn good for any
multi-producer/consumer queue! Heck, it's even loop-free on the slow-paths.
Those are pretty nice properties indeed.

;^)

First | Prev | Next | Last
Pages: 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Prev: VMWare tools killed my Mac OS X ?
Next: Software vs hardware floating-point [was Re: What happened ...]