From: "Andy "Krazy" Glew" on 24 Dec 2009 10:00 Terje Mathisen wrote: >> Since the whole point of this exercise is to try to reduce the overhead >> of cache coherency, but people have demonstrated they don't like the >> consequences semantically, I am trying a different combination: allow A, >> multiple values; allow B weak ordering; but disallow C losing writes. >> >> I possibly that this may be more acceptable and fewer bugs. >> >> I.e. I am suspecting that full cache coherency is overkill, but that >> completely eliminating cache coherency is underkill. > > I agree, and I think most programmers will be happy with word-size > tracking, i.e. we assume all char/byte operations happens on private > memory ranges. Doesn't that fly in the face of the Alpha experience, where originally they did not have byte memory operations, but were eventually forced to? Why? What changed? What is different? Some of the Alpha people have said that the biggest reason was I/O devices from PC-land. Sounds like a special case. I suspect that there is at least some user level parallel code that assumes byte writes are - what is the proper term? Atomic? Non-lossy? Not implemented via a non-atomic RMW? Can we get away with having byte and other sub-word writes? Saying that they may be atomic/non-lossy in cache memory, but not in uncached remote memory. But that word writes are non-lossy in all memory types? Or do we need to having explicit control? >> * This * post, by the way, is composed almost exclusively by speech >> recognition, using the pen for certain trivial edits. It's nice to find >> a way that I can actually compose stuff on a plane again. > > Seems to work better than your wan-based tablet posts! Did you mean "van"? Are you using handwriting or speech recognition? :-) The big problem with these "natural user interfaces" is review. Making it easy to make sure that what was input was actually what was meant. Hmmm... === Happy Xmas, Terje!!!!!
From: "Andy "Krazy" Glew" on 24 Dec 2009 10:09 Terje Mathisen wrote: > Bernd Paysan wrote: >> Terje Mathisen<"terje.mathisen at tmsw.no"> wrote: >>> PS. This is my very first post from my personal leafnode installation: >>> I have free news access via my home (fiber) ISP, but not here in >>> Rauland on Christmas/New Year vacation, so today I finally broke down >>> and installed leafnode on my home FreeBSD gps-based ntp server. :-) >> >> I use leafnode locally for a decade or so now; it does a good job on >> message prefetching, and it also can be used to hide details like where >> my actual news feed is coming from. At the moment my newsreader (I switched back to Thunderbird from Seamonkey) is my only non-cloud based app. The only app that ties me to a particular machine, which I cannot use when that machine is not around. (And the laptop that I read news on is often physically turned off and not accessible to the net, so remote access is not a possibility.) I mean to install a newsreader on my cloud machine. Unfortunately, not allowed to use X. Probably will fall back to Emacs gnus, although that loses HTML and graphics, for newsgroups that aren't in the ascii stoneages like comp.arch. Then I will have remote access... IIRC, gnus will also get me offline access. --- I've been wondering about leafnode ever since Bernd told me about it years ago. But, I don't see any advantage for it on my cloud configuration. Do you? === comp.arch relevancy: not much, except that cloud apps are becoming more and more important, and this sort of issue, for our persona hacking, provides some insight.
From: "Andy "Krazy" Glew" on 24 Dec 2009 10:15 Terje Mathisen wrote: > I think active code/message passing/dataflow is the obvious direction of > all big systems, including everything that needs to run over the internet. > > After all, downloading java applets to the client that knows how to > handle the accompanying server data is one working exam Q: at what level should there be support? There have been past proposals to have active messaging supported in hardware. Not just network hardware, but also CPU hardware. It has been used as one of the arguments for Burton Smith style multithreading, with incoming active messages from the network automatically being allocated a thread slot, with the network carrying hardware comprehended privileges around. Or, as you say, active messaging already can be done in pure software. In some ways Java already is. (Or Forth.) Should we even bother providing any hardware support? --- Somebody used to say in his .sig "All good ideas eventually move from software to hardware." The history of active messages illustrates another historical pattern: sometimes ideas start off in hardware, move to software, and oscillate back and forth.
From: Anne & Lynn Wheeler on 24 Dec 2009 11:36

Terje Mathisen <"terje.mathisen at tmsw.no"> writes:
> Del has already answered, but since I know far less than him about
> IBM systems, I'll try anyway:
>
> As Del said, an IBM mainframe has lots of dedicated slave processors,
> think of them as very generalized DMA engines where you can do stuff
> like:
>
> seek to and read block # 48, load the word at offset 56 in that block
> and compare with NULL: If equal return the block, otherwise use the
> word at offset 52 as the new block number and repeat the process.
>
> I.e. you could implement most operations on most forms of disk-based
> tree structures inside the channel cpu, with no need to interrupt the
> host before everything was done.

re:
http://www.garlic.com/~lynn/2009s.html#18 Larrabee delayed: anyone know what's happening?
http://www.garlic.com/~lynn/2009s.html#20 Larrabee delayed: anyone know what's happening?

but it was all in main processor real storage ... so search operations that
compared on something would constantly be fetching the search argument from
main memory. lots of latency and heavy load on the path. frequently the
channel was supposed to have lots of concurrent activity ... but during a
search operation ... the whole infrastructure was dedicated to that
operation ... & locked out all other operations. Issue was that the design
point was from the early 60s, when I/O resources were really abundant and
real storage was very scarce.

In the late 70s, I would periodically get called into customer situations
(when everybody else had come up dry). late 70s, large national retailer ...
several processors in a loosely-coupled, shared-disk environment ... say
half-dozen regional operations with a processor complex per region ... but
all sharing the same disk with the application program library.

the program library was organized in something called PDS ... and the PDS
directory (of programs) was "scanned" with multi-track search for every
program load. this particular environment had a three "cylinder" PDS
directory ... so avg. depth of search was 1.5 cylinders. These were 3330
drives that spun at 60 revs/sec and had 19 tracks per cylinder. The elapsed
time for a multi-track search of a whole cylinder ran 19/60s of a second ...
during which time the device, (shared) device controller, and (shared)
channel were unavailable for any other operations. The drive with the
application library for the whole complex was peaking out at about six disk
I/Os per second (2/3rds multi-track search of the library PDS directory and
one disk I/O to load the actual program, peak maybe two program loads/sec).

before I knew all this ... I'm brought into a classroom with six-foot-long
student tables ... several of them covered with foot-high piles of paper
printouts of performance data from the half-dozen different systems.
Basically a printout for a specific system with stats showing activity for a
10-15 minute period (processor utilization, i/o counts for individual disks,
other stuff) ... for several days ... starting in the morning and continuing
during the day. Nothing stands out from their description ... just that
thruput degrades enormously under peak load ... when the complex is
attempting to do dozens of program loads/second across the whole operation.

I effectively have to integrate the data from the different processor
complex performance printouts in my head ... and then do the correlation
that a specific drive (out of dozens) is peaking at an (aggregate) 6-7 disk
i/os per second (across all the processors) during periods of poor
performance (takes 30-40 mins). I then get out of them that the drive holds
the application program library for the whole complex, with a three-cylinder
PDS directory. I then explain how the PDS directory works with multi-track
search ... and that the whole complex is limited to two program loads/sec.

The design trade-off was based on the environment of the early 60s ... and
was obsolete by the mid-70s ... when real storage was starting to get
abundant enough that the library directory could be cached in real storage
... so you didn't have to rescan the disk for every program load.

lots of past posts mentioning CKD DASD (disk) should have moved away from
multi-track search several decades ago
http://www.garlic.com/~lynn/submain.html#dasd

other posts about getting to play disk engineer in bldgs 14&15
http://www.garlic.com/~lynn/subtopic.html#disk

the most famous was ISAM channel programs ... which could go thru things
like multi-level indexes ... with "self-modifying" channel programs ...
where an operation would read into real storage the seek/search argument(s)
for following channel commands (in the same channel program).

ISAM resulted in heartburn for the real->virtual transition. Channel
programs all involved "real" addresses. For virtual machine operation ... it
required a complete scan of the "virtual" channel program, making a "shadow"
... that then had real addresses (in place of the virtual addresses), and
executing the "shadow" program. Also, seek arguments might need to be
translated in the shadow (so the channel program that was actually being
executed no longer referred to the address where the self-modifying
arguments were being stored).

The old-time batch operating system ... with limited real storage ... also
had the convention that channel programs were built in the application space
... and passed to the kernel for execution. In their transition from the
real to the virtual storage environment ... they found themselves faced with
the same translation requirement faced by the virtual machine operating
systems. In fact, they started out by borrowing the channel program
translation routine from the virtual machine operating system.

-- 
40+yrs virtualization experience (since Jan68), online at home since Mar1970
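As a rough check on the numbers in the post (assuming one disk revolution
per track searched, and ignoring seek time and the transfer time of the
member read itself), the arithmetic can be worked through in a few lines:

/* Back-of-the-envelope check of the numbers in the post (assumptions:
 * 3330 at 60 rev/s, 19 tracks/cylinder, one revolution per track searched,
 * 3-cylinder PDS directory so the average search covers 1.5 cylinders).  */
#include <stdio.h>

int main(void)
{
    double revs_per_sec      = 60.0;
    double tracks_per_cyl    = 19.0;
    double full_cyl_search_s = tracks_per_cyl / revs_per_sec;  /* ~0.317 s */
    double avg_search_s      = 1.5 * full_cyl_search_s;        /* ~0.475 s */

    /* Each program load = searches covering 1.5 cylinders on average,
     * i.e. roughly two multi-track search I/Os, plus one read to fetch
     * the member itself: about three disk I/Os in roughly half a second. */
    double loads_per_sec = 1.0 / avg_search_s;                 /* ~2.1     */
    double ios_per_sec   = 3.0 * loads_per_sec;                /* ~6.3     */

    printf("full-cylinder search: %.3f s\n", full_cyl_search_s);
    printf("average directory search: %.3f s\n", avg_search_s);
    printf("program loads/sec (upper bound): %.1f\n", loads_per_sec);
    printf("disk I/Os/sec at that rate: %.1f\n", ios_per_sec);
    return 0;
}

which is consistent with the observed peak of about six disk I/Os per second
and roughly two program loads per second for the whole complex.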
From: Robert Myers on 24 Dec 2009 12:55
On Dec 24, 8:33 am, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
> Robert Myers wrote:
> > On Dec 23, 12:21 pm, Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
> >> Why do I feel that this feels a lot like IBM mainframe channel
> >> programs? :-)
> >
> > Could I persuade you to take time away from your first love
> > (programming your own computers, of course) to elaborate/pontificate a
> > bit? After forty years, I'm still waiting for someone to tell me
> > something interesting about mainframes. Well, other than that IBM bet
> > big and won big on them.
> >
> > And CHANNELS. Well. That's clearly like the number 42.
>
> Del has already answered, but since I know far less than him about IBM
> systems, I'll try anyway:
>
> As Del said, an IBM mainframe has lots of dedicated slave processors,
> think of them as very generalized DMA engines where you can do stuff like:
>
> seek to and read block # 48, load the word at offset 56 in that block
> and compare with NULL: If equal return the block, otherwise use the word
> at offset 52 as the new block number and repeat the process.
>
> I.e. you could implement most operations on most forms of disk-based tree
> structures inside the channel cpu, with no need to interrupt the host
> before everything was done.

Thanks for that detailed reply, Terje. I tend to think there is something
important I am still missing. Maybe my lack of appreciation of mainframes
comes from never having been involved in the slightest with the history,
except briefly as a frustrated and annoyed user.

I can't see anything about channels that you can't do with modern PC I/O.
You send "stuff" to a peripheral device and it does something with it. It's
generally up to the peripheral to know what to do with the "stuff," whatever
it is. In the case of a TCP/IP offload engine, the what-to-do could be quite
complicated.

The things you *can't* do mostly seem to have to do with not doing I/O in
userland, so that the programmability of anything is never exposed to you
unless you are writing a driver. Is that the point? With a mainframe
channel, a user *could* program the I/O devices, at least to some extent.

Sorry if I seem obtuse.

Robert.
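To make the quoted example concrete, here is roughly what that traversal
looks like written as ordinary host-side code; the in-memory "disk",
read_block(), and the fixed offsets are either taken from Terje's example or
invented around it, not any real device interface. The mainframe point is
that this whole loop can be expressed as a chained channel program and run
in the channel, so the host takes a single interrupt once the wanted block
is in memory; on a commodity system the loop either costs one interrupt (or
polling pass) per block, or equivalent logic has to be pushed into the
device's own firmware.

/* A sketch of the traversal from the example above, written as ordinary
 * host-side code.  The "disk" here is just an in-memory array so the
 * program runs standalone; block size, offsets 48/56/52, and the starting
 * block number follow the example, not any real device interface.        */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 512
#define NUM_BLOCKS 64

static uint8_t disk[NUM_BLOCKS][BLOCK_SIZE];   /* stand-in for the device  */

/* Hypothetical device read: one I/O (and, on a PC, typically one
 * interrupt) per call.  This per-block round trip is exactly what a
 * channel program keeps away from the host CPU.                           */
static int read_block(uint32_t blkno, uint8_t buf[BLOCK_SIZE])
{
    if (blkno >= NUM_BLOCKS)
        return -1;
    memcpy(buf, disk[blkno], BLOCK_SIZE);
    return 0;
}

/* Follow the chain starting at block 48: if the word at offset 56 is 0
 * (NULL), this is the block we want; otherwise the word at offset 52 is
 * the next block number.                                                  */
static int find_block(uint8_t out[BLOCK_SIZE])
{
    uint32_t blkno = 48;
    for (;;) {
        if (read_block(blkno, out) != 0)
            return -1;                         /* I/O error                */
        uint32_t key, next;
        memcpy(&key,  out + 56, sizeof key);   /* word at offset 56        */
        memcpy(&next, out + 52, sizeof next);  /* word at offset 52        */
        if (key == 0)
            return (int)blkno;                 /* found: block is in out   */
        blkno = next;                          /* chase the chain          */
    }
}

int main(void)
{
    /* Build a three-block chain: 48 -> 10 -> 7, with block 7 terminal.    */
    uint32_t next, key = 1;
    next = 10; memcpy(disk[48] + 52, &next, 4); memcpy(disk[48] + 56, &key, 4);
    next = 7;  memcpy(disk[10] + 52, &next, 4); memcpy(disk[10] + 56, &key, 4);
    /* block 7: word at offset 56 left as 0 => terminal                    */

    uint8_t buf[BLOCK_SIZE];
    printf("chain ends at block %d\n", find_block(buf));   /* prints 7     */
    return 0;
}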