From: Anton Ertl on 23 Nov 2009 15:49

Mayan Moudgill <mayan(a)bestweb.net> writes:
>Anton Ertl wrote:
>
>> To avoid these overheads and to achieve load balancing, I have been
>> thinking about an implementation that automatically switches between
>> communicating between threads and using coroutining for efficient
>> communication within a thread.  Currently these are just ideas, but I
>> have written them down in
>> <http://www.complang.tuwien.ac.at/kps09/pdfs/ertl.pdf>
>
>Did a quick read of the paper. My summary [Correct me if I'm wrong]

No correction needed.

>Kernel scheduling
>5. If kernel Ki pipes information to a Ki+1 on the same core, it may
>yield (using coroutine semantics) to Ki+1 after writing the information

The way I envision it, it yields, eliminating the need for buffering
between the tasks (kernels) in this case; a sketch of such a hand-off
follows.
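To make the hand-off concrete, here is a minimal sketch using the
POSIX ucontext API.  The one-element mailbox stands in for the buffer
that the yield makes unnecessary; the stack sizes and the toy
workload are illustrative choices, not anything from the paper.

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, prod_ctx, cons_ctx;
static int mailbox;                        /* one-element "buffer" */

static void producer(void)
{
    int i;
    for (i = 0; i < 3; i++) {
        mailbox = i * i;                   /* write the information ... */
        swapcontext(&prod_ctx, &cons_ctx); /* ... then yield to the consumer */
    }
}                                          /* returning resumes uc_link (main) */

static void consumer(void)
{
    for (;;) {
        printf("consumer got %d\n", mailbox);
        swapcontext(&cons_ctx, &prod_ctx); /* yield back for the next item */
    }
}

int main(void)
{
    static char pstack[64 * 1024], cstack[64 * 1024];

    getcontext(&prod_ctx);
    prod_ctx.uc_stack.ss_sp = pstack;
    prod_ctx.uc_stack.ss_size = sizeof pstack;
    prod_ctx.uc_link = &main_ctx;          /* producer's return resumes main */
    makecontext(&prod_ctx, producer, 0);

    getcontext(&cons_ctx);
    cons_ctx.uc_stack.ss_sp = cstack;
    cons_ctx.uc_stack.ss_size = sizeof cstack;
    cons_ctx.uc_link = &main_ctx;
    makecontext(&cons_ctx, consumer, 0);

    swapcontext(&main_ctx, &prod_ctx);     /* run the pipeline */
    return 0;
}

Here the yield is written out explicitly at the point of
communication, which is where the scheme described above would
perform it.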
>Critique:
>1. Only works for pipelines; it doesn't work with more complicated
>graphs such as DAGs.

Does this refer to the load balancing?  It would certainly get more
complicated with DAGs.  However, I don't think that DAGs would be
used much even if they were supported.

>2. POSIX pipes are byte-streams of potentially unbounded length (i.e.
>you can write an arbitrary number of bytes before the first one is read).

Normally a buffer (typical size: 4KB) is associated with a Unix pipe,
and once that buffer is full, the writer blocks.
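This bound is easy to observe: switch the write end of a pipe to
non-blocking mode and count how many bytes go in before write()
reports EAGAIN.  The exact capacity varies by system and kernel
version; the program below just measures whatever the local bound is.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    char byte = 0;
    long total = 0;

    if (pipe(fds) != 0) {
        perror("pipe");
        return 1;
    }
    /* In non-blocking mode, write() fails with EAGAIN instead of
       blocking, so we can count the pipe's capacity. */
    fcntl(fds[1], F_SETFL, fcntl(fds[1], F_GETFL) | O_NONBLOCK);
    while (write(fds[1], &byte, 1) == 1)
        total++;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        printf("pipe buffer holds %ld bytes\n", total);
    return 0;
}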
> a. You probably want a stronger type system on the data

That's up to the language designer.  I would expect that if you add
such a feature to a statically type-checked language, you would also
statically type-check the communication over a channel.  That's what
programmers in such a language expect, want, and would miss if the
feature were added without such type checking.

Personally, I feel that the lack of static type checking and the
uniformity of data in files and pipes in Unix is a strength that
allows the reuse of filters for many tasks.

> b. Do you want to bound the buffer size?

No bound is guaranteed, but there has to be a bound because of (at
least) hardware constraints (e.g., address space) and practical
considerations (such as wanting to process the data while it is warm
in some cache rather than letting it chill out into main memory and
later having to drag it back in).

>3. You're going to have to worry about copying the data at least once
>and probably twice (data gets created, copied into buffer, data gets
>copied from buffer, data gets used).

My idea is that (on a shared-memory machine) the writing task stores
the data directly into the buffer, and later the reading task fetches
the data from exactly the same place.  The tasks may copy the data
from/to another buffer, or the data may come directly from a
computation, or be used directly in a computation; that's up to the
tasks.  There will of course be an implicit copy of the cache line
from the writing task's core to the reading task's core, but no
explicit copy is necessary (the sketch at the end of this message
shows such a buffer).

>It is possible to come up with
>implementations of 2b that, if the data is not copied, can lead to
>lockup.

Do you have a particular scenario in mind?

>4. The overhead of coroutining is not the state save, it's the loss of
>locality of the caching and prediction structures.

Yes, that's a potential problem.  We will have to see if it's a real
problem (and if so, maybe the hardware designers can come up with
prediction structures that also work well under these circumstances).

Why I think that it might be no problem: Today nearly all programs
are sequential.  For dealing with problems that can be divided into
pipelines, they often have quite convoluted control flow.  Now if we
divide such a program (part) into pipeline stages, each stage will be
simpler and smaller, and maybe more predictable.  If we then combine
some of these stages back together on one core using coroutining,
they will still not be as bad as the sequential program; and the
hardware was designed to perform well on the sequential program.

There is one exception: the return stack (predicting function return
addresses) of current machines will not work well when coroutining is
involved.  Let's see it as an opportunity for computer architects to
come up with a better return predictor.

>5. What if the kernels have different data sizes (e.g. kernel A produces
>data in chunks of 512 and the next one, B, processes it in a chunk of
>2K; your workload will be AbAbAbAB where b is kernel B reading the 512B,
>and then blocking waiting for the next read).

That's for the case when the tasks/kernels are on different cores,
right?  My idea is that the load balancing will try to avoid cases
where a buffer is completely empty or completely full.  But yes, if
it cannot avoid that, one of the threads involved will block.

>My main objection is (1) - I don't expect that pipelines
>are going to be a very interesting case.

I think there are several reasons why pipes are interesting:

* They help modularity (most other parallel and concurrent
  programming practices are at odds with modularity).

* They encourage reusable components (filters).

* They are so useful that programmers have used pipelines even when
  they did not intend to exploit parallelism (in Unix shell scripts).

* Not much synchronization is necessary; when the buffer is neither
  empty nor full, you get very cheap communication.  And you can make
  the buffer larger to reduce the expensive cases.
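To make the no-copy claim from point 3 and the cheap common case from
the last point concrete, here is a minimal sketch of a bounded
single-producer, single-consumer buffer using C11 atomics.  The
buffer size and names are illustrative assumptions.  When the buffer
is neither empty nor full, a put or get is one store or load into the
shared buffer plus two index operations: no lock, no extra copy.

#include <stdatomic.h>
#include <stddef.h>

#define BUF_SLOTS 1024   /* power of two, so index arithmetic survives
                            counter wraparound */

typedef struct {
    int slots[BUF_SLOTS];
    atomic_size_t head;  /* next slot the reader will consume */
    atomic_size_t tail;  /* next slot the writer will fill */
} channel;               /* must start zero-initialized, e.g. static channel ch; */

/* Returns 0 if the buffer is full (caller blocks or yields). */
static int try_put(channel *c, int value)
{
    size_t tail = atomic_load_explicit(&c->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&c->head, memory_order_acquire);
    if (tail - head == BUF_SLOTS)
        return 0;                           /* full */
    c->slots[tail % BUF_SLOTS] = value;     /* store directly into buffer */
    atomic_store_explicit(&c->tail, tail + 1, memory_order_release);
    return 1;
}

/* Returns 0 if the buffer is empty (caller blocks or yields). */
static int try_get(channel *c, int *value)
{
    size_t head = atomic_load_explicit(&c->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&c->tail, memory_order_acquire);
    if (head == tail)
        return 0;                           /* empty */
    *value = c->slots[head % BUF_SLOTS];    /* fetch from the same place */
    atomic_store_explicit(&c->head, head + 1, memory_order_release);
    return 1;
}

A put that finds the buffer full (or a get that finds it empty) is
where the scheme described above would block the thread or, within
one core, yield to the other coroutine.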
- anton
--
M. Anton Ertl                        Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at    Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html