From: Tobias Burnus on 14 Jul 2010 17:54

On 14.07.2010 22:23, nmm1(a)cam.ac.uk wrote:
> In article <1jlmjr4.1h9muj0pihqb1N%see(a)sig.for.address>,
> Victor Eijkhout <see(a)sig.for.address> wrote:
>>
>>>>> OpenMP uses a "shared memory" model, which is harder to implement
>>>>> on a cluster architecture. But it has been done too.
>>>>
>>>> What are you thinking of?
>>>
>>> Intel Cluster Tools.
>>
>> Hm. I watched a video on the Intel site, and there is no hint of
>> distributed shared memory. (MPI, tracers, math libraries, but nothing
>> deeper.)
>>
>> http://software.intel.com/en-us/intel-cluster-toolkit/
>>
>> Can you give me a more specific pointer?

See: http://software.intel.com/en-us/articles/cluster-openmp-for-intel-compilers/
"Cluster OpenMP is now included with version 11 of the Intel compilers."
However, one seems to need a special licence.

Tobias
From: gmail-unlp on 16 Jul 2010 21:08

Just a few thoughts:

1) Writing parallel programs while forgetting (parallel) performance
issues is a problem. OpenMP makes it easy, in a way, to forget
important performance details such as pipelining, the memory hierarchy,
cache coherence, etc. However, if you remember that you are
parallelizing to improve performance, I think you will not forget the
performance penalties, and you will implicitly or explicitly optimize
data traffic, for example.

2) If you have a legacy application of more than a few thousand lines,
you will probably start with OpenMP because of the large amount of work
required to re-code the same application with MPI. However, it is
likely you will have to learn MPI anyway, or at least keep
distributed-memory parallel architectures in mind, if you need to
process more data, or more accurately, or more...

3) Extending shared memory across a distributed-memory architecture
would help in keeping the shared-memory model (i.e. keeping OpenMP),
but I think the risk of hiding strong performance penalties is too
high... I don't know much about such offerings from Intel; is there
some help (a tool or methodology) for analyzing and solving performance
issues?

4) I'm rather convinced that MPI is "the best" in the long term when
the focus is performance, but I'm working with a legacy application of
about 100k lines of (sequential) code, so I understand those suggesting
OpenMP: I'm using OpenMP myself, and looking for ways to distribute the
data for computing on a distributed-memory architecture. If I had to
program from scratch, I would use MPI from the beginning.

5) I suggest learning both OpenMP and MPI, and not only to have all the
options for making clear (the best) choices; both are simple enough to
learn, at least well enough to know the focus and ideas of each one. I
suggest looking for tutorials (maybe two or three of each) and
following them carefully. Again: you will not learn all of the details,
but you will learn the interesting ones needed to make your own
choices. Neither OpenMP nor MPI is a big deal for scientific
programmers. (Minimal sketches of both follow this post.)

6) An alternative way: just identify BLAS and/or LAPACK
subroutines/functions in your code and call shared- and/or
distributed-memory libraries to do the computing. Both libraries are
implemented for shared-memory as well as distributed-memory parallel
computing.

Hope this helps,
Fernando.
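As a minimal sketch of the OpenMP side of point 5, assuming a Fortran
compiler with OpenMP support (e.g. gfortran -fopenmp); the program name
and sizes are just illustrative. A single directive splits the loop
across threads that all share the arrays x and y:

program axpy_omp
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  real :: a
  real, allocatable :: x(:), y(:)
  allocate(x(n), y(n))
  x = 1.0
  y = 2.0
  a = 3.0
  ! Ask the runtime to divide the iterations among the available
  ! threads; each y(i) is updated by exactly one thread.
  !$omp parallel do
  do i = 1, n
     y(i) = y(i) + a * x(i)
  end do
  !$omp end parallel do
  print *, 'y(1) =', y(1)
end program axpy_omp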
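And the MPI side, as an equally minimal sketch (assuming an MPI
library and its compiler wrapper, e.g. mpif90, run with something like
mpirun -np 4 ./a.out): each rank owns its data privately and ranks only
see each other through explicit communication, here a reduction onto
rank 0.

program hello_mpi
  use mpi
  implicit none
  integer :: ierr, rank, nprocs
  integer :: mine(1), total(1)
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  ! Each rank contributes one local value; MPI_Reduce sums them on rank 0.
  mine(1) = rank + 1
  call MPI_Reduce(mine, total, 1, MPI_INTEGER, MPI_SUM, 0, &
                  MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, 'sum over', nprocs, 'ranks:', total(1)
  call MPI_Finalize(ierr)
end program hello_mpi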
From: sturlamolden on 17 Jul 2010 14:01

On 17 Jul, 03:08, gmail-unlp <ftine...(a)gmail.com> wrote:

> 1) Writing parallel programs while forgetting (parallel) performance
> issues is a problem. OpenMP makes it easy, in a way, to forget
> important performance details such as pipelining, the memory
> hierarchy, cache coherence, etc. However, if you remember that you
> are parallelizing to improve performance, I think you will not forget
> the performance penalties, and you will implicitly or explicitly
> optimize data traffic, for example.

We should not forget that OpenMP is often used on "multi-core
processors". These are rather primitive parallel devices; the cores
share cache, for example. Data traffic due to OpenMP can therefore be
minimal, because a dirty cache line need not be communicated. So if the
target is a common desktop computer with a quad-core Intel or AMD CPU,
OpenMP can be perfectly fine, and that is the common desktop computer
these days. For small-scale parallelization on modern desktop
computers, OpenMP can be very good. But on large servers with multiple
processors, OpenMP can generate excessive data traffic and scale very
badly.

> 6) An alternative way: just identify BLAS and/or LAPACK
> subroutines/functions in your code and call shared- and/or
> distributed-memory libraries to do the computing.

This is very important. GotoBLAS and Intel MKL have BLAS and LAPACK
optimized for SMP servers. FFTW and MKL have parallel FFTs.

But look at the majority of today's 'system developers': they hardly
know any math, neither linear algebra nor calculus. They would not
recognize a linear system of equations or a convolution if they saw
one. So why would they use LAPACK or an FFT? A website for Norwegian IT
specialists (digi.no) once had a quiz that claimed LAPACK is a program
for "testing the speed of computers". They are on a different planet.

The sad part is that if we scientists want programs that run fast, we
have to write them ourselves. Those who are educated to write software
are too often not up to it, even with a computer, nor do they
understand the problems they are asked to solve. But scientists who
write computer programs are not educated to do so, and it is not the
main focus of our jobs.

P.S. It is a common misconception, particularly among computer science
scholars, that "shared memory" means no data traffic, and that threads
are therefore better than processes. That is, they can see that IPC has
a cost, and conclude that threads must be more efficient and scale
better. The lack of a native fork() on Windows has also taught many of
them to think in terms of threads rather than processes. The use of MPI
seems to be limited to scientists and engineers; the majority of
computer scientists do not even know what it is. Concurrency to them
means threads, and in particular C++ classes that wrap threads. Many of
them expect I/O-bound programs that use threads to be faster on
multi-core computers, and they wonder why parallel programming is so
hard.
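A sketch of what "identify the BLAS calls" amounts to in practice,
assuming a threaded BLAS (GotoBLAS, OpenBLAS or MKL) is linked in at
build time; the program itself stays serial, and the library
parallelizes the matrix product across the cores:

program gemm_example
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000
  real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  c = 0.0_dp
  ! C := 1.0*A*B + 0.0*C; a multithreaded BLAS runs this on all cores.
  call dgemm('N', 'N', n, n, n, 1.0_dp, a, n, b, n, 0.0_dp, c, n)
  print *, 'c(1,1) =', c(1,1)
end program gemm_example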
From: sturlamolden on 17 Jul 2010 14:40
On 17 Jul, 20:01, sturlamolden <sturlamol...(a)yahoo.no> wrote:

> Concurrency to them means threads, and in particular C++ classes that
> wrap threads. Many of them expect I/O-bound programs that use threads
> to be faster on multi-core computers, and they wonder why parallel
> programming is so hard.

This is, for example, a common complaint about Python's GIL (global
interpreter lock) on comp.lang.python:

- Since Python's interpreter has a global lock, Python programs cannot
  exploit multi-core computers.

The common answer to this is:

- You don't get a faster network connection by using multiple
  processors.

This is too hard for most IT developers to understand. But if they do
understand it, we can ask them this instead:

- Why do you accept the 100x speed penalty of using Python, but
  complain about not being allowed to use more than one core?

If they have a reasonable answer to this as well, such as hating C++
immensely, we can tell them the real story:

- Any mutex (like Python's GIL) can be released. Python threads that
  are not using the interpreter can run simultaneously (e.g. they might
  be waiting for I/O or for a library call to return). Libraries can
  use as many threads as they want internally. And processes can of
  course be spawned and forked.

It is really sad to see how badly educated many so-called "IT
specialists" actually are. If we ask them to solve a problem, chances
are they will spend all their time writing yet another web XML
framework in C#, without even touching the real problem.