From: alex-lurk on 11 Dec 2009 17:01

I have written a simple Fortran test program to check the performance of OpenMP. Below you can see the source code without OpenMP. This test program (without OpenMP) needs about 35 seconds.

----------------------------------------
C      PROGRAM OPENMP
C      IMPLICIT NONE
      INTEGER TICK, STARTTIME, STOPTIME, TIME
      INTEGER N, I,
     1        I1, I2, I3, I4,
     2        J1, J2, J3, J4
      PARAMETER (N=10000000)
      REAL A1(N), A2(N), A3(N)
      REAL B1(N), B2(N), B3(N)
      REAL C1(N), C2(N), C3(N)
      REAL D1(N), D2(N), D3(N)
      REAL parallel_time_begin, parallel_time_end
      REAL section1_time_begin, section1_time_end
      REAL section2_time_begin, section2_time_end
      REAL section3_time_begin, section3_time_end
      REAL section4_time_begin, section4_time_end
      real sum

      PRINT *, '----- Serial Start -----'
      CALL SYSTEM_CLOCK(COUNT_RATE = TICK)
      CALL SYSTEM_CLOCK (COUNT = STARTTIME)

! Some initializations
      DO I = 1, N
        A1(I) = I + 1.5
        A2(I) = I + 22.35
        B1(I) = I + 1.5
        B2(I) = I + 22.35
        C1(I) = I + 1.5
        C2(I) = I + 22.35
        D1(I) = I + 1.5
        D2(I) = I + 22.35
      ENDDO

      PRINT *, '***** Serial Start *****'

      PRINT *, '***** 1. Section Start'
      DO J1 = 1, 400
        DO I1 = 1, N
          A3(I1) = A1(I1) + A2(I1)
        ENDDO
      ENDDO
      PRINT *, '***** 1. Section End'

      PRINT *, '***** 2. Section Start'
      DO J2 = 1, 400
        DO I2 = 1, N
          B3(I2) = B1(I2) + B2(I2)
        ENDDO
      ENDDO
      PRINT *, '***** 2. Section End'

      PRINT *, '***** 3. Section Start'
      DO J3 = 1, 400
        DO I3 = 1, N
          C3(I3) = C1(I3) + C2(I3)
        ENDDO
      ENDDO
      PRINT *, '***** 3. Section End'

      PRINT *, '***** 4. Section Start'
      DO J4 = 1, 400
        DO I4 = 1, N
          D3(I4) = D1(I4) + D2(I4)
        ENDDO
      ENDDO
      PRINT *, '***** 4. Section End'

      sum = 0
      do i4 = 1,N
        sum = sum + a3(i4) + b3(i4) + c3(i4) + d3(i4)
      enddo
      print*,'Sum = ',sum

      PRINT *, '***** Serial End *****'
      CALL SYSTEM_CLOCK (COUNT = STOPTIME)
      TIME = REAL(STOPTIME-STARTTIME) / REAL(TICK)
      PRINT *, '>>>>> time of Serial was ',
     1  TIME, ' seconds <<<<<'
      PRINT *, '----- Serial End -----'
      END
----------------------------------------
----------------------------------------

Now I have parallelized this test program with OpenMP in 3 ways.

1.) With the directive SECTIONS
I have divided the program into 4 sections. The number of threads is 4. This test program needs about 34 seconds. Here is the code used:

----------------------------------------
C      PROGRAM OPENMP
C      IMPLICIT NONE
      REAL omp_get_wtime
      INTEGER TICK, STARTTIME, STOPTIME, TIME
      INTEGER N, I,
     1        I1, I2, I3, I4,
     2        J1, J2, J3, J4
      PARAMETER (N=10000000)
      REAL A1(N), A2(N), A3(N)
      REAL B1(N), B2(N), B3(N)
      REAL C1(N), C2(N), C3(N)
      REAL D1(N), D2(N), D3(N)
      REAL parallel_time_begin, parallel_time_end
      REAL section1_time_begin, section1_time_end
      REAL section2_time_begin, section2_time_end
      REAL section3_time_begin, section3_time_end
      REAL section4_time_begin, section4_time_end
      INTEGER NTHREADS
      real sum

      PRINT *, '----- Parallel start -----'
      CALL SYSTEM_CLOCK(COUNT_RATE = TICK)
      CALL SYSTEM_CLOCK (COUNT = STARTTIME)

! Some initializations
      DO I = 1, N
        A1(I) = I + 1.5
        A2(I) = I + 22.35
        B1(I) = I + 1.5
        B2(I) = I + 22.35
        C1(I) = I + 1.5
        C2(I) = I + 22.35
        D1(I) = I + 1.5
        D2(I) = I + 22.35
      ENDDO

      NTHREADS = 4

      PRINT *, '***** Parallel Start *****'
      parallel_time_begin = omp_get_wtime()
      CALL omp_set_num_threads(NTHREADS)

C$OMP SECTIONS

C$OMP SECTION
      PRINT *, '***** 1. Section Start'
      section1_time_begin = omp_get_wtime()
      DO J1 = 1, 400
        DO I1 = 1, N
          A3(I1) = A1(I1) + A2(I1)
        ENDDO
      ENDDO
      section1_time_end = omp_get_wtime()
      PRINT *, '====> Time of 1. Section was ',
     1  section1_time_end - section1_time_begin, ' seconds <===='
      PRINT *, '***** 1. Section End'

C$OMP SECTION
      PRINT *, '***** 2. Section Start'
      section2_time_begin = omp_get_wtime()
      DO J2 = 1, 400
        DO I2 = 1, N
          B3(I2) = B1(I2) + B2(I2)
        ENDDO
      ENDDO
      section2_time_end = omp_get_wtime()
      PRINT *, '====> Time of 2. Section was ',
     1  section2_time_end - section2_time_begin, ' seconds <===='
      PRINT *, '***** 2. Section End'

C$OMP SECTION
      PRINT *, '***** 3. Section Start'
      section3_time_begin = omp_get_wtime()
      DO J3 = 1, 400
        DO I3 = 1, N
          C3(I3) = C1(I3) + C2(I3)
        ENDDO
      ENDDO
      section3_time_end = omp_get_wtime()
      PRINT *, '====> Time of 3. Section was ',
     1  section3_time_end - section3_time_begin, ' seconds <===='
      PRINT *, '***** 3. Section End'

C$OMP SECTION
      PRINT *, '***** 4. Section Start'
      section4_time_begin = omp_get_wtime()
      DO J4 = 1, 400
        DO I4 = 1, N
          D3(I4) = D1(I4) + D2(I4)
        ENDDO
      ENDDO
      section4_time_end = omp_get_wtime()
      PRINT *, '====> Time of 4. Section was ',
     1  section4_time_end - section4_time_begin, ' seconds <===='
      PRINT *, '***** 4. Section End'

C$OMP END SECTIONS NOWAIT

      sum = 0
      do i4 = 1,n
        sum = sum + A3(i4) + B3(i4) + C3(i4) + D3(i4)
      enddo
      print*,'Sum = ',sum

      parallel_time_end = omp_get_wtime()
      PRINT *, '====> Time of Parallel was ',
     1  parallel_time_end - parallel_time_begin, ' seconds <===='
      PRINT *, '***** Parallel end *****'

      CALL SYSTEM_CLOCK (COUNT = STOPTIME)
      TIME = REAL(STOPTIME-STARTTIME) / REAL(TICK)
      PRINT *, '>>>>> time of Parallel was ',
     1  TIME, ' seconds <<<<<'
      PRINT *, '----- Parallel End -----'
      END
----------------------------------------
----------------------------------------

2.) With the directive PARALLEL SECTIONS
I have again divided the program into 4 sections, but this time I have also used the directive PARALLEL. The number of threads is 4. This test program needs about 12 seconds.
Here is the code used:

----------------------------------------
C      PROGRAM OPENMP
C      IMPLICIT NONE
      REAL omp_get_wtime
      INTEGER TICK, STARTTIME, STOPTIME, TIME
      INTEGER N, I,
     1        I1, I2, I3, I4,
     2        J1, J2, J3, J4
      PARAMETER (N=10000000)
      REAL A1(N), A2(N), A3(N)
      REAL B1(N), B2(N), B3(N)
      REAL C1(N), C2(N), C3(N)
      REAL D1(N), D2(N), D3(N)
      REAL parallel_time_begin, parallel_time_end
      REAL section1_time_begin, section1_time_end
      REAL section2_time_begin, section2_time_end
      REAL section3_time_begin, section3_time_end
      REAL section4_time_begin, section4_time_end
      INTEGER NTHREADS
      real sum

      PRINT *, '----- Parallel start -----'
      CALL SYSTEM_CLOCK(COUNT_RATE = TICK)
      CALL SYSTEM_CLOCK (COUNT = STARTTIME)

! Some initializations
      DO I = 1, N
        A1(I) = I + 1.5
        A2(I) = I + 22.35
        B1(I) = I + 1.5
        B2(I) = I + 22.35
        C1(I) = I + 1.5
        C2(I) = I + 22.35
        D1(I) = I + 1.5
        D2(I) = I + 22.35
      ENDDO

      NTHREADS = 4

      PRINT *, '***** Parallel Start *****'
      parallel_time_begin = omp_get_wtime()
      CALL omp_set_num_threads(NTHREADS)

C$OMP PARALLEL SECTIONS

C$OMP SECTION
      PRINT *, '***** 1. Section Start'
      section1_time_begin = omp_get_wtime()
      DO J1 = 1, 400
        DO I1 = 1, N
          A3(I1) = A1(I1) + A2(I1)
        ENDDO
      ENDDO
      section1_time_end = omp_get_wtime()
      PRINT *, '====> Time of 1. Section was ',
     1  section1_time_end - section1_time_begin, ' seconds <===='
      PRINT *, '***** 1. Section End'

C$OMP SECTION
      PRINT *, '***** 2. Section Start'
      section2_time_begin = omp_get_wtime()
      DO J2 = 1, 400
        DO I2 = 1, N
          B3(I2) = B1(I2) + B2(I2)
        ENDDO
      ENDDO
      section2_time_end = omp_get_wtime()
      PRINT *, '====> Time of 2. Section was ',
     1  section2_time_end - section2_time_begin, ' seconds <===='
      PRINT *, '***** 2. Section End'

C$OMP SECTION
      PRINT *, '***** 3. Section Start'
      section3_time_begin = omp_get_wtime()
      DO J3 = 1, 400
        DO I3 = 1, N
          C3(I3) = C1(I3) + C2(I3)
        ENDDO
      ENDDO
      section3_time_end = omp_get_wtime()
      PRINT *, '====> Time of 3. Section was ',
     1  section3_time_end - section3_time_begin, ' seconds <===='
      PRINT *, '***** 3. Section End'

C$OMP SECTION
      PRINT *, '***** 4. Section Start'
      section4_time_begin = omp_get_wtime()
      DO J4 = 1, 400
        DO I4 = 1, N
          D3(I4) = D1(I4) + D2(I4)
        ENDDO
      ENDDO
      section4_time_end = omp_get_wtime()
      PRINT *, '====> Time of 4. Section was ',
     1  section4_time_end - section4_time_begin, ' seconds <===='
      PRINT *, '***** 4. Section End'

C$OMP END PARALLEL SECTIONS

      sum = 0
      do i4 = 1,n
        sum = sum + A3(i4) + B3(i4) + C3(i4) + D3(i4)
      enddo
      print*,'Sum = ',sum

      parallel_time_end = omp_get_wtime()
      PRINT *, '====> Time of Parallel was ',
     1  parallel_time_end - parallel_time_begin, ' seconds <===='
      PRINT *, '***** Parallel end *****'

      CALL SYSTEM_CLOCK (COUNT = STOPTIME)
      TIME = REAL(STOPTIME-STARTTIME) / REAL(TICK)
      PRINT *, '>>>>> time of Parallel was ',
     1  TIME, ' seconds <<<<<'
      PRINT *, '----- Parallel End -----'
      END
----------------------------------------
----------------------------------------

3.) With the directive PARALLEL
Here I have parallelized the 4 double DO loops. For this test program I have worked with several thread counts:
- For 1 thread the test program needs about 19 seconds.
- For 2 and 3 threads it needs about 22 seconds.
- And for 4 threads it needs about 23 seconds.

Here is the code used:

----------------------------------------
C      PROGRAM OPENMP
C      IMPLICIT NONE
      REAL omp_get_wtime
      INTEGER TICK, STARTTIME, STOPTIME, TIME
      INTEGER N, I,
     1        I1, I2, I3, I4,
     2        J1, J2, J3, J4
      PARAMETER (N=10000000)
      REAL A1(N), A2(N), A3(N)
      REAL B1(N), B2(N), B3(N)
      REAL C1(N), C2(N), C3(N)
      REAL D1(N), D2(N), D3(N)
      REAL parallel_time_begin, parallel_time_end
      REAL section1_time_begin, section1_time_end
      REAL section2_time_begin, section2_time_end
      REAL section3_time_begin, section3_time_end
      REAL section4_time_begin, section4_time_end
      INTEGER NTHREADS
      real sum

      PRINT *, '----- Parallel start -----'
      CALL SYSTEM_CLOCK(COUNT_RATE = TICK)
      CALL SYSTEM_CLOCK (COUNT = STARTTIME)

! Some initializations
      DO I = 1, N
        A1(I) = I + 1.5
        A2(I) = I + 22.35
        B1(I) = I + 1.5
        B2(I) = I + 22.35
        C1(I) = I + 1.5
        C2(I) = I + 22.35
        D1(I) = I + 1.5
        D2(I) = I + 22.35
      ENDDO

      NTHREADS = 1
C      NTHREADS = 2
C      NTHREADS = 3
C      NTHREADS = 4

      PRINT *, '***** Parallel Start *****'
      parallel_time_begin = omp_get_wtime()
C      CALL omp_set_num_threads(NTHREADS)

      PRINT *, '***** 1. Section Start'
      section1_time_begin = omp_get_wtime()
C$OMP PARALLEL
      DO J1 = 1, 400
        DO I1 = 1, N
          A3(I1) = A1(I1) + A2(I1)
        ENDDO
      ENDDO
C$OMP END PARALLEL
      section1_time_end = omp_get_wtime()
      PRINT *, '====> Time of 1. Section was ',
     1  section1_time_end - section1_time_begin, ' seconds <===='
      PRINT *, '***** 1. Section End'

      PRINT *, '***** 2. Section Start'
      section2_time_begin = omp_get_wtime()
C$OMP PARALLEL
      DO J2 = 1, 400
        DO I2 = 1, N
          B3(I2) = B1(I2) + B2(I2)
        ENDDO
      ENDDO
C$OMP END PARALLEL
      section2_time_end = omp_get_wtime()
      PRINT *, '====> Time of 2. Section was ',
     1  section2_time_end - section2_time_begin, ' seconds <===='
      PRINT *, '***** 2. Section End'

      PRINT *, '***** 3. Section Start'
      section3_time_begin = omp_get_wtime()
C$OMP PARALLEL
      DO J3 = 1, 400
        DO I3 = 1, N
          C3(I3) = C1(I3) + C2(I3)
        ENDDO
      ENDDO
C$OMP END PARALLEL
      section3_time_end = omp_get_wtime()
      PRINT *, '====> Time of 3. Section was ',
     1  section3_time_end - section3_time_begin, ' seconds <===='
      PRINT *, '***** 3. Section End'

      PRINT *, '***** 4. Section Start'
      section4_time_begin = omp_get_wtime()
C$OMP PARALLEL
      DO J4 = 1, 400
        DO I4 = 1, N
          D3(I4) = D1(I4) + D2(I4)
        ENDDO
      ENDDO
C$OMP END PARALLEL
      section4_time_end = omp_get_wtime()
      PRINT *, '====> Time of 4. Section was ',
     1  section4_time_end - section4_time_begin, ' seconds <===='
      PRINT *, '***** 4. Section End'

      sum = 0
      do i4 = 1,n
        sum = sum + A3(i4) + B3(i4) + C3(i4) + D3(i4)
      enddo
      print*,'Sum = ',sum

      parallel_time_end = omp_get_wtime()
      PRINT *, '====> Time of Parallel was ',
     1  parallel_time_end - parallel_time_begin, ' seconds <===='
      PRINT *, '***** Parallel end *****'

      CALL SYSTEM_CLOCK (COUNT = STOPTIME)
      TIME = REAL(STOPTIME-STARTTIME) / REAL(TICK)
      PRINT *, '>>>>> time of Parallel was ',
     1  TIME, ' seconds <<<<<'
      PRINT *, '----- Parallel End -----'
      END
----------------------------------------
----------------------------------------

Here is some basic information:
- Fortran Compiler: pgf95 9.0-4 64-bit target on x86-64 Linux -tp nehalem-64
- OS: Suse Linux
- 4 CPUs

Now my questions:

a.) For test program 1.) (see above)
It is interesting that the run time of the parallelized program is nearly the same as without OpenMP. Does anyone have an idea why? I had hoped the parallel version would be much quicker.

b.) For test program 2.) (see above)
Is it correct that with the directive "PARALLEL SECTIONS" the 4 sections are parallelized, meaning that every section runs on its own CPU (as one thread), and that in addition the DO loops within the 4 sections are parallelized too?

c.) For test program 3.) (see above)
Here I think that using several threads (2, 3 and 4) is slower than using only one thread because the overhead of OpenMP for parallelizing the DO loops is too big. Is this correct?

Thanks a lot for your help,
Alex
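On question a.): a SECTIONS construct that is not enclosed in a PARALLEL region is executed by a team of just one thread, so its sections run one after another. That would be consistent with test program 1.) taking about as long as the serial version. The following is only a minimal sketch to make the difference visible, written in free-form Fortran rather than the fixed form used above; the program name and the NUM_THREADS(2) clause are illustrative choices, not taken from the post.

```fortran
program sections_demo
  use omp_lib   ! declares omp_get_thread_num and omp_in_parallel
  implicit none

  ! Orphaned SECTIONS: there is no enclosing PARALLEL region, so the
  ! team consists of one thread and both sections run sequentially.
  !$omp sections
  !$omp section
  print *, 'orphaned section 1: thread', omp_get_thread_num(), &
           ' in parallel region:', omp_in_parallel()
  !$omp section
  print *, 'orphaned section 2: thread', omp_get_thread_num(), &
           ' in parallel region:', omp_in_parallel()
  !$omp end sections

  ! PARALLEL SECTIONS: a team of threads is created first, and the
  ! sections are then handed out to the threads of that team.
  ! num_threads(2) is an illustrative value (assumption).
  !$omp parallel sections num_threads(2)
  !$omp section
  print *, 'parallel section 1: thread', omp_get_thread_num(), &
           ' in parallel region:', omp_in_parallel()
  !$omp section
  print *, 'parallel section 2: thread', omp_get_thread_num(), &
           ' in parallel region:', omp_in_parallel()
  !$omp end parallel sections
end program sections_demo
```

With pgf95 this would presumably be compiled with the -mp flag mentioned later in the thread. Which thread picks up which section in the second construct is up to the runtime.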
From: alex-lurk on 12 Dec 2009 12:44

Hi Tim,

thanks a lot for your hint, but I don't understand why I can't learn much from my example. Could you explain your hint in more detail?

I forgot to say that I compiled all 4 examples (without and with OpenMP) with the compiler optimization "-O3", as follows:

Without OpenMP: CFLAGS=-c -O3
With OpenMP: CFLAGS=-c -O3 -mp

Thanks a lot,
Alex

On 11 Dec., 23:05, Tim Prince <TimothyPri...(a)> wrote:
> Depending on your compiler, it may be capable of optimizing away the
> loops in the non-OpenMP case. You can't learn much from this example.
From: Mark Morss on 14 Dec 2009 13:36

On Dec 12, 12:44 pm, alex-lurk <alex.l...(a)> wrote:
> Hi Tim,
>
> thanks a lot for your hint, but I don't understand why I can't learn
> much from my example.
> Could you explain your hint in more detail?
>
> I forgot to say that I compiled all 4 examples (without and with
> OpenMP) with the compiler optimization "-O3" like in the following:
> Without OpenMP: CFLAGS=-c -O3
> With OpenMP: CFLAGS=-c -O3 -mp
>
> Thanks a lot,
> Alex
>
> On 11 Dec., 23:05, Tim Prince <TimothyPri...(a)> wrote:
>
> > Depending on your compiler, it may be capable of optimizing away the
> > loops in the non-OpenMP case. You can't learn much from this example.

I've been doing a lot of parallel processing on an AIX 5.3 server with 20 ppc processors and the xlf compiler, using the OpenMP directives.

An example that works is of course useful for learning how to use OpenMP. The reason I would have said that you'll learn little, beyond that, from any simple example is that there is always a tradeoff between the overhead necessary to manage multiple threads and the direct gain from using them. Whether this works out in your favor depends on the specifics of your case, and may vary even for a given application as your input data varies. To find out whether it's worth parallelizing code there really is no substitute for just doing it and comparing what you get against highly optimized but not parallelized code. You have to pay attention to the structure of your problem and be alert for the possibility that with some data your application may run slower because you've parallelized it.

If I may digress into the realm of general advice, it's quite important to specify as private all the variables that you actually want to be private within given threads. Failure to do this will produce totally fouled up results.
Further, with xlf, my experience has been that if you have an allocatable array which is allocated before a parallel code block and then declared private, >>this array will nevertheless be treated as shared<<. I had to discover this by experience, though perhaps the xlf manual has something about it. You have to allocate the private object in each thread, making sure of course to deallocate it also in each thread.

Also, unless you're working on some sort of mega-computer, I don't think you'll miss nested OpenMP functionality very much. In general there is scant gain from having more active threads than the number of processors on your machine.
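The advice above about allocating a private array inside each thread can be sketched roughly as follows. This is not code from the thread, and it has not been checked against xlf; the names work, n and total are hypothetical, and the example is written in free-form Fortran. The array is unallocated when the parallel region starts, and each thread allocates and deallocates its own copy.

```fortran
program private_alloc_demo
  use omp_lib   ! declares omp_get_thread_num
  implicit none
  integer, parameter :: n = 1000      ! hypothetical work-array size
  integer :: i, tid
  real, allocatable :: work(:)        ! deliberately NOT allocated here
  real :: total

  total = 0.0

  !$omp parallel private(work, tid, i) reduction(+:total)
  tid = omp_get_thread_num()

  ! Each thread allocates its own private copy inside the region ...
  allocate(work(n))
  do i = 1, n
    work(i) = real(tid + 1)
  end do

  ! ... uses it (here: the mean of work, which equals tid + 1) ...
  total = total + sum(work) / real(n)

  ! ... and deallocates it again before the region ends.
  deallocate(work)
  !$omp end parallel

  print *, 'sum over threads of (thread id + 1) =', total
end program private_alloc_demo
```

If every thread really had its own copy of work, the printed value is 1 + 2 + ... + T for a team of T threads; a different value would hint at the threads overwriting one another's data.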
From: alex-lurk on 15 Dec 2009 09:07

On 13 Dec., 23:02, Tim Prince <tpri...(a)> wrote:
> DO loops will not be parallelized without the OMP DO directive.

Hi Tim,

thanks for your hint. I thought using the PARALLEL directive alone was enough. Now I have added the DO directive and it works. Below you can find the source code (as an example, only the first section):

----------------------------------------
....
....
....
      PRINT *, '***** 1. Section Start'
!$OMP PARALLEL
!$OMP DO
      DO K1 = 1, O1
        DO J1 = 1, N1
          DO I1 = 1, M1
            IF ((A1(I1,J1,K1).GT.0.0).AND.(A2(I1,J1,K1).GT.0.0)) THEN
              A3(I1,J1,K1) = (SQRT((A1(I1,J1,K1)/A2(I1,J1,K1))))
     1                     * (SQRT((A2(I1,J1,K1)/A1(I1,J1,K1))))
            ELSE
              A3(I1,J1,K1) = (SQRT((A1(I1,J1,K1)*A2(I1,J1,K1))))
     1                     * (SQRT((A1(I1,J1,K1)*A2(I1,J1,K1))))
            ENDIF
          ENDDO
        ENDDO
      ENDDO
!$OMP END DO
!$OMP END PARALLEL
      PRINT *, '***** 1. Section End'
....
....
....
----------------------------------------
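Tim's point can be made visible by counting loop iterations. The sketch below is an illustration only, in free-form Fortran, with hypothetical names (count_replicated, count_shared) and a hypothetical loop length n. Under PARALLEL alone every thread runs the complete loop, so the work is replicated rather than divided; under PARALLEL DO the iterations are split among the threads.

```fortran
program parallel_do_demo
  use omp_lib   ! declares omp_get_num_threads
  implicit none
  integer, parameter :: n = 1000   ! hypothetical loop length
  integer :: i, nthreads
  integer :: count_replicated, count_shared

  nthreads = 1
  count_replicated = 0
  count_shared = 0

  ! PARALLEL alone: each thread of the team executes all n iterations,
  ! so the total is n times the team size of this region.
  !$omp parallel private(i) reduction(+:count_replicated)
  !$omp single
  nthreads = omp_get_num_threads()
  !$omp end single
  do i = 1, n
    count_replicated = count_replicated + 1
  end do
  !$omp end parallel

  ! PARALLEL DO: the n iterations are divided among the threads,
  ! so the total is n regardless of the team size.
  !$omp parallel do reduction(+:count_shared)
  do i = 1, n
    count_shared = count_shared + 1
  end do
  !$omp end parallel do

  print *, 'threads in first region     :', nthreads
  print *, 'iterations with PARALLEL    :', count_replicated
  print *, 'iterations with PARALLEL DO :', count_shared
end program parallel_do_demo
```

This matches the timings reported for test program 3.): without the DO directive, adding threads adds redundant work instead of sharing it. Separately, the OpenMP specification declares omp_get_wtime as returning DOUBLE PRECISION, so storing its result in REAL variables, as the test programs above do, may cost timing precision.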
From: alex-lurk on 16 Dec 2009 07:05

On 14 Dec., 19:36, Mark Morss <mfmo...(a)> wrote:
> An example that works is of course useful for learning how to use
> openmp. The reason I would have said that you'll learn little, beyond
> that, from any simple example is that there is always a tradeoff
> between the overhead necessary to manage multiple threads and the
> direct gain from using them.

Dear Mark,

thanks a lot for your hints. Yes, I am starting with an easy example to gain my first experience with OpenMP. The Fortran program which I have to parallelize is very complicated. It is a very old Fortran module. In the next few days I will start to parallelize it with the help of OpenMP. I will keep you informed.

Many greetings
Alex