Prev: Type of the target of an assignment shall not be abstract
Next: Is there an Ada compiler whoseAda.Numerics.Generic_Elementary_Functions.Log(Base=>10, X=>variable) isefficient?
From: Colin Paul Gloster on 15 Feb 2010 05:58 Hello, I have been improving a program by suppressing C++ in it. After speeding it up a lot by making changes, I have found one considerable part which calls a library routine, which is unfortunately very slow as provided as standard with a number of Ada compilers but which is very fast with implementations of other languages. For some Ada compilers perhaps I shall simply need to not rely on an implementation of this routine by an Ada vendor. Perhaps this post will motivate Ada vendors to speed up their implementations or for people to report timings for efficient implementations which are used by default by other Ada compilers. At present the routine (which is still being called by a C++ part of the code) takes up maybe approximately one per cent of the running time. (I have profiled it but I do not have the timing profiles available where I am typing this.) This might not sound as if it is much and it is certainly much less than much of the C++ overhead which I have removed, but it is trivial to speed it up and speeding up this one per cent or so would reduce running time by hours. To contrast different implementations, I have produced the C++ program #include<iostream> #include<cmath> int main() { double answer = 0.0; for(int I=1; I<=1000000; ++I) { #include"body_in_CPlusPlus.cc" } std::cout << answer; } where body_in_CPlusPlus.cc contained answer += std::log10(0.10000000000000000000 ); /*Lines produced by for((i=1; i<=500; ++i)) do echo -n 'answer += std::log10(0' ; echo "$i/10" | bc -l ; echo ');'; done */ answer += std::log10(050.00000000000000000000 ); and the Ada program with Ada.Numerics.Generic_Elementary_Functions; with Interfaces.C; with Ada.Text_IO; procedure Logarithmic_Work_In_Ada is answer : Interfaces.C.double := 0.0; package double_library is new Ada.Numerics.Generic_Elementary_Functions(Interfaces.C.double); package double_output_library is new Ada.Text_IO.Float_IO(Interfaces.C.double); begin for I in 1 .. 1_000_000 loop --I would not have this loop but one compiler crashed if --the source file was sized a few megabytes. answer := Interfaces.C."+"(answer, double_library.log(0.10000000000000000000 , 10.0 )); --Lines produced by --for((i=1; i<=500; ++i)) do echo -n 'answer := Interfaces.C."+"(answer, double_library.log(0' ; echo "$i/10" | bc -l ; echo ', 10.0 ));'; done answer := Interfaces.C."+"(answer, double_library.log(050.00000000000000000000 , 10.0 )); end loop; double_output_library.Put(answer); end; The aforementioned code which I have been speeding up did not consist of merely 500 million calls to std::log10() as consecutive calls, but instead it consisted of hundreds of times as many calls to std::log10() interspersed across maybe 200 other functions. Of the two programs shown, the fastest C++ implementation on one test platform took less than one millisecond and the fastest Ada implementation took one minute and 31 seconds and 874 milliseconds on the same platform. Both g++ and gnatmake were from the same installation of GCC 4.1.2 20080704 (Red Hat 4.1.2-44). g++ -O3 -ffast-math logarithmic_work_in_CPlusPlus.cc -o logarithmic_work_in_CPlusPlus_with_-ffast-math time ./logarithmic_work_in_CPlusPlus_with_-ffast-math 6.34086e+08 real 0m0.005s user 0m0.000s sys 0m0.000s g++ -O3 logarithmic_work_in_CPlusPlus.cc -o logarithmic_work_in_CPlusPlus time ./logarithmic_work_in_CPlusPlus 6.34086e+08 real 0m46.443s user 0m46.435s sys 0m0.000s gnatmake -O3 -ffast-math logarithmic_work_in_Ada.adb -o logarithmic_work_in_Ada_compiled_by_GNAT_with_-ffast-math (which did gcc -c -O3 -ffast-math logarithmic_work_in_Ada.adb logarithmic_work_in_Ada.adb:4:11: warning: file name does not match unit name, should be "logarithmic_work_in_ada.adb" gnatbind -x logarithmic_work_in_Ada.ali gnatlink logarithmic_work_in_Ada.ali -ffast-math -o logarithmic_work_in_Ada_compiled_by_GNAT_with_-ffast-mat whereas trying -cargs resulted in no compilation) time ./logarithmic_work_in_Ada_compiled_by_GNAT_with_-ffast-math 6.34086408536266E+08 real 1m31.879s user 1m31.874s sys 0m0.000s gnatmake -O3 logarithmic_work_in_Ada.adb -o logarithmic_work_in_Ada_compiled_by_GNAT time ./logarithmic_work_in_Ada_compiled_by_GNAT 6.34086408536266E+08 real 1m33.338s user 1m33.338s sys 0m0.000s proprietary_compiler -O -m logarithmic_work_in_Ada.adb -o logarithmic_work_in_Ada_compiled_by_a_proprietary_compiler time ./logarithmic_work_in_Ada_compiled_by_a_proprietary_compiler 6.34086408605593E+08 real 1m41.966s user 1m41.954s sys 0m0.000s
From: Colin Paul Gloster on 15 Feb 2010 09:17 On Mon, 15 Feb 2010, John B. Matthews wrote: |--------------------------------------------------------------------| |"In article | |<alpine.LNX.2.00.1002151055530.17315(a)Bluewhite64.example.net>, | | Colin Paul Gloster <Colin_Paul_Gloster(a)ACM.org> wrote: | | | |> gnatmake -O3 logarithmic_work_in_Ada.adb -o | |> logarithmic_work_in_Ada_compiled_by_GNAT | |> | |> time ./logarithmic_work_in_Ada_compiled_by_GNAT | |> 6.34086408536266E+08 | |> | |> real 1m33.338s | |> user 1m33.338s | |> sys 0m0.000s | | | |I get a different answer: 698970 = 1000000 * (log10(50) - 1). | | | |$ make clean logada ; time ./logada | |rm -f *.o *.ali b~* core logada | |gnatmake logada -cargs -O3 -gnatwa -bargs -shared -largs -dead_strip| |gcc -c -O3 -gnatwa logada.adb | |gnatbind -shared -x logada.ali | |gnatlink logada.ali -shared-libgcc -dead_strip | | 6.98970004334243E+05 | | | |real 0m0.138s | |user 0m0.136s | |sys 0m0.002s | | | |-- | |John B. Matthews | |trashgod at gmail dot com | |<http://sites.google.com/site/drjohnbmatthews> " | |--------------------------------------------------------------------| Hi, Thanks for running it. The Ada program for timing had 500 statements in the body of the loop. I reproduced only the first and last verbatim: I showed a bash (Bourne Again SHell) line for producing all 500 statements in a comment in-between the the first statement and the last statement. You ran a program with 498 of the statements missing. Anyway, the answer produced by the program is not so much of concern as the relative speeds of different implementations. Did g++ produce a faster result for you than GNAT? It did for me for many versions of GCC today on a different platform than I used in the beginning of this thread... g++ -O3 -ffast-math logarithmic_work_in_CPlusPlus.cc -o logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.4.3_with_-ffast-math time ./logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.4.3_with_-ffast-math 6.34086e+08 real 0m0.012s user 0m0.008s sys 0m0.004s g++ -O3 -ffast-math logarithmic_work_in_CPlusPlus.cc -o logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.3.2_with_-ffast-math time ./logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.3.2_with_-ffast-math 6.34086e+08 real 0m0.012s user 0m0.012s sys 0m0.000s g++ -O3 logarithmic_work_in_CPlusPlus.cc -o logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.3.2 time ./logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.3.2 6.34086e+08 real 0m0.567s user 0m0.560s sys 0m0.000s g++ -O3 logarithmic_work_in_CPlusPlus.cc -o logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.4.3 time ./logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.4.3 6.34086e+08 real 0m0.566s user 0m0.564s sys 0m0.004s g++ -O3 logarithmic_work_in_CPlusPlus.cc -o logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.3.3 time ./logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.3.3 6.34086e+08 real 0m0.583s user 0m0.572s sys 0m0.004s gnatmake -O3 -ffast-math Logarithmic_Work_In_Ada.adb -o Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.4.3_with_-ffast-math time ./Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.4.3_with_-ffast-math 6.34086408536266E+08 real 0m31.618s user 0m31.618s sys 0m0.000s gnatmake -O3 Logarithmic_Work_In_Ada.adb -o Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.4.3 time ./Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.4.3 6.34086408606382E+08 real 0m32.750s user 0m32.746s sys 0m0.004s gnatmake -O3 Logarithmic_Work_In_Ada.adb -o Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.3.2 time ./Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.3.2 6.34086408606382E+08 real 0m33.537s user 0m33.506s sys 0m0.004s gnatmake -O3 Logarithmic_Work_In_Ada.adb -o Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.3.3 time ./Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.3.3 6.34086408606382E+08 real 0m33.717s user 0m33.718s sys 0m0.000s g++ -O3 logarithmic_work_in_CPlusPlus.cc -o logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.2.4 time ./logarithmic_work_in_CPlusPlus_compiled_by_64bit_GCC4.2.4 6.34086e+08 real 0m34.000s user 0m33.982s sys 0m0.000s gnatmake -O3 Logarithmic_Work_In_Ada.adb -o Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.2.4 time ./Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.2.4 6.34086408606382E+08 real 0m34.037s user 0m34.034s sys 0m0.004s gnatmake -O3 -ffast-math Logarithmic_Work_In_Ada.adb -o Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.3.2_with_-ffast-math time ./Logarithmic_Work_In_Ada_compiled_by_64bit_GCC4.3.2_with_-ffast-math 6.34086408536266E+08 real 0m35.093s user 0m35.090s sys 0m0.004s
From: Jeffrey R. Carter on 15 Feb 2010 18:04 Colin Paul Gloster wrote: > with Ada.Numerics.Generic_Elementary_Functions; > with Interfaces.C; > with Ada.Text_IO; > procedure Logarithmic_Work_In_Ada is > answer : Interfaces.C.double := 0.0; > package double_library is new > Ada.Numerics.Generic_Elementary_Functions(Interfaces.C.double); > package double_output_library is new > Ada.Text_IO.Float_IO(Interfaces.C.double); > begin > > for I in 1 .. 1_000_000 loop --I would not have this loop but one > compiler crashed if > --the source file was sized a few > megabytes. > > answer := Interfaces.C."+"(answer, > double_library.log(0.10000000000000000000 > , 10.0 )); > --Lines produced by > --for((i=1; i<=500; ++i)) do echo -n 'answer := Interfaces.C."+"(answer, > double_library.log(0' ; echo "$i/10" | bc -l ; echo ', 10.0 ));'; done > answer := Interfaces.C."+"(answer, > double_library.log(050.00000000000000000000 > , 10.0 )); > > > end loop; > > double_output_library.Put(answer); > end; IIUC, your Ada program is equivalent to with Ada.Numerics.Generic_Elementary_Functions; with Ada.Text_IO; with Interfaces.C; procedure Optimization_Test is subtype Double is Interfaces.C.Double; use type Double; package Math is new Ada.Numerics.Generic_Elementary_Functions (Double); Answer : Double := 0.0; begin -- Optimization_Test for I in 1 .. 1_000_000 loop Answer := Answer + Math.Log (0.1, 10.0); for J in 1 .. 500 loop Answer := Answer + Math.Log (Double (J) / 10.0, 10.0); end loop; Answer := Answer + Math.Log (50.0, 10.0); end loop; Ada.Text_IO.Put (Answer'Img); end Optimization_Test; On my machine I get something like jrcarter(a)jrcarter-acer-laptop:~/Code$ gnatmake -O2 -gnatnp optimization_test.adb gcc-4.3 -c -O2 -gnatnp optimization_test.adb gnatbind -x optimization_test.ali gnatlink optimization_test.ali jrcarter(a)jrcarter-acer-laptop:~/Code$ time ./optimization_test 6.34785378608078E+08 real 1m33.817s user 1m33.810s sys 0m0.000s Note that suppressing runtime checks (-gnatp) is needed to be sort of equivalent to C++. A good compiler could recognize that this adds a constant amount to Answer each time through the outer loop, and the outer loop is run a static constant number of times, and might optimize it to something like with Ada.Numerics.Generic_Elementary_Functions; with Ada.Text_IO; with Interfaces.C; procedure Optimization_Test is subtype Double is Interfaces.C.Double; use type Double; package Math is new Ada.Numerics.Generic_Elementary_Functions (Double); function Inner_Loop return Double is Result : Double := 0.0; begin -- Inner_Loop for J in 1 .. 500 loop Result := Result + Math.Log (Double (J) / 10.0, 10.0); end loop; return Result; end Inner_Loop; pragma Inline (Inner_Loop); Answer : constant Double := 1.0E6 * (Math.Log (0.1, 10.0) + Inner_Loop + Math.Log (50.0, 10.0) ); begin -- Optimization_Test Ada.Text_IO.Put (Answer'Img); end Optimization_Test; This, on my machine and with the same options, gives something like jrcarter(a)jrcarter-acer-laptop:~/Code$ time ./optimization_test 6.34785378539470E+08 real 0m0.009s user 0m0.000s sys 0m0.010s I suspect this optimization is what the C++ compiler is doing, and the Ada compilers are not. Whether a similar optimization may be applied to the calculations in your real program of interest is another question. -- Jeff Carter "Friends don't let friends program in C++." Ludovic Brenta 114
From: Colin Paul Gloster on 16 Feb 2010 10:24 On Tue, 16 Feb 2010, Colin Paul Gloster alleged: |-------------------------------------------------------------------------------------------------| |"[..] | | | |time ./Optimization_Test_with_many_assignments_to_Answer_compiled_by_GCC4.4.3_with_-ffast-math | | 6.34785378608078E+08 | | | |real 0m34.319s | |user 0m34.318s | |sys 0m0.004s | |here. | | | |[..] | | | |time ./Optimization_Test_with_a_single_assignment_to_Answer_compiled_by_GCC4.4.3_with_-ffast-math| | 6.34785378539470E+08 | | | |real 0m0.002s | |user 0m0.000s | |sys 0m0.008s | |here. | | | |[..]" | |-------------------------------------------------------------------------------------------------| I apologize, those two runs were actually with GCC4.2.4.
From: Colin Paul Gloster on 16 Feb 2010 09:54
On Mon, 15 Feb 2010, Jeffrey R. Carter sent: |--------------------------------------------------------------------------------| |"[..] | | | |IIUC, your Ada program is equivalent to | | | |with Ada.Numerics.Generic_Elementary_Functions; | |with Ada.Text_IO; | | | |with Interfaces.C; | | | |procedure Optimization_Test is | | subtype Double is Interfaces.C.Double; | | use type Double; | | | | package Math is new Ada.Numerics.Generic_Elementary_Functions (Double); | | | | Answer : Double := 0.0; | |begin -- Optimization_Test | | for I in 1 .. 1_000_000 loop | | Answer := Answer + Math.Log (0.1, 10.0); | | | | for J in 1 .. 500 loop | | Answer := Answer + Math.Log (Double (J) / 10.0, 10.0);" | |--------------------------------------------------------------------------------| Aside from not encouraging division instructions when not necessary, as I mentioned on HTTP://Programmer.97Things.OReilly.com/wiki/index.php/Execution_Speed_versus_Maintenance_Effort , I wanted to not have too high overhead obscuring the focus of measuring speeds of log implementations. |--------------------------------------------------------------------------------| |" end loop; | | | | Answer := Answer + Math.Log (50.0, 10.0); | | end loop; | | | | Ada.Text_IO.Put (Answer'Img); | |end Optimization_Test; | | | |On my machine I get something like | | | |jrcarter(a)jrcarter-acer-laptop:~/Code$ gnatmake -O2 -gnatnp optimization_test.adb| |gcc-4.3 -c -O2 -gnatnp optimization_test.adb | |gnatbind -x optimization_test.ali | |gnatlink optimization_test.ali | |jrcarter(a)jrcarter-acer-laptop:~/Code$ time ./optimization_test | | 6.34785378608078E+08 | | | |real 1m33.817s | |user 1m33.810s | |sys 0m0.000s" | |--------------------------------------------------------------------------------| time ./Optimization_Test_with_many_assignments_to_Answer_compiled_by_GCC4.4.3_with_-ffast-math 6.34785378608078E+08 real 0m34.319s user 0m34.318s sys 0m0.004s here. |---------------------------------------------------------------------------------| |"Note that suppressing runtime checks (-gnatp) is needed to be sort of equivalent| |to C++." | |---------------------------------------------------------------------------------| Thanks for the tip, but I do not program in Ada to really program in C++ with Ada syntax. |--------------------------------------------------------------------------------| |"A good compiler could recognize that this adds a constant amount to Answer each| |time through the outer loop, and the outer loop is run a static constant number | |of times, and might optimize it to something like | | | |with Ada.Numerics.Generic_Elementary_Functions; | |with Ada.Text_IO; | | | |with Interfaces.C; | | | |procedure Optimization_Test is | | subtype Double is Interfaces.C.Double; | | use type Double; | | | | package Math is new Ada.Numerics.Generic_Elementary_Functions (Double); | | | | function Inner_Loop return Double is | | Result : Double := 0.0; | | begin -- Inner_Loop | | for J in 1 .. 500 loop | | Result := Result + Math.Log (Double (J) / 10.0, 10.0); | | end loop; | | | | return Result; | | end Inner_Loop; | | pragma Inline (Inner_Loop); | | | | Answer : constant Double := | | 1.0E6 * (Math.Log (0.1, 10.0) + Inner_Loop + Math.Log (50.0, 10.0) ); | |begin -- Optimization_Test | | Ada.Text_IO.Put (Answer'Img); | |end Optimization_Test; | | | |This, on my machine and with the same options, gives something like | | | |jrcarter(a)jrcarter-acer-laptop:~/Code$ time ./optimization_test | |6.34785378539470E+08 | | | |real 0m0.009s | |user 0m0.000s | |sys 0m0.010s" | |--------------------------------------------------------------------------------| time ./Optimization_Test_with_a_single_assignment_to_Answer_compiled_by_GCC4.4.3_with_-ffast-math 6.34785378539470E+08 real 0m0.002s user 0m0.000s sys 0m0.008s here. |--------------------------------------------------------------------------------| |"I suspect this optimization is what the C++ compiler is doing, and the Ada | |compilers are not." | |--------------------------------------------------------------------------------| Actually, a C++ compiler optimized out calls to std::log10(). |--------------------------------------------------------------------------------| |"Whether a similar optimization may be applied to the calculations in your real | |program of interest is another question." | |--------------------------------------------------------------------------------| Such an optimization can not help the log10 calls like that in the real code because they are spread out in conditional statements between hundreds of other statements instead of being an analytically simple sequence of millions of calls to log, as per the following... G4double G4LogLogInterpolation::Calculate(G4double x, G4int bin, const G4DataVector& points, const G4DataVector& data) const { G4int nBins = data.size() - 1; G4double value = 0.; if (x < points[0]) { value = 0.; } else if (bin < nBins) { /*..*/ value = (std::log10(d1)*std::log10(e2/x) + std::log10(d2)*std::log10(x/e1)) /*..*/ /*..*/ } and G4double G4SemiLogInterpolation::Calculate(G4double x, G4int bin, const G4DataVector& points, const G4DataVector& data) const { G4int nBins = data.size() - 1; /*C++ programmers have a way of replicating statements across functions.*/ G4double value = 0.; if (x < points[0]) { value = 0.; } else if (bin < nBins) { /*..*/ value = (d1*std::log10(e2/x) + d2*std::log10(x/e1)) /*..*/ /*..*/ } and inline size_t G4PhysicsLogVector::FindBinLocation(G4double theEnergy) const { // For G4PhysicsLogVector, FindBinLocation is implemented using // a simple arithmetic calculation. // // Because this is a virtual function, it is accessed through a // pointer to the G4PhyiscsVector object for most usages. In this // case, 'inline' will not be invoked. However, there is a possibility // that the user access to the G4PhysicsLogVector object directly and // not through pointers or references. In this case, the 'inline' will // be invoked. (See R.B.Murray, "C++ Strategies and Tactics", Chap.6.6) return size_t( std::log10(theEnergy)//*Ugh, this division is unnecessary.*/dBin - baseBin ); } and G4double G4ProtonInelasticCrossSection:: GetCrossSection(G4double kineticEnergy, G4double atomicNumber, G4double nOfProtons) { /*..*/ if(nOfNeutrons>1.5) fac2=std::log((nOfNeutrons)); /*..*/ } and G4double G4LinLogLogInterpolation::Calculate(G4double x, G4int bin, const G4DataVector& points, const G4DataVector& data) const { G4int nBins = data.size() - 1; G4double value = 0.; if (x < points[0]) { value = 0.; } else if (bin < nBins) { /*..*/ /*C++ programmers have a way of replicating statements across functions...*/ if(d1 > 0.0 && d2 > 0.0) { /*... and not remembering to include checks for mathematical well-definededness in each code clone nor indeed for each call to std::log10() in each statement guarded by two checks for mathematical well-definededness.*/ value = (std::log10(d1)*std::log10(e2/x) + std::log10(d2)*std::log10(x/e1)) /*..*/ /*..*/ } |