From: Craig Powers on 31 Mar 2010 15:08

monir wrote:
>
> 2) Here's again an abbreviated sample code for easy reference:
> (F77, g95)

The problem with the abbreviated sample is that it's so abbreviated, it cuts out the problem (as has already been said multiple times by multiple people).

I don't disagree with you that 22k-ish lines is not practical to post. However, there are a couple of things you can and *should* do:

* Try running it with absolutely every check the compiler offers turned on. That's spelled "-check all" in ifort. Other compilers may offer a similar ability with a different spelling, or they may require you to specify multiple options; RTFM. (I see below you've tried to do this and it didn't get you anywhere... in that case, see next.)
* Try to cut it down to a manageable size. In the process, maybe you'll discover what the problem is yourself. If it goes away when you take out a particular piece, that alone gives you an avenue to pursue in trying to find your problem. If you succeed in producing a manageable size example, well, now you've got something to post.

> 3) There appears to be some confusion on when the (current) program
> correctly works and when it doesn't.
> Here's a summary for clarification:
> (ref is to a SINGLE statement in the above abbreviated sample code)
>
> a) with "! pause" and "!! implicit none" NOT activated:
> .......................... program returns x = NaN
> c) with "! pause" NOT activated and "!! implicit none" Activated :
> .......................... program returns x = -1.0676971 (correct)

This is rather interesting. I don't think adding IMPLICIT NONE should change the meaning of a program that continues to compile successfully. Most compilers have an option that lets you produce assembly output; have you tried comparing the results for the routine in question with and without IMPLICIT NONE?

> 5) Some have indicated that mismatched arguments could have caused the
> error.
> A very valid point, and I've been looking at this for some time now.
> But think about it for a moment. If there are mismatched arguments,
> how would/could inserting a "Pause" statement in one of the routines
> or just adding "implicit none" in another (with no additional
> declarations) correct the mismatch and force the algorithms to work
> "perfectly" producing the correct results throughout ??
> This is the other part of the mystery!

Heisenbugs happen when the effect of the bug is to write to and/or read from memory that isn't supposed to be written to and/or read from. In that case, it becomes important how variables are laid out in memory, what is in the memory before it is read from, and so on. Adding a statement may cause the code generator to change something which isn't visible to you and changes the manifestation of the bug.
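The mechanism described above, a routine writing to memory it was not supposed to touch, can be made visible with guard cells. What follows is only a minimal sketch, not code from the program under discussion: the module name `guard_demo`, the routines `zero_fill` and `guard_intact`, and the sentinel value are all hypothetical. A 10-element payload sits inside a 12-element buffer, and the callee is told the payload is one element longer than it really is. Because the extra write still lands inside `buf`, the behaviour here is well defined; `n_claimed` must not exceed 11, otherwise the write leaves the buffer and anything can happen.

```fortran
module guard_demo
  implicit none
  real, parameter :: sentinel = -12345.0   ! hypothetical marker value
contains

  ! Zeroes a(1:n). With an explicit-shape dummy the routine simply trusts n.
  subroutine zero_fill(a, n)
    integer, intent(in) :: n
    real, intent(inout) :: a(n)
    integer :: i
    do i = 1, n
      a(i) = 0.0
    end do
  end subroutine zero_fill

  ! Returns .true. if both guard cells around a 10-element payload are
  ! still intact after zero_fill was asked to write n_claimed elements.
  logical function guard_intact(n_claimed)
    integer, intent(in) :: n_claimed       ! keep this <= 11 (see text)
    real :: buf(12)   ! buf(1) and buf(12) are guards, buf(2:11) is payload
    buf = sentinel
    call zero_fill(buf(2), n_claimed)      ! a(1) is associated with buf(2)
    guard_intact = (buf(1) == sentinel) .and. (buf(12) == sentinel)
  end function guard_intact

end module guard_demo
```

With the correct length, `guard_intact(10)` leaves both guards alone; with `guard_intact(11)` the upper guard is overwritten and the function returns `.false.`. In a real program the clobbered neighbour is whatever variable the compiler happened to place there, which is why an unrelated statement can change the symptom.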
From: dpb on 31 Mar 2010 15:17

monir wrote:
....
> 2) Here's again an abbreviated sample code for easy reference:
> (F77, g95)
>
> PROGRAM main
> ....................
> call dCpZeros()
> ......................
> End main
> ------------------------------------
> SUBROUTINE dCpZeros()
> .....................
> do i=1, 9
> do j=1, 10
> do k=1, 30
> .....................
....
> ..................
> Return
> End Subroutine dCpZeros
> ------------------------------------
> SUBROUTINE Polin2(w1, w2, w3, w4, val)
> !! implicit none
> ....................

Which contains nary a single declaration making it totally useless for anybody to look at and see what might be the argument or type or dimension mismatch that is in at least moderately high likelihood the underlying culprit.... :(

--
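For comparison, here is a minimal sketch of what a postable routine with its declarations looks like. It is not Polin2: the module `typed_stub` and the routine `average` are hypothetical stand-ins, and the types chosen are assumptions. The point is only that every dummy argument has a visible type, kind, and dimension, and that placing the routine in a module gives callers an explicit interface.

```fortran
module typed_stub
  implicit none
contains

  ! Hypothetical routine: returns in val the mean of the first n entries of w.
  subroutine average(w, n, val)
    integer, intent(in) :: n
    double precision, intent(in) :: w(n)
    double precision, intent(out) :: val
    val = sum(w) / dble(n)
  end subroutine average

end module typed_stub
```

A caller that has `use typed_stub` and passes a double precision array of four values 1, 2, 3, 4 with `call average(w, 4, val)` gets `val = 2.5`. If the caller instead passes a default REAL array, the compiler can reject the call at compile time, because the interface is known. With an external F77-style subroutine and no declarations in view, the same mistake typically compiles silently.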
From: glen herrmannsfeldt on 31 Mar 2010 15:37

monir <monirg(a)mondenet.com> wrote:
(big snip, including points 1 through 4)

> 5) Some have indicated that mismatched arguments could have caused the
> error.
> A very valid point, and I've been looking at this for some time now.
> But think about it for a moment. If there are mismatched arguments,
> how would/could inserting a "Pause" statement in one of the routines
> or just adding "implicit none" in another (with no additional
> declarations) correct the mismatch and force the algorithms to work
> "perfectly" producing the correct results throughout ??
> This is the other part of the mystery!

(snip)

Unfortunately, fairly easily.

Reminds me of a PL/I program that I wrote a loooong time ago, which used CONTROLLED variables. (PL/I equivalent to ALLOCATABLE.) The program was working fine until I changed something and then it didn't work right anymore. I don't remember how I tracked it down, but the result was that it was deallocating and later reallocating arrays of the same size and worked as long as they were reallocated in the same place! So I was lucky for a while...

Argument mismatch can easily depend on values on the stack not changing at appropriate points. PAUSE likely does a subroutine call to the routine implementing the pause operation, which involves data on the stack.

-- glen
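A related way for legacy code to depend on stack contents "not changing" is a local variable that is expected to keep its value between calls. This is an illustration of the general idea only, not a diagnosis of the program in this thread, and the names `call_counter` and `next_count` are hypothetical. Without SAVE (or an initializer), a local variable becomes undefined when the routine returns; it may appear to survive only until some intervening call, such as PAUSE, reuses the same stack storage. The sketch below shows the well-defined form.

```fortran
module call_counter
  implicit none
contains

  ! Returns 1 on the first call, 2 on the second, and so on.
  integer function next_count()
    integer, save :: n = 0   ! SAVE: the value is guaranteed to persist
    n = n + 1
    next_count = n
  end function next_count

end module call_counter
```

Three successive references to `next_count()` return 1, 2 and 3, regardless of what other routines are called in between.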
From: aerogeek on 1 Apr 2010 01:39

On Apr 1, 12:08 am, Craig Powers <craig.pow...(a)invalid.invalid> wrote:
> monir wrote:
>
> > 2) Here's again an abbreviated sample code for easy reference:
> > (F77, g95)
>
> The problem with the abbreviated sample is that it's so abbreviated, it
> cuts out the problem (as has already been said multiple times by
> multiple people).
>
> I don't disagree with you that 22k-ish lines is not practical to post.
> However, there are a couple of things you can and *should* do:
> * Try running it with absolutely every check the compiler offers turned
> on. That's spelled "-check all" in ifort. Other compilers may offer a
> similar ability with a different spelling, or they may require you to
> specify multiple options; RTFM. (I see below you've tried to do this
> and it didn't get you anywhere... in that case, see next.)
> * Try to cut it down to a manageable size. In the process, maybe you'll
> discover what the problem is yourself. If it goes away when you take
> out a particular piece, that alone gives you an avenue to pursue in
> trying to find your problem. If you succeed in producing a manageable
> size example, well, now you've got something to post.
>
> > 3) There appears to be some confusion on when the (current) program
> > correctly works and when it doesn't.
> > Here's a summary for clarification:
> > (ref is to a SINGLE statement in the above abbreviated sample code)
>
> > a) with "! pause" and "!! implicit none" NOT activated:
> > .......................... program returns x = NaN
> > c) with "! pause" NOT activated and "!! implicit none" Activated :
> > .......................... program returns x = -1.0676971 (correct)
>
> This is rather interesting. I don't think adding IMPLICIT NONE should
> change the meaning of a program that continues to compile successfully.
> Most compilers have an option that lets you produce assembly output;
> have you tried comparing the results for the routine in question with
> and without IMPLICIT NONE?
>
> > 5) Some have indicated that mismatched arguments could have caused the
> > error.
> > A very valid point, and I've been looking at this for some time now.
> > But think about it for a moment. If there are mismatched arguments,
> > how would/could inserting a "Pause" statement in one of the routines
> > or just adding "implicit none" in another (with no additional
> > declarations) correct the mismatch and force the algorithms to work
> > "perfectly" producing the correct results throughout ??
> > This is the other part of the mystery!
>
> Heisenbugs happen when the effect of the bug is to write to and/or read
> from memory that isn't supposed to be written to and/or read from. In
> that case, it becomes important how variables are laid out in memory,
> what is in the memory before it is read from, and so on. Adding a
> statement may cause the code generator to change something which isn't
> visible to you and changes the manifestation of the bug.

I had this very specific problem. A non-interfering statement, like PAUSE in your case, was causing the same problem for my code. The code was running perfectly well on a Windows system, but I saw this problem once I tried the program on a Linux system. So, if possible, can you try compiling and running your program on a different system?

> Heisenbugs happen when the effect of the bug is to write to and/or read
> from memory that isn't supposed to be written to and/or read from. In
> that case, it becomes important how variables are laid out in memory,
> what is in the memory before it is read from, and so on. Adding a
> statement may cause the code generator to change something which isn't
> visible to you and changes the manifestation of the bug.

For me the problem had something to do with incorrect array bounds, which was not apparent and didn't come to notice until I used dbx, the debugger. So get a debugger and run through the code in it for the conditions under which it is failing. I am sure you will get to the bottom of the problem.

cheers
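Incorrect array bounds of the kind mentioned above are easier to catch when the callee knows the real extent of the array instead of trusting a separately passed size. A minimal sketch, assuming an assumed-shape dummy argument inside a module; the names `bounds_demo` and `safe_set` are hypothetical and this is a swapped-in defensive technique, not something taken from the code under discussion:

```fortran
module bounds_demo
  implicit none
contains

  ! Stores val in a(idx) only if idx lies inside the bounds of a.
  ! ok reports whether the store was performed.
  subroutine safe_set(a, idx, val, ok)
    real, intent(inout) :: a(:)      ! assumed shape: extent travels with a
    integer, intent(in) :: idx
    real, intent(in) :: val
    logical, intent(out) :: ok
    ok = (idx >= lbound(a, 1)) .and. (idx <= ubound(a, 1))
    if (ok) a(idx) = val
  end subroutine safe_set

end module bounds_demo
```

For a 10-element array, `call safe_set(a, 10, 1.5, ok)` stores the value and sets `ok` to `.true.`, while `call safe_set(a, 11, 2.5, ok)` leaves the array untouched and sets `ok` to `.false.`. Assumed-shape dummies also give a compiler's run-time bounds checking the information it needs, which an explicit-shape or assumed-size dummy with a wrong size does not.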
From: monir on 2 Apr 2010 15:09

On Apr 1, 1:39 am, aerogeek <sukhbinder.si...(a)gmail.com> wrote:
> On Apr 1, 12:08 am, Craig Powers <craig.pow...(a)invalid.invalid> wrote:
> > monir wrote:
> > > 2) Here's again an abbreviated sample code for easy reference:
> > > (F77, g95)
> > The problem with the abbreviated sample is that it's so abbreviated, it
> > cuts out the problem.
> > I don't disagree with you that 22k-ish lines is not practical to post.
> > However, there are a couple of things you can and *should* do:
> > * Try running it with absolutely every check the compiler offers turned
> > on. (I see below you've tried to do this
> > and it didn't get you anywhere... in that case, see next.)
> > * Try to cut it down to a manageable size. In the process, maybe you'll
> > discover what the problem is yourself. If it goes away when you take
> > out a particular piece, that alone gives you an avenue to pursue in
> > trying to find your problem. If you succeed in producing a manageable
> > size example, well, now you've got something to post.
> >
> > > monir wrote:
> > > 3) There appears to be some confusion on when the (current) program
> > > correctly works and when it doesn't.
> > > Here's a summary for clarification:
> > > (ref is to a SINGLE statement in the above abbreviated sample code)
> > > a) with "! pause" and "!! implicit none" NOT activated:
> > > .......................... program returns x = NaN
> > > c) with "! pause" NOT activated and "!! implicit none" Activated :
> > > .......................... program returns x = -1.0676971 (correct)
> > This is rather interesting. I don't think adding IMPLICIT NONE should
> > change the meaning of a program that continues to compile successfully.
> > Most compilers have an option that lets you produce assembly output;
> > have you tried comparing the results for the routine in question with
> > and without IMPLICIT NONE?

......YES I have many times. ALL routines work perfectly when tested in isolation.
......I got the assembly output (~ 2,000 pages), but not sure what to look for ?
......For example, at the top it displays:
.........................................
.comm _abscisae_, 36000 # 36000
.comm _crt_, 496 # 484
.comm _d2cp_, 144000 # 144000
.comm _d9mach_, 160 # 152
.........................................
......ARE the above pairs of numbers (bytes?) supposed to be the same, or do they refer to something else ??

> > > monir wrote:
> > > 8) Based on my rather limited knowledge of Fortran, here's a thought
> > > for you experts to critique.
> > > As indicated earlier, the code (work-in-progress, ~ 22,000 lines and ~ 80
> > > routines) is mostly in F77, but with some limited patches of F90, e.g.;
> > > use of unlabeled loops, vectors & matrices & array operations, some new
> > > intrinsic functions, one Contains and one explicit Interface, but no
> > > modules, no dynamic arrays, no defined data types, no Pointers, no ...
> > > I've always had some suspicions about such programming practice, even
> > > though the g95 compiler never complained. But it seems reasonable to
> > > expect at some point (depending on the complexity of the code and the
> > > extent of the mix) that there would be a conflict that wouldn't be
> > > detected/resolved by the compiler, leading to possible confusion or
> > > misinterpretation or memory disruption or whatever.
> > > The "g95" compiler, or any other comparable compiler for that matter,
> > > can't possibly detect and resolve each and every conflict that might arise
> > > from a mixed F77+F90 programming. Correct ??
> > > Just a thought! ... you don't have to take it seriously if you don't
> > > want to!

> > > 5) Some have indicated that mismatched arguments could have caused the
> > > error.
> > > A very valid point, and I've been looking at this for some time now.
> > > But think about it for a moment. If there are mismatched arguments,
> > > how would/could inserting a "Pause" statement in one of the routines
> > > or just adding "implicit none" in another (with no additional
> > > declarations) correct the mismatch and force the algorithms to work
> > > "perfectly" producing the correct results throughout ??
> > > This is the other part of the mystery!

> aerogeek wrote:
> I had this very specific problem. A non interfering statement like in
> your case pause, was causing the same problem for my code.
> This code was running perfectly well in windows system but i saw this
> problem once i tried the program on a linux system.
> So if possible can you try compiling and running your program on a
> different system. If possible.

..... UNFORTUNATELY, I don't have access to other systems.

> For me the problem had something to do with incorrect array bounds,
> which was not apparant and didn't come to notice untill i used dbx,
> the debugger.
> So get a debugger and run through the code via a debugger for the
> conditions its failing. I am sure you will get to the bottom of the
> problem.

$$ ===================== $$

NOT being able so far to trap the problem or the code violation, if any, leaves me with a couple of options:

1) POST the entire F77 code as a zip file and include the input files to look at. It is a good idea, but with no documentation it would be extremely difficult even for you experts to follow the program logic. And reducing it to a meaningful size for posting while ensuring it still generates the NaN error is not an easy task, and would still be considered as an (extended) abbreviated version, and I might in the process cut out the source of the problem!

2) USE a modern debugger. In the past I used the MS Fortran metacommand "$DEBUG:" for debugging (I believe that's what it was called!), by inserting it in the source code (it could appear multiple times). It was part of the MS Fortran compiler.
What modern Fortran Debugger would you recommend (Win XP OS) ??
Is there a connection between the Fortran compiler g95 and the debugger, or does it work independently ? Does it matter if the code is F77 or F90 or F77+F90 ?? (I hope it is free!)

3) BACK to the problem in hand. The general consensus among the responders is that the problem could be attributed to:
a- declaration issues
b- arrays out of bounds
c- mismatched arguments
d- data on the stack unexpectedly or unintentionally moved around as a result of a non-interfering statement such as "PAUSE" or "IMPLICIT NONE"
e- any combinations of the above
f- none of the above!
I'm reasonably confident, after so much re-checking and testing, that it is NOT a- , b- or c- above, but I could be wrong!

4) I suggested earlier:
>... it seems reasonable to expect at some point
>(depending on the complexity of the code and the
>extent of the mix) that there would be a conflict that wouldn't be
>detected/resolved by the compiler, leading to possible confusion or
>misinterpretation or memory disruption or whatever.
>The "g95" compiler, or any other comparable compiler for that matter,
>can't possibly detect and resolve each and every conflict that might arise
>from a mixed F77+F90 programming.
>Just a thought! ... you don't have to take it seriously if you don't want to!

Richard Maine and others responded:
>>... I consider it incorrect to even label it as mixed f77+f90.
>>Almost all of f77 is also part of f95. The very few exceptions are
>>matters of mostly academic interest, as all f95 compilers do them anyway
>>and they are *NOT* things that are prone to obscure interactions. So
>>what you have is just f95 code.

5) OK. Here is my latest attempt:
a- I took a version of the offending code Test1.FOR, and made sure there was NO "PAUSE" in Sub dCpzeros() and NO "IMPLICIT NONE" in Sub Polin2()
b- re-compiled and ran the program
....got (as expected) ..... x = NaN
c- renamed the source code (self-contained single file) as Test1F.F90
....The MinGW-g95 manual states: " ... with F90 name extension, the source code is pre-processed with the C preprocessor."
Not knowing exactly what that means, I took it to imply that something is done by the g95 compiler when using the .F90 extension that otherwise is NOT done (with .FOR). Let me try it.
d- changed the F77 style to F90 style throughout, namely:
....replaced "c" in col 1 by "!"
....added "&" for continuation lines and removed the char from col 6
....deleted blanks between digits (initially there for easy reading/editing of long numbers)
......e.g.; Data GaussWg ( 7) / 0.0910282619 8296364981 1497220702 892 d0 /
...........(which is allowed in *.FOR, but gave DATA syntax error in *.F90)
.......... was changed to:
...........Data GaussWg ( 7) / 0.091028261982963649811497220702892d0 /
That was all. Nothing else was changed.
e- compiled:
....>g95 -fbounds-check -ftrace=full -o Test1F Test1F.F90
and ran. PROGRAM Works Fine!!!! returning:
............ x = -1.0676971 (correct)

6) THE above may or may not be the cure, since it does not directly support or refute the earlier suggestion (Item 4 above). Furthermore, it might be just temporarily masking the problem!

PLEASE provide at your convenience the name of a modern debugger (Item 2 above) and I will go through the code line-by-line to identify the culprit once and for all and get to the bottom of the problem in Test1.FOR.

Thank you kindly for your patience!
Monir
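Short of a debugger, one way to narrow down where the NaN first appears in Test1.FOR is to test suspect values right after each major call. The following is a minimal sketch under stated assumptions: the module `nan_trap` and the routines `is_nan` and `check_value` are hypothetical, `x` is assumed to be default REAL, and the test relies on the IEEE property that a NaN compares unequal to itself (aggressive floating-point optimization options may remove such a comparison). Compilers that provide the Fortran 2003 `ieee_arithmetic` module offer `ieee_is_nan` as an alternative.

```fortran
module nan_trap
  implicit none
contains

  ! A NaN is the only value that compares unequal to itself.
  logical function is_nan(x)
    real, intent(in) :: x
    is_nan = (x /= x)
  end function is_nan

  ! Stops the program with a message naming the checkpoint if x is a NaN.
  subroutine check_value(x, tag)
    real, intent(in) :: x
    character(len=*), intent(in) :: tag
    if (is_nan(x)) then
      print *, 'NaN detected at checkpoint: ', tag
      stop 1
    end if
  end subroutine check_value

end module nan_trap
```

A call such as `call check_value(val, 'after Polin2')` placed after each suspect routine reports the first checkpoint at which the value has gone bad. Moving the checkpoints earlier step by step is a cheap form of the "cut it down" approach suggested earlier in the thread, although inserting calls can itself shift the symptom, as PAUSE did.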