From: aminer on 3 Apr 2010 23:20

Hello,

I have cleaned up my previous posts; here is my new post, which includes my ideas etc.

Srinivas Nayak wrote in comp.programming.threads:
>Dear All,
>Please suggest a good book that teaches in great detail the
>theory behind the following.
>1. shared memory concurrent systems.
>2. message passing concurrent systems.
>3. mutual exclusion.
>4. synchronization.
>5. safety property.
>6. liveness property.
>7. fairness property.
>8. systems with code interleaving (virtual concurrency).
>9. systems with no code interleaving (true concurrency).
>10. atomic operations.
>11. critical sections.
>12. how to code a concurrent system (about programming language
>constructs available for it).
>13. how to mathematically prove the properties.
>14. how to mechanically verify the properties.
>15. blocking synchronization.
>16. non-blocking synchronization.
>17. lock-freedom.
>18. wait-freedom.
>19. deadlock-freedom.
>20. starvation-freedom.
>21. livelock-freedom.
>22. obstruction-freedom.
>Not only the concepts, but one that also teaches with a very simple
>mathematical treatment; axiomatic or linear temporal logic.
>Many of the books I came across either emphasize one or two topics
>or just provide a conceptual treatment, without mentioning how to
>code a concurrent system or check whether it is mathematically or
>manually correct.
>Please suggest any book or paper where these topics are
>comprehensively covered in great detail. Better if all of them are
>under a single cover, easy to understand under the roof of a
>unifying theory. Survey papers are also welcome.
>With regards,
>Srinivas Nayak

For boundedness and deadlock freedom - two of the most important properties - you can use Petri nets and reason about the place-invariant equations that you extract by solving:

Transpose(f) * C = 0

where C is the incidence matrix of the net. The solution vectors f are the place invariants on which you will base your reasoning.
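The place-invariant equation (the transpose of the invariant vector times the incidence matrix equals zero) can be checked mechanically. Here is a minimal Python sketch on a hypothetical two-place, two-transition net (not one of the nets from this thread): transition t1 moves a token from place P to place Q, and t2 moves it back.

```python
# Toy incidence matrix C: rows = places (P, Q), columns = transitions (t1, t2).
# t1 consumes a token from P and produces one in Q; t2 does the reverse.
C = [[-1,  1],
     [ 1, -1]]

# Candidate place invariant f: Transpose(f) * C must be the zero vector.
f = [1, 1]
assert all(sum(f[p] * C[p][t] for p in range(2)) == 0 for t in range(2))

# For any reachable marking M, f . M equals f . M0 (a constant), so every
# place with a positive coefficient in f is bounded by that constant.
M0 = [2, 0]                                   # initial marking: 2 tokens in P
const = sum(fi * mi for fi, mi in zip(f, M0))

# Fire t1 once: M' = M0 + C * [1, 0]
M1 = [M0[p] + C[p][0] for p in range(2)]
print(sum(fi * mi for fi, mi in zip(f, M1)) == const)  # True: invariant kept
```

For real nets one would compute the whole null space of Transpose(C) (e.g. with a linear algebra package) instead of checking a hand-picked f.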
You can do the same - reason about place-invariant equations - for lock-based and lock-free algorithms, and you can also use graph reduction techniques.

As an example, suppose that you solve Transpose(f) * C = 0 and find the following invariant equations, where P, Q, R, S are the places of the system:

equation 1: 2*M(P) + M(Q) + M(S) = C1 (a constant)
equation 2: M(P) + M(R) + M(S) = C2 (a constant)

Note that for each invariant f the constant equals Transpose(f) * M0, where M0 is the initial marking.

It follows from equation 2 that M(P) <= C2, M(R) <= C2 and M(S) <= C2, and from equation 1 that M(Q) <= C1. Every place is bounded by a constant, which IMPLIES that the system is structurally bounded. You reason the same way about invariant equations to prove that there is no deadlock in the system.

Now, if you follow good patterns, that is also good. And what is a good pattern? It is like a THEOREM that we apply in programming. As an example:

IF two threads want to acquire two critical sections, and the first thread tries to acquire critical section A and then critical section B, while the second thread tries to acquire B and then A, THEN you can have a deadlock in this system.

You see? It looks like this: IF the premises are met THEN something follows.

Now suppose there are many critical sections, and the first thread tries to acquire A, B, C, D, E, F, G while the second thread tries to acquire A, G, C, D, E, F, B. That is also a problem, and you can easily notice it by APPLYING the theorems that we call good patterns. Do you see why good patterns - which look like theorems - are so powerful? There are also good patterns - like theorems - to follow for false sharing etc. That is why I and others follow good patterns.
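The lock-ordering theorem above can be illustrated with a small Python sketch (the worker code is hypothetical, not from the thread): because both threads always take lock A before lock B, neither can ever hold B while waiting for A, so the circular wait needed for a deadlock cannot form and the run always terminates.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def worker(name, results):
    # Every thread acquires the critical sections in the SAME global order:
    # always A first, then B. Reversing the order in one thread reintroduces
    # the deadlock described in the post.
    for _ in range(1000):
        with lock_a:
            with lock_b:
                results.append(name)

results = []
threads = [threading.Thread(target=worker, args=(n, results)) for n in ("t1", "t2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 2000: both threads finished, no deadlock
```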
MC also wrote:
> Dear all,
> Following on the post of Srinu: I am a beginner in multithreaded
> programming. I have been looking for a good book about the basic
> concepts of multithreading. I recently bought Programming with
> POSIX Threads by Butenhof. I didn't quite like that book; what I am
> looking for is a book which explains multithreaded programming
> conceptually and also gives good concrete examples. Can anybody
> please suggest a book?
> Thanks,

I will just give one piece of advice. To learn more about parallel programming, read the old posts in comp.programming.threads and the other forums that discuss parallel programming. Read them carefully - as I did myself - use LOGIC, REASON about them, and try to EXTRACT the good patterns of parallel programming from them and understand them.

Also, look at parallel code, for example mine:

http://pages.videotron.com/aminer/
http://pages.videotron.com/aminer/threadpool.htm
http://pages.videotron.com/aminer/parallelhashlist/queue.htm

Read inside my parallel code and the parallel code of others, and try to EXTRACT and UNDERSTAND those good patterns.

Good patterns in parallel programming are like theorems: IF the premises are met THEN something follows. As an example, take the following page:

http://blogs.msdn.com/visualizeparallel/

and read this:

"It is critical to be able to spot data parallelism when you see it because data parallel algorithms allow the developer to more easily construct efficient and safe code.
As opposed to the more complex solutions employed against task parallelism, data parallelism allows the programmer to perform the same operation on each piece of data concurrently without concern for race conditions and, consequently, without the need for synchronization, which results in significant performance overhead. Arguably, data parallel algorithms perform better (due to the lack of synchronization) and are easier for the developer to implement."

So, tell me MC, what can you EXTRACT from this? You can extract something like a theorem to follow:

[1] IF your algorithm exhibits more data parallelism THEN it will be more efficient - it will perform better - due to the lack of synchronization.

Hence, following theorem [1] is a good pattern in parallel programming. Do you understand now? You have to be smart and start to extract those theorems - good patterns to follow - from programming code, articles, forums etc.

How can this theorem be understood with a mathematical equation? Easy: if your algorithm exhibits more data parallelism, then the serial proportion S will be smaller in Amdahl's equation

speedup = 1 / (S + P/N)

where N is the number of cores/processors, and hence the algorithm will scale better. And, as you have noticed, this is exactly what theorem [1] states. The same goes for the other theorems: on deadlock, false sharing etc.

Skybuck also wrote:
> What if people wanna roll there own versions ?
> ;)
> They would much better be "served" by algorithms/pseudo
> code than real code which could be system/language specific ;)

It is easy to EXTRACT the algorithms from Object Pascal code. Look for example inside pbzip.pas; I am using this in the main body of my program:

name:='msvcr100.dll';

It is the test file that I am using - it is inside the zip file too - and once you compile and execute pbzip.pas it will generate a file msvcr100.dll.bz. And as you have noticed, I am using a portable compound filesystem; look at ParallelStructuredStorage.pas inside the zip file. After that I open the file with:

fstream1:=TFileStream.create(name, fmOpenReadWrite);

and I read chunks of streams and distribute them to my Thread Pool Engine to be compressed - in parallel - by the myobj.BZipcompress method; look at:

for i:=0 to e do
begin
  if (i=e) and (r=0) then break;
  stream1:=TMemoryStream.create;
  if (r > 0) and (i=e)
  then stream1.copyfrom(fstream1,r)
  else stream1.copyfrom(fstream1,d);
  stream1.position:=0;
  obj:=TJob.create;
  obj.stream:=stream1;
  obj.directory:=directory;
  obj.compressionlevel:=9;
  obj.streamindex:=inttostr(i);
  obj.r:=r;
  obj.number:=e;
  TP.execute(myobj.BZipcompress,pointer(obj));
end;

I am doing the same thing in PZlib.pas...
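The chunk-and-distribute loop above can be sketched in Python (a hypothetical stand-in for pbzip.pas/pzlib.pas, using the standard zlib module and a standard thread pool rather than the author's Thread Pool Engine; CPython's zlib releases the GIL during compression, so threads genuinely overlap here):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 64 * 1024  # fixed chunk size, like the post's stream chunks

def compress_chunks(data, level=9, workers=4):
    """Split data into chunks and compress each one independently in a
    thread pool, mirroring the 'read a chunk, hand it to the pool' loop."""
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: zlib.compress(c, level), chunks))

def decompress_chunks(blobs):
    # Each chunk was compressed independently, so each decompresses alone
    # (chunks could just as well be decompressed in a pool, as pzlib.pas does).
    return b"".join(zlib.decompress(b) for b in blobs)

data = b"parallel compression " * 50000
assert decompress_chunks(compress_chunks(data)) == data
```

Note the design point this shares with the post: each chunk is an independent piece of data, so the workers need no synchronization among themselves - the data-parallel pattern of theorem [1].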
http://pages.videotron.com/aminer/

And after that I read those compressed files back from the compound filesystem - look inside pzlib.pas - and I distribute them, as streams, to my Thread Pool Engine to be decompressed by the myobj.Zlibdecompress method; look at:

------------------------------------
names:=TStringList.create;
storage.foldernames('/',names);
len:=strtoint(names[0]);
if r=0 then len:=len+1;
for i:=0 to len do
begin
  if (i=len) and (r=0) then break;
  obj:=TJob.create;
  obj.directory:=directory;
  obj.streamindex:=inttostr(i);
  obj.index:=i;
  obj.number:=e;
  obj.r:=r;
  TP.execute(myobj.Zlibdecompress,pointer(obj));
end;
--------------------------------------------------

I wrote:
> And as you have noticed i am using a portable
> compound filesystem, look at ParallelStructuredStorage.pas
> inside the zip file.

Why? Because you can parallel compress your files, store the resulting .zlb (zlib) or .bz (bzip) compressed files in a portable compound filesystem, and after that distribute the compound filesystem. And of course you can uncompress files - or all the content of your compound filesystem - from it. All of that is easy with Parallel Compression 1.0 :)

Skybuck wrote:
>[...] an algorithm really ;)
>What's so special about it ?

Parallel bzip and zlib is not just pbzip.pas and pzlib.pas: the parallel bzip and zlib algorithm includes my Thread Pool Engine algorithm plus my Parallel Queue algorithm. I am calling it an algorithm because it uses a finite number of instructions and rules to solve a problem - parallel compression and decompression. Do you understand? And as I said, you can parallel compress your files, store the compressed files in a portable compound filesystem, and after that distribute your compound filesystem.
And of course you can uncompress files - or all the content of your compound filesystem - from the compound filesystem.

Skybuck wrote:
> I see a whole bunch of pascal/delphi files thrown together,
> a whole bunch of dll's and god-forbid ms*.dll files...

Those DLLs are mandatory for now, and you can easily write a batch file etc. and reorganize.

> I see some "test programs" which are described as "modules" which they
> simply are not...

It is VERY easy to convert pzlib.pas and pbzip.pas to units, and that is what I will do in the next step. Parallel Compression 1.0 will still be enhanced in the future.

> It shouldn't be that hard... set your editor to "use tab character" (turn
> tabs to spaces off)

I am not using the Delphi editor, just notepad.exe or write.exe, and I am compiling from the DOS prompt.

> So far it seems like you are inserting your threads/synchronizations
> everywhere in single-thread-design algorithms ?

No, it is not just inserting threads and synchronization; I have reasoned, and used logic. Look for example at parallelhashlist.pas inside the zip file: I am using MREWs (multi-read exclusive-write locks) carefully, in the right places, and I have also modified the serial code a little bit; it uses a hash-based method with an array of MREWs. The Thread Pool Engine I constructed from zero, and in it I used my ParallelQueue - an efficient lock-free queue - etc. The parallel bzip and zlib I constructed on top of my Thread Pool Engine. That is not just 'inserting' threads and synchronization.

Skybuck wrote:
> But my estimate would be that for now on low core systems... the
> "compression" would take far more time...

No. pbzip.pas gave, for example, a 3.3x speedup on 4 cores:

http://pages.videotron.com/aminer/ParallelCompression/parallelbzip.htm

Skybuck wrote:
> [...] or anything extraordinary...

Don't be stupid, Skybuck. It is in fact:
1- Useful
2- A good thing for educational purposes.
Skybuck wrote:
> The thread pool concept is retarded.
> Any good delphi programmer is capable of creating an array of threads.
> So my advice to you:
> 1. Delete your thread pool, because it's junk.
> 2. Write a serious/big application that uses many threads,
> and simply derive from TThread to see how easy it is.

How can you be so stupid? My Thread Pool Engine is not just an array of threads: it uses efficient lock-free queues - for example the lock-free ParallelQueue - one for each worker thread, and it uses work-stealing for more efficiency. And it eases the work for you - you can reuse the TThreadPool class - and it is very useful. Please read again:

http://pages.videotron.com/aminer/threadpool.htm

Skybuck wrote in alt.comp.lang.borland-delphi:
> " My Thread Pool Engine is not just an array of threads "
>
> To me it is.

You really don't know what you are talking about. The principal threat to scalability in concurrent applications is the exclusive resource lock, and there are three ways to reduce lock contention:

1- Reduce the duration for which locks are held
2- Reduce the frequency with which locks are requested, or
3- Replace exclusive locks with coordination mechanisms that permit greater concurrency.

With low, moderate AND high contention, my ParallelQueue offers better scalability - and I am using it inside my Thread Pool Engine. Because my ParallelQueue uses a hash-based method - and lock striping - and just a LockedInc(), I am REDUCING the duration for which locks are held AND REDUCING the frequency with which locks are requested; hence I am REDUCING the contention A LOT, so it is very efficient.

And as I stated before, this is a law or theorem to apply:

[3] IF there is LESS contention THEN the algorithm will scale better.

This is due to the fact that S (the serial part) becomes smaller with less contention, and as N becomes bigger, the result of Amdahl's equation 1/(S+(P/N)) - the speedup of the program or algorithm - becomes bigger.
That is why my ParallelQueue has scored 7 million pop() transactions per second - better than flqueue and RingBuffer. Look at:

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

Also, my Thread Pool Engine uses those efficient lock-free queues - for example the lock-free ParallelQueue - one for each worker thread, to minimize contention, and it uses work-stealing, so my Thread Pool Engine is very efficient. And it eases the work for you - you can reuse the TThreadPool class - and it is very useful. So don't be stupid, Skybuck.

http://pages.videotron.com/aminer/

I wrote:
> Because my ParallelQueue is using an hash based method
> - and lock striping - and using just a LockedInc() , so,
> i am REDUCING the duration for which locks are held AND REDUCING
> the frequency with which locks are requested, hence i am
> REDUCING A LOT the contention, so it's very efficient.

With low, moderate AND high contention, my ParallelQueue offers better scalability, and I am using it inside my Thread Pool Engine. As you have noticed, I am using low to medium contention in the following test:

http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm

But I predict that under HIGH contention the push() and pop() will score even better than that. Why? Because my ParallelQueue uses a hash-based method - and lock striping - and just a LockedInc(), so I am REDUCING the duration for which locks are held AND REDUCING the frequency with which locks are requested; hence I am REDUCING the contention A LOT, so it is very efficient. And as I stated before, this is a law or theorem to apply:

[3] IF there is LESS contention THEN the algorithm will scale better.

This is due to the fact that S (the serial part) becomes smaller with less contention, and as N becomes bigger, the result of Amdahl's equation 1/(S+(P/N)) becomes bigger.
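The lock-striping idea described here can be sketched as follows. This is a simplified, hypothetical Python model of such a queue, not the author's Object Pascal ParallelQueue: an atomic-style counter (standing in for LockedInc()) picks a stripe, and each stripe has its own short-lived lock, so no single lock is held long or requested often. Note that ordering across stripes is only approximately FIFO.

```python
import itertools
import threading

class StripedQueue:
    """A queue with N independently locked stripes. Each push/pop takes
    exactly one short, striped lock, lowering both the hold time and the
    request frequency on any single lock (less contention -> better scaling)."""

    def __init__(self, stripes=8):
        self._stripes = [([], threading.Lock()) for _ in range(stripes)]
        self._push_ticket = itertools.count()   # stands in for LockedInc()
        self._pop_ticket = itertools.count()

    def push(self, item):
        items, lock = self._stripes[next(self._push_ticket) % len(self._stripes)]
        with lock:
            items.append(item)

    def pop(self):
        items, lock = self._stripes[next(self._pop_ticket) % len(self._stripes)]
        with lock:
            return items.pop(0) if items else None

q = StripedQueue(stripes=4)
for i in range(16):
    q.push(i)
got = [q.pop() for _ in range(16)]
print(sorted(got) == list(range(16)))  # True
```

A production version would need a real atomic counter and a strategy for pops that land on an empty stripe; the sketch only shows how striping divides one contended lock into many lightly contended ones.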
------------------------

Hello again,

Now, as I have stated before:

[3] IF there is LESS contention THEN the algorithm will scale better.
This is due to the fact that S (the serial part) becomes smaller with less contention, and as N becomes bigger, the result of Amdahl's equation 1/(S+(P/N)) becomes bigger.

And, as you have noticed, I followed this theorem [3] when I constructed my Thread Pool Engine. Now there is another theorem that I can state like this:

[4] You have latency and bandwidth, so IF you use one or both of them - latency and bandwidth - efficiently THEN your algorithm will be more efficient.

That is why you must not start too many threads in my Thread Pool Engine: if you context switch a lot, the latency will grow, and this is not good for efficiency. You have to be smart.

And as I have stated before: IF you follow and base your reasoning on those theorems - or laws, or true propositions, or good patterns, like theorems [1], [2], [3], [4]... - THEN you will construct a model that is much more CORRECT and EFFICIENT.

Take care...

-----------------------------

Hello again,

Sorry for my English, but I will continue to explain my ideas using logic and reasoning. As you already know, we have the two notions of 'Time' - we have time because there is movement of matter - and 'Space', and we have the two notions that we call 'Correctness' and 'Efficiency'. And, as you have noticed, I have stated the following theorems:

[1] IF your algorithm exhibits more data parallelism THEN it will be more efficient.

[2] IF two or more processes or threads use the same critical sections THEN they - the processes or threads - must take them in the same order to avoid deadlock in the system.

[3] IF there is LESS contention THEN the algorithm will scale better.
This is due to the fact that S (the serial part) becomes smaller with less contention, and as N becomes bigger, the result of Amdahl's equation 1/(S+(P/N)) becomes bigger.

[4] You have latency and bandwidth, so IF you use one or both of them - latency and bandwidth - efficiently THEN your algorithm will be more efficient.

etc.

Why am I calling them theorems? You can also call them rules, laws, or true propositions. Now I can classify theorem [2] in the set that I call 'correctness' - it states something about correctness - and theorems [1], [3] and [4] in the set that I call 'efficiency' - they state something about efficiency. But you have to be smart now: if you have noticed, theorems [1] and [3] are in fact special cases of theorem [4].

And as I have stated before: IF you follow and base your reasoning on those theorems - or laws, or true propositions, or good patterns, like rules or theorems [1], [2], [3], [4]... - THEN you will construct a model that is much more CORRECT and EFFICIENT. It is one of my preferred methodologies in programming.

Sincerely,
Amine Moulay Ramdane

-----------------------

Hello,

I am still thinking and using logic. I can add the following rules:

[5] IF you are using a critical section or spinlock and there is high contention on it - with many threads - THEN there is a possibility of a lock convoy. This is due to the fact that a thread inside the spinlock or critical section may be context switched out, which adds to the service time - and to the S (serial part) of Amdahl's equation - and this raises the contention, creates the possibility of a lock convoy, and leads to bad scalability.

We can alleviate the problem in [5] by using a Mutex or a Semaphore around the critical section or the spinlock.

Another rule now:

[6] IF there is contention on a lock - a critical section -
and inside the locked section you are doing I/O - for example logging a message to a file - THEN the calling thread will block on the I/O and the operating system will deschedule the blocked thread until the I/O completes. This situation leads to more context switching, and therefore to an increased service time; longer service times, in this case, mean more lock contention, and more lock contention means bad scalability.

There is also false sharing etc.

IF you follow and base your reasoning on those theorems - or laws, or true propositions, or good patterns, like rules or theorems [1], [2], [3], [4], [5], [6]... - THEN you will construct a model that is much more CORRECT and EFFICIENT. And it is one of my preferred methodologies in programming. I will try to add more of those rules, theorems, laws... next time.

Sincerely,
Amine Moulay Ramdane.
From: aminer on 4 Apr 2010 00:29

Hello,

Now, if you have noticed, I am using 'logic' - and it is logic that gave birth to mathematics. As an example, in logic we have the following law, a tautology:

((p -> q) and (not(p) -> q)) is equivalent to q

Now, like in logic, I have followed the same kind of proof by deduction. As an example, I said:

"Because my ParallelQueue is using a hash-based method - and lock striping - and just a LockedInc(), I am REDUCING the duration for which locks are held AND REDUCING the frequency with which locks are requested, hence I am REDUCING the contention A LOT, so it is very efficient. And as I stated before, this is a law or theorem to apply:

[3] IF there is LESS contention THEN the algorithm will scale better.
This is due to the fact that S (the serial part) becomes smaller with less contention, and as N becomes bigger, the result of Amdahl's equation 1/(S+(P/N)) becomes bigger.

That is why my ParallelQueue has scored 7 million pop() transactions per second - better than flqueue and RingBuffer."

So, as you have noticed, I am using Amdahl's law to prove theorem [3]; that is the same thing as a proof by deduction. IF you follow and base your reasoning on those theorems - or laws, or true propositions, or good patterns, like rules or theorems [1], [2], [3], [4], [5], [6]... - THEN you will construct a model that is much more CORRECT and EFFICIENT. And it is one of my preferred methodologies in programming.

Sincerely,
Amine Moulay Ramdane
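The tautology quoted above can be machine-checked by simply enumerating all truth assignments:

```python
from itertools import product

def implies(a, b):
    # Material implication: a -> b is false only when a is true and b is false.
    return (not a) or b

# Verify the classical equivalence: ((p -> q) and (not p -> q)) <=> q
for p, q in product([False, True], repeat=2):
    lhs = implies(p, q) and implies(not p, q)
    assert lhs == q

print("equivalence holds for all truth assignments")
```

This works because when q is false, the two implications reduce to (not p) and p, whose conjunction is always false.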
From: aminer on 4 Apr 2010 01:18

Hello,

I have stated the following theorems and rules:

[1] IF your algorithm exhibits more data parallelism THEN it will be more efficient.

[2] IF two or more processes or threads use the same critical sections THEN they - the processes or threads - must take them in the same order to avoid deadlock in the system.

[3] IF there is LESS contention THEN the algorithm will scale better. This is due to the fact that S (the serial part) becomes smaller with less contention, and as N becomes bigger, the result of Amdahl's equation 1/(S+(P/N)) becomes bigger.

[4] You have latency and bandwidth, so IF you use one or both of them - latency and bandwidth - efficiently THEN your algorithm will be more efficient.

[5] IF you are using a critical section or spinlock and there is high contention on it - with many threads - THEN there is a possibility of a lock convoy. This is due to the fact that a thread inside the spinlock or critical section may be context switched out, which adds to the service time - and to the S (serial part) of Amdahl's equation - raises the contention, creates the possibility of a lock convoy, and leads to bad scalability. We can alleviate the problem in [5] by using a Mutex or a Semaphore around the critical section or the spinlock.

[6] IF there is contention on a lock - a critical section - and inside the locked section you are doing I/O - for example logging a message to a file - THEN the calling thread will block on the I/O and the operating system will deschedule the blocked thread until the I/O completes. This leads to more context switching, and therefore to an increased service time; longer service times mean more lock contention, and more lock contention means bad scalability.

etc.

So, as in logic, you can reason by deduction like this:

If [1] AND [3] hold THEN your algorithm is much more efficient.
If the lock-ordering rule of [2] is violated AND the conditions of [5] hold THEN you can have a deadlock, the possibility of a lock convoy, and bad scalability.

If [6] holds THEN you have a scalability problem.

etc.

IF you follow and base your reasoning on those theorems - or laws, or true propositions, or good patterns, like rules or theorems [1], [2], [3], [4], [5], [6]... - THEN you will construct a model that is much more CORRECT and EFFICIENT. And it is one of my preferred methodologies in programming. I will try to add more of those rules, theorems, laws... next time.

Sincerely,
Amine Moulay Ramdane.
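The Amdahl arithmetic behind theorems [1] and [3] is easy to tabulate. A small sketch (the serial fractions below are illustrative values, not measurements from this thread):

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Amdahl's equation: speedup = 1 / (S + P/N), with P = 1 - S."""
    p = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + p / n_cores)

# Shrinking the serial fraction S (e.g. by removing synchronization in a
# data-parallel algorithm, or by reducing lock contention) improves scaling
# far more than merely adding cores:
for s in (0.5, 0.1, 0.01):
    print(f"S={s:.2f}: speedup on 8 cores = {amdahl_speedup(s, 8):.2f}")
```

With S = 0.5 the 8-core speedup stays below 2x, while S = 0.01 brings it close to the ideal 8x, which is the quantitative content of "less contention scales better".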
From: aminer on 4 Apr 2010 03:43

I wrote:
> [6] If there is contention on a lock - a critical section ... -
> and inside the locked sections you are using the I/O - example
> logging a message to a file - this will lead the calling thread
> to block on the I/O and the operating system will deschedule
> the blocked thread until the I/O completes, thus this situation
> will lead to more context switching, and therefore to an increased
> service time, and longer service times, in this case, means
> more lock contention, and more lock contention means a bad
> scalability.

You can alleviate problem [6] by using, for example, a lock-free queue - the lock-free ParallelQueue or another - with multiple consumers pushing the messages and one worker thread doing the job - logging the messages to a file.

Sincerely,
Amine Moulay Ramdane.
From: aminer on 4 Apr 2010 10:26

I wrote:
> You can alleviate problem [6] by using, for example, a lock-free queue
> - the lock-free ParallelQueue or another - with multiple consumers pushing
> the messages and one worker thread doing the job - logging the messages
> to a file.

I mean multiple 'producers' pushing the messages - to the lock-free queue - and one consumer/worker doing the job of logging the messages to a file.

Sincerely,
Amine Moulay Ramdane.
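The fix described here - many producers, one consumer that owns the slow I/O - looks like this in Python. Note that the standard queue.Queue is lock-based rather than lock-free, but the producer/consumer shape is the same; the names and the list standing in for a log file are illustrative:

```python
import queue
import threading

log_q = queue.Queue()
lines = []

def logger():
    # Single consumer: the ONLY thread that touches the (slow) I/O, so no
    # producer ever blocks on I/O while holding a contended lock.
    while True:
        msg = log_q.get()
        if msg is None:          # sentinel: shut down
            break
        lines.append(msg)        # stands in for a file write

def producer(name):
    for i in range(100):
        log_q.put(f"{name}:{i}")  # cheap enqueue, never blocks on I/O

consumer = threading.Thread(target=logger)
consumer.start()
producers = [threading.Thread(target=producer, args=(f"p{n}",)) for n in range(4)]
for t in producers:
    t.start()
for t in producers:
    t.join()
log_q.put(None)                   # tell the logger to stop
consumer.join()
print(len(lines))  # 400
```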