From: aminer on

Hello,


I have cleaned up my previous posts, and here is my new post
that includes my 'ideas' etc...

Srinivas Nayak wrote in comp.programming.threads:
>Dear All,
>Please suggest a good book that teaches in great details about the
>theories behind the followings.
>1. shared memory concurrent systems.
>2. message passing concurrent systems.
>3. mutual exclusion.
>4. synchronization.
>5. safety property.
>6. liveness property.
>7. fairness property.
>8. systems with code interleaving (virtual concurrency).
>9. systems with no code interleaving (true concurrency).
>10. atomic operations.
>11. critical sections.
>12. how to code a concurrent system (about programming language
>constructs available for it).
>13. how to mathematically proof the properties.
>14. how to mechanically verify the properties.
>15. blocking synchronization.
>16. non-blocking synchronization.
>17. lock-freedom.
>18. wait-freedom.
>19. deadlock-freedom.
>20. starvation-freedom.
>21. livelock-freedom.
>22. obstruction-freedom.
>Not only the concepts but also that teaches with very simple
>mathematical treatment; axiomatic or linear temporal logic.
>Many of the books I came across are either emphasize one or two topic
>or just provides a conceptual treatment, without mentioning how to
>code a concurrent system, check if it is mathematically or manually
>correct.
>Please suggest any book or paper where these topics are
>comprehensively covered in great details. Better if all these are
>under a single cover that will be easy to understand under the roof of
>a unifying theory.
>Survey papers of these are also welcome.
>With regards,
>Srinivas Nayak


For boundedness and deadlocks... - two of the most
important properties... - you can use Petri nets and reason
about the place-invariant equations that you extract from the
resolution of the following equation:

Transpose(vector) * Incidence matrix = 0

and find your vectors... on which you will base your reasoning...

You can do the same - with place-invariant equations... - and
reason about lock-based and lock-free algorithms...

And you can also use graph reduction techniques...


As an example, suppose that you resolve the equation
Transpose(vector) * Incidence matrix = 0 and find the
following equations.


Note: P, Q, R, S are all the places in the system...


equation1: 2*M(P) + M(Q) + M(S) = C1 (constant)


and


equation2: M(P) + M(R) + M(S) = C2 (constant)


Note also that, for each invariant vector f, the constant equals
Transpose(f) * M0 (f applied to the initial marking).


So it follows - from the equations - that since


M(P) + M(R) + M(S) = C2, we have
M(P) <= C2 and M(R) <= C2 and M(S) <= C2


and, from the first invariant equation, we have
that M(Q) <= C1. Every place is bounded, and this IMPLIES
that the system is structurally bounded.
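As a concrete illustration of this reasoning, here is a small Python sketch; note that the incidence matrix, the invariant vector and the markings below are invented for the example, they do not come from any real system:

```python
# A small sketch of the place-invariant check. The incidence matrix C
# has one row per place (P, Q, R, S) and one column per transition;
# C[p][t] is the net token change of place p when transition t fires.
# A vector y is a place invariant when Transpose(y) * C = 0, which
# means the weighted sum y . M is the same for every reachable marking M.

C = [            # columns: t1, t2
    [-1, +1],    # P
    [+1, -2],    # Q
    [ 0, +1],    # R   (R does not constrain this invariant)
    [+1,  0],    # S
]

def is_place_invariant(y, C):
    """True if Transpose(y) * C is the zero vector."""
    return all(sum(y[p] * C[p][t] for p in range(len(C))) == 0
               for t in range(len(C[0])))

def fire(M, C, t):
    """Marking after firing transition t (enabledness not checked)."""
    return [m + C[p][t] for p, m in enumerate(M)]

y = [2, 1, 0, 1]       # candidate invariant: 2*M(P) + M(Q) + M(S)
M0 = [1, 0, 0, 2]      # made-up initial marking

assert is_place_invariant(y, C)
c1 = sum(a * b for a, b in zip(y, M0))   # the constant C1 = y . M0
M1 = fire(M0, C, 0)                      # fire transition t1
assert sum(a * b for a, b in zip(y, M1)) == c1
print("2*M(P) + M(Q) + M(S) =", c1, "before and after firing")
```

The weighted token count stays constant no matter which transition fires, and that is exactly what bounds each place.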


That's the same thing for deadlocks: you reason
about the invariant equations to prove that there is
no deadlock in the system...


Now, if you follow good patterns, that's also good...


And what's a good pattern?


It's like a THEOREM that we apply in programming...


As an example:


Suppose that we have two threads that want to acquire
critical sections: IF the first thread tries to acquire critical
section A and after that critical section B, and the second thread
tries to acquire B and then A, THEN you can have a deadlock in
this system.


You see? It looks like this: IF the predicates are met THEN
something follows...


Now suppose there are many critical sections... and the first
thread tries to acquire A, B, C, D, E, F, G and the second thread
tries to acquire A, G, C, D, E, F, B; that's also a problem... you can
easily notice it by APPLYING the theorems that we call
'good patterns'.
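You can even apply this 'theorem' mechanically. The following Python sketch - using the made-up A, B, ... orderings from above - builds the lock-order graph of each thread's acquisitions and reports a possible deadlock when that graph has a cycle:

```python
# Sketch: checking lock orderings mechanically. Every pair (X, Y)
# where a thread requests lock Y while already holding lock X
# becomes a directed edge X -> Y; a cycle in this lock-order
# graph means a deadlock is possible.

def lock_order_edges(order):
    """Edges X -> Y for every lock Y acquired after X in one thread."""
    return {(x, y)
            for i, x in enumerate(order)
            for y in order[i + 1:]}

def has_cycle(edges):
    """Cycle check via Kahn's algorithm: a cycle exists iff a
    topological sort cannot consume every node."""
    nodes = {n for e in edges for n in e}
    indeg = {n: 0 for n in nodes}
    out = {n: [] for n in nodes}
    for x, y in edges:
        out[x].append(y)
        indeg[y] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while ready:
        n = ready.pop()
        seen += 1
        for y in out[n]:
            indeg[y] -= 1
            if indeg[y] == 0:
                ready.append(y)
    return seen < len(nodes)

# The two-lock example: thread 1 takes A then B, thread 2 takes B then A.
print(has_cycle(lock_order_edges(["A", "B"]) |
                lock_order_edges(["B", "A"])))        # -> True (deadlock possible)

# Same order in both threads: no cycle, so no ordering deadlock.
print(has_cycle(lock_order_edges(["A", "B", "C"]) |
                lock_order_edges(["A", "B", "C"])))   # -> False
```

The seven-lock example from the text (A,B,C,D,E,F,G against A,G,C,D,E,F,B) is caught the same way: B is held before G in one thread and G before B in the other, which closes a cycle.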


You see why good patterns - that look like theorems -
are also very powerful?


That's what we call a good pattern - it's like a theorem,


and it looks like this: IF the predicates are met THEN something
follows...


There are also good patterns - like theorems - to follow for false
sharing etc.


Do you understand why I and others also follow good patterns
- that look like theorems - ?



MC also wrote:
> Dear all,
> Following on the post of Srinu. I am very beginner in multithreaded
> programming. I have been looking for a good book to read about the
> basic concepts of mutithreading, I recently bought Programming with
> POSIX threads- by Butenhof. I didnt quite like that book, what I am
> looking for a is a book which explains multithreaded programming
> conceptually and also gives good concrete examples. Can anybody please
> suggest me a book.

> Thanks,



I will just give some advice...

To learn more about parallel programming, just read the old posts
in comp.programming.threads and the other forums that discuss
parallel programming... read them carefully - as I did myself -
and try to use LOGIC and REASON about them and try to EXTRACT
the good patterns of parallel programming from them and understand
them...


Also, try to look at parallel code - for example:


http://pages.videotron.com/aminer/ and other parallel toolkits...

http://pages.videotron.com/aminer/threadpool.htm

http://pages.videotron.com/aminer/parallelhashlist/queue.htm

and read inside my parallel code:

http://pages.videotron.com/aminer/


and the parallel code of others...


and try to 'EXTRACT' and 'UNDERSTAND' those good patterns
to follow...


Good patterns in parallel programming are like theorems:
IF the predicates are met THEN something follows...


As an example, take the following page:


http://blogs.msdn.com/visualizeparallel/


As I said before, good patterns in parallel programming
are like theorems: IF the predicates are met THEN something
follows... So, read this:


"It is critical to be able to spot data parallelism when you see it
because data parallel algorithms allow the developer to more easily
construct efficient and safe code. As opposed to the more complex
solutions employed against task parallelism, data parallelism allows
the programmer to perform the same operation on each piece of data
concurrently without concern for race conditions and consequently,
the need for synchronization, which results in significant
performance overhead. Arguably, data parallel algorithms perform
better (due to the lack of synchronization) and are easier for the
developer to implement."


So, tell me MC, what can you EXTRACT from this?


You can extract something like a theorem to follow, like this:


[1] IF your algorithm exhibits much more data parallelism THEN
it will be much more efficient - it will perform better - due to
the lack of synchronization...


Hence, if you follow theorem [1], it will be a good pattern
to follow in parallel programming.


Do you understand now?


You have to be smart and start to extract those theorems
- good patterns to follow... - from all the programming code,
articles and forums etc.


So, this theorem that I have extracted from the page is important,
and it's a good pattern to follow...


How can this theorem be understood by using mathematical
equations?


Easy...


If your algorithm exhibits more data parallelism THEN the serial
proportion S - in percentage - will be smaller in the Amdahl
equation 1 / (S + (P/N)) - N is the number of cores/processors -
hence the algorithm will scale better...
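The equation is easy to play with. Here is a small Python sketch of Amdahl's law as used above; the serial fractions in it are made-up values, chosen only to show the trend:

```python
# Amdahl's law as used in the text: speedup = 1 / (S + P/N), where
# S is the serial fraction, P = 1 - S the parallel fraction, and
# N the number of cores. The values of S below are invented purely
# for illustration.

def amdahl_speedup(s, n):
    p = 1.0 - s              # parallel fraction
    return 1.0 / (s + p / n)

for s in (0.5, 0.1, 0.01):   # less and less serial work
    print(f"S = {s:<4}  N = 4: {amdahl_speedup(s, 4):5.2f}x   "
          f"N = 64: {amdahl_speedup(s, 64):6.2f}x")
```

Shrinking S - which is what more data parallelism and less synchronization do - is what raises the achievable speedup, especially as N grows.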


And as you have noticed, this is what theorem [1] stated:


" [1] IF your algorithm exhibit much more data parallelism THEN
it will be much more effcient - it will perform better- due to
the lack of sychronization..."


That's the same for the other theorems: on deadlocks, false sharing
etc.


You have to be smart and start to extract those theorems
- good patterns to follow... - from all the programming code,
articles and forums etc.


Skybuck also wrote:
> What if people wanna roll there own versions ? ;)
> They would much better be "served" by algorithms/pseudo
> code than real code which could be system/language specific ;)


It's easy to EXTRACT algorithms from Object Pascal code...

Look for example inside pbzip.pas; I am using this in the
main body of my program:


name:='msvcr100.dll';


It's the 'test' file that I am using - it's inside the
zip file also - and once you compile and execute pbzip.pas it
will generate a file msvcr100.dll.bz. And as you have
noticed, I am using a - portable - compound filesystem;
look at ParallelStructuredStorage.pas inside the zip file.


After that I am opening it with:


fstream1:=TFileStream.create(name, fmOpenReadWrite);


and I am reading chunks of streams and 'distributing' them
to my Thread Pool Engine to be compressed - in parallel -
by the myobj.BZipcompress method; look at:


for i := 0 to e do
begin
  if (i = e) and (r = 0) then break;
  stream1 := TMemoryStream.Create;
  if (r > 0) and (i = e) then
    stream1.CopyFrom(fstream1, r)   // last, partial chunk
  else
    stream1.CopyFrom(fstream1, d);  // full chunk of size d
  stream1.Position := 0;
  obj := TJob.Create;
  obj.stream := stream1;
  obj.directory := directory;
  obj.compressionlevel := 9;
  obj.streamindex := IntToStr(i);
  obj.r := r;
  obj.number := e;
  TP.execute(myobj.BZipcompress, pointer(obj));
end;


I am doing the same thing in pzlib.pas...


http://pages.videotron.com/aminer/


And after that I am reading those compressed files
from the compound filesystem - look inside pzlib.pas -
and I am 'distributing' those compressed files, as streams,
to my Thread Pool Engine to be decompressed - look inside
pzlib.pas - by the myobj.Zlibdecompress method; look at:


------------------------------------


names := TStringList.Create;
storage.foldernames('/', names);
len := StrToInt(names[0]);

if r = 0 then len := len + 1;

for i := 0 to len do
begin
  if (i = len) and (r = 0) then break;
  obj := TJob.Create;
  obj.directory := directory;
  obj.streamindex := IntToStr(i);
  obj.index := i;
  obj.number := e;
  obj.r := r;
  TP.execute(myobj.Zlibdecompress, pointer(obj));
end;

--------------------------------------------------


I wrote:
> And as you have noticed i am using a portable
> compound filesystem, look at ParallelStructuredStorage.pas
> inside the zip file.


Why?

Because you can parallel-compress your files, store
those .zlb (zlib) or .bz (bzip) compressed files in a
portable compound filesystem, and after that you can
distribute your compound filesystem...


And of course you can uncompress files - or all the
content of your compound filesystem - from your compound
filesystem.


And of course that's easy with Parallel Compression 1.0 :)


Skybuck wrote:
>[...] an algorithm really ;)
>What's so special about it ?


Parallel bzip and zlib are not just pbzip.pas and pzlib.pas;
the parallel bzip and zlib algorithm includes my Thread Pool Engine
algorithm + my Parallel Queue algorithm...

I am calling it an algorithm because it uses a finite number of
instructions and rules to resolve a problem - parallel compression
and decompression.


Do you understand?


And as I said, you can parallel-compress your files, store
those .zlb (zlib) or .bz (bzip) compressed files in a portable
compound filesystem, and after that you can distribute your
compound filesystem... And of course you can uncompress files
- or all the content of your compound filesystem - from your
compound filesystem.



Skybuck wrote
> I see a whole bunch of pascal/delphi files thrown together,
>a whole bunch of dll's and god-forbid ms*.dll files...


Those DLLs are mandatory for now...

and you can easily write a batch file etc. and reorganize...


> I see some "test programs" which are described as "modules" which they
> simply are not...


It's VERY easy to convert pzlib.pas and pbzip.pas
to units, and that's what I will do in the next step...

Parallel Compression 1.0 will still be enhanced in the future...


> It shouldn't be that hard... set your editor to "use tab character" (turn
> tabs to spaces off)


I am not using the Delphi editor, just notepad.exe or write.exe...
and I am compiling from the DOS prompt...


>So far it seems like you are inserting your
>threads/syncronizations
>everywhere in single-thread-design algorithms ?


No, it's not just inserting threads/synchronizations...

I have reasoned - and used logic. Look for example at
parallelhashlist.pas inside the zip file: I am using MREWs
(multi-read exclusive-write locks) etc. carefully in the right
places, and I have also modified the serial code a little bit...
it uses a hash-based method, with an array of MREWs...


The Thread Pool Engine, I have constructed it from zero
- and I have used my ParallelQueue - an efficient lock-free queue -
etc...


The parallel bzip and zlib, I have constructed them by also
using my Thread Pool Engine construction etc...


That's not just 'inserting' threads/synchronizations.


Skybuck wrote:
>But my estimate would be that for now on low core systems... the
>"compression" would take far more time...


No. pbzip.pas gave for example a 3.3x speedup on 4 cores...

http://pages.videotron.com/aminer/ParallelCompression/parallelbzip.htm



Skybuck wrote:
> [...] or anything extraordinary...


Don't be stupid Skybuck.

It's in fact:


1 - Useful
2 - A good thing for educational purposes.


Skybuck wrote:
>The thread pool concept is retarded.
>Any good delphi programmer is capable of creating an array of threads.
>So my advice to you:
>1. Delete your thread pool, because it's junk.
>2. Write a serious/big application that uses many threads,
>and simply derive from TThread to see how easy it is.


How can you be so stupid?

My Thread Pool Engine is not just an array of threads:
it uses efficient lock-free queues - for example the lock-free
ParallelQueue - for each worker thread, and it uses work-stealing
- for more efficiency - etc...
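The general shape of per-worker queues plus work-stealing can be sketched in a few lines of Python - this is an illustration of the pattern in general, NOT the actual Thread Pool Engine code, and every name in it is invented:

```python
# A generic sketch of per-worker queues + work-stealing. Each worker
# pops from its own deque; when that deque is empty it steals from
# the other end of another worker's deque.

import random
import threading
from collections import deque

class WorkStealingPoolSketch:
    def __init__(self, workers=4):
        self._deques = [deque() for _ in range(workers)]
        self._locks = [threading.Lock() for _ in range(workers)]

    def submit(self, worker_id, task):
        with self._locks[worker_id]:
            self._deques[worker_id].append(task)

    def _next_task(self, worker_id):
        # 1) try our own deque (LIFO end)
        with self._locks[worker_id]:
            if self._deques[worker_id]:
                return self._deques[worker_id].pop()
        # 2) otherwise try to steal from a victim (FIFO end)
        victims = [i for i in range(len(self._deques)) if i != worker_id]
        random.shuffle(victims)
        for v in victims:
            with self._locks[v]:
                if self._deques[v]:
                    return self._deques[v].popleft()
        return None

    def run_worker(self, worker_id):
        task = self._next_task(worker_id)
        while task is not None:
            task()
            task = self._next_task(worker_id)

results = []
pool = WorkStealingPoolSketch(workers=2)
for i in range(6):
    # all tasks land on worker 0's deque; worker 1 can only get work by stealing
    pool.submit(0, lambda i=i: results.append(i))

threads = [threading.Thread(target=pool.run_worker, args=(w,))
           for w in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))   # -> [0, 1, 2, 3, 4, 5]
```

A real engine would of course use lock-free deques instead of one lock per deque; the sketch only shows why an idle worker never sits still while another worker's queue has work.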


And it eases the work for you - you can 'reuse' the TThreadPool
class... - and it is very useful...

Please read again:


http://pages.videotron.com/aminer/threadpool.htm


Skybuck wrote in alt.comp.lang.borland-delphi:
>> My Thread Pool Engine is not just an array of threads,
> To me it is.


You really don't know what you are talking about..

The principal threat to scalability in concurrent applications
is the exclusive resource lock.


And there are three ways to reduce lock contention:


1- Reduce the duration for which locks are held


2- Reduce the frequency with which locks are requested


or


3- Replace exclusive locks with coordination mechanisms that
permit greater concurrency.


With low, moderate AND high contention, my ParallelQueue
offers better scalability - and I am using it inside my
Thread Pool Engine.


Because my ParallelQueue is using a hash-based method
- and lock striping - and using just a LockedInc(), I am
REDUCING the duration for which locks are held AND REDUCING
the frequency with which locks are requested; hence I am
REDUCING the contention A LOT, so it's very efficient.
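The lock-striping idea itself can be sketched in Python - this is a generic illustration of the hash-plus-striping pattern, not my actual ParallelQueue code, and every name in it is invented:

```python
# A generic sketch of hash-based lock striping: the structure is
# split into stripes, each guarded by its own lock, and an item's
# hash picks the stripe, so threads that hit different stripes
# never contend with each other.

import threading
from collections import deque

class StripedQueueSketch:
    def __init__(self, stripes=8):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._queues = [deque() for _ in range(stripes)]

    def _stripe(self, key):
        return hash(key) % len(self._locks)

    def push(self, key, item):
        i = self._stripe(key)
        with self._locks[i]:        # only one stripe is locked
            self._queues[i].append(item)

    def pop(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            return self._queues[i].popleft() if self._queues[i] else None

q = StripedQueueSketch()
q.push("a", 1)
q.push("b", 2)
print(q.pop("a"))   # -> 1
```

With S stripes, two random threads collide on the same lock roughly 1/S of the time, which is exactly the "reduce the frequency with which locks are requested" lever from the list above.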


And as I stated before, this is a law or theorem to apply:


[3] IF there is LESS contention THEN the algorithm will
scale better. Due to the fact that S (the serial part)
becomes smaller with less contention, and as N becomes bigger,
the result - the speed of the program/algorithm... - of
Amdahl's equation 1/(S+(P/N)) becomes bigger.


It's why my ParallelQueue has scored 7 million pop()
transactions per second... better than flqueue and RingBuffer.


look at: http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm


Also my Thread Pool Engine uses efficient lock-free queues -
for example the lock-free ParallelQueue - for each worker thread
- to reduce and minimize the contention - and it uses work-stealing,
so my Thread Pool Engine is very efficient...


And it eases the work for you - you can 'reuse' the TThreadPool
class... - and it is very useful...


So, don't be stupid skybuck...


http://pages.videotron.com/aminer/


I wrote:
> Because my ParallelQueue is using a hash-based method
> - and lock striping - and using just a LockedInc(), I am
> REDUCING the duration for which locks are held AND REDUCING
> the frequency with which locks are requested; hence I am
> REDUCING the contention A LOT, so it's very efficient.


With low, moderate AND high contention, my ParallelQueue
offers better scalability - and I am using it inside my
Thread Pool Engine.

And as you have noticed, I am using low to medium contention
in the following test:


http://pages.videotron.com/aminer/parallelqueue/parallelqueue.htm


But I predict that under HIGH contention the push() and pop() will
score even better than that...


Why?


Because my ParallelQueue is using a hash-based method
- and lock striping - and using just a LockedInc(), I am
REDUCING the duration for which locks are held AND REDUCING
the frequency with which locks are requested; hence I am
REDUCING the contention A LOT, so it's very efficient.


And as I stated before, this is a law or theorem to apply:


[3] IF there is LESS contention THEN the algorithm will
scale better. Due to the fact that S (the serial part)
becomes smaller with less contention, and as N becomes bigger,
the result - the speed of the program/algorithm... - of
Amdahl's equation 1/(S+(P/N)) becomes bigger.


------------------------

Hello again,


Now, as I have stated before:


[3] IF there is LESS contention THEN the algorithm will
scale better. Due to the fact that S (the serial part)
becomes smaller with less contention, and as N becomes bigger,
the result - the speed of the program/algorithm... - of
Amdahl's equation 1/(S+(P/N)) becomes bigger.


And, as you have noticed, I followed this theorem [3] when
I constructed my Thread Pool Engine etc...


Now there is another theorem that I can state like this:


[4] You have latency and bandwidth, so IF you use efficiently
one or both of them - latency and bandwidth - THEN your algorithm
will be more efficient.


That is why you must not start too many threads with my
Thread Pool Engine, so that you will not context-switch a lot;
when you context-switch a lot, the latency will grow, and
this is not good for efficiency...


You have to be smart.


And as i have stated before:


IF you follow and base your reasoning on those theorems
- or laws or true propositions or good patterns, like theorems [1],
[2], [3], [4]... - THEN you will construct a model that will be
much more CORRECT and EFFICIENT.


Take care...

-----------------------------

Hello again,


Sorry for my English, but I will continue to explain - my ideas
etc. - using logic and reasoning...


As you already know, we have those two notions:


'Time' - we have time because there is movement of matter -


and


'Space'


And we have those two notions that we call 'Correctness' and
'Efficiency'


And, as you have noticed, I have stated the following theorems...


[1] IF your algorithm exhibits much more data parallelism THEN
it will be much more efficient.


[2] IF two or more processes or threads use the same critical
sections THEN they - the processes or threads - must take
them in the same order to avoid deadlock - in the system.


[3] IF there is LESS contention THEN the algorithm will
scale better. Due to the fact that S (the serial part)
becomes smaller with less contention, and as N becomes bigger,
the result - the speed of the program/algorithm... - of
Amdahl's equation 1/(S+(P/N)) becomes bigger.


[4] You have latency and bandwidth, so IF you use efficiently
one or both of them - latency and bandwidth - THEN your algorithm
will be more efficient.


etc.


Why am I calling them theorems?


You can also call them rules or true propositions, laws ...


Now I can 'classify' theorem [2] in the set that I call
'correctness', and it states something about correctness...


And theorems [1], [3], [4] in the set that I call 'efficiency',
and they state something about efficiency.


But you have to be smart now..


If you have noticed, theorems [2] and [3] are in fact
of the same form as theorem [4].


But why am I calling them theorems?


You can call them rules, laws... if you want.


And as i have stated before:


IF you follow and base your reasoning on those theorems
- or laws or true propositions or good patterns - like rules or
theorems [1], [2], [3], [4]... - THEN you will construct a model
that will be much more CORRECT and EFFICIENT.


It is one of my preferred methodologies in programming.


Sincerely,
Amine Moulay Ramdane

-----------------------


Hello,


I am still thinking and using logic...


I can add the following rules also:


[5] IF you are using a critical section or spinlock and there is
high contention - with many threads - on them THEN there is a
possibility of a lock convoy. Due to the fact that the thread
entering the spinlock or critical section may be context-switched
out, this will add to the service time - and to the S (serial part)
of Amdahl's equation - and this will raise the contention and
create the possibility of a lock convoy and lead to bad scalability.


We can alleviate the problem in [5] by using a Mutex or a Semaphore
around the critical section or the spinlock...


Another rule now..


[6] IF there is contention on a lock - a critical section... -
and inside the locked section you are using I/O - for example
logging a message to a file - THEN the calling thread may
block on the I/O and the operating system will deschedule
the blocked thread until the I/O completes; this situation
will lead to more context switching, and therefore to an increased
service time, and longer service times, in this case, mean
more lock contention, and more lock contention means bad
scalability.


There is also false sharing etc.


IF you follow and base your reasoning on those theorems
- or laws or true propositions or good patterns - like rules or
theorems [1], [2], [3], [4], [5], [6]... - THEN you will construct
a model that will be much more CORRECT and EFFICIENT.


And it is one of my preferred methodologies in programming.


I will try to add more of those rules , theorems , laws...
next time...


Sincerely,
Amine Moulay Ramdane.


From: aminer on

Hello,

Now, if you have noticed, I am using 'logic',
and it is logic that invented mathematics...


As an example, in logic we have the following
law and tautology:

((p -> q) and (not(p) -> q )) is equivalent to q
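This tautology can be checked mechanically with a four-row truth table; a tiny Python sketch:

```python
# Truth-table check of the tautology above:
# ((p -> q) and (not(p) -> q)) is equivalent to q.
# In Python, (a -> b) can be written as (not a) or b.

from itertools import product

def implies(a, b):
    return (not a) or b

for p, q in product([False, True], repeat=2):
    lhs = implies(p, q) and implies(not p, q)
    assert lhs == q          # equal on every row
print("equivalent on all four truth assignments")
```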


Now, like in logic, I have followed the same proof by deduction,
and as an example I said:

"Because my ParallelQueue is using a hash-based method
- and lock striping - and using just a LockedInc(), I am
REDUCING the duration for which locks are held AND
REDUCING the frequency with which locks are requested;
hence I am REDUCING the contention A LOT, so it's very efficient.

And as I stated before, this is a law or theorem to apply:


[3] IF there is LESS contention THEN the algorithm will
scale better. Due to the fact that S (the serial part)
becomes smaller with less contention, and as N becomes bigger,
the result - the speed of the program/algorithm... - of
Amdahl's equation 1/(S+(P/N)) becomes bigger.


It's why my ParallelQueue has scored 7 million pop()
transactions per second... better than flqueue and RingBuffer."



So, as you have noticed, I am using Amdahl's law to
prove theorem [3]; that's the same as in a proof by deduction...


IF you follow and base your reasoning on those theorems
- or laws or true propositions or good patterns - like rules or
theorems [1], [2], [3], [4], [5], [6]... - THEN you will construct
a model that will be much more CORRECT and EFFICIENT.

And it is one of my preferred methodologies in programming.


Sincerely,
Amine Moulay Ramdane


From: aminer on

Hello,


I have stated the following theorems and rules:


[1] IF your algorithm exhibits much more data parallelism THEN
it will be much more efficient.


[2] IF two or more processes or threads use the same critical
sections THEN they - the processes or threads - must take
them in the same order to avoid deadlock - in the system.

[3] IF there is LESS contention THEN the algorithm will
scale better. Due to the fact that S (the serial part)
becomes smaller with less contention, and as N becomes bigger,
the result - the speed of the program/algorithm... - of
Amdahl's equation 1/(S+(P/N)) becomes bigger.

[4] You have latency and bandwidth, so IF you use efficiently
one or both of them - latency and bandwidth - THEN your algorithm
will be more efficient.

[5] IF you are using a critical section or spinlock and there is
high contention - with many threads - on them THEN there is a
possibility of a lock convoy. Due to the fact that the thread
entering the spinlock or critical section may be context-switched
out, this will add to the service time - and to the S (serial part)
of Amdahl's equation - and this will raise the contention and
create the possibility of a lock convoy and lead to bad scalability.


We can alleviate the problem in [5] by using a Mutex or a Semaphore
around the critical section or the spinlock...


[6] IF there is contention on a lock - a critical section... -
and inside the locked section you are using I/O - for example
logging a message to a file - THEN the calling thread may
block on the I/O and the operating system will deschedule
the blocked thread until the I/O completes; this situation
will lead to more context switching, and therefore to an increased
service time, and longer service times, in this case, mean
more lock contention, and more lock contention means bad
scalability.

etc.


So, as in logic, you can reason by deduction like this:

If [1] AND [3] apply THEN your algorithm is much more efficient.

If the ordering in [2] is not respected AND [5] applies THEN you
can have a deadlock
and
the possibility of a lock convoy
and bad scalability.


If the situation in [6] arises THEN you have a bad scalability
problem.

etc.

IF you follow and base your reasoning on those theorems
- or laws or true propositions or good patterns - like rules or
theorems [1], [2], [3], [4], [5], [6]... - THEN you will construct
a model that will be much more CORRECT and EFFICIENT.


And it is one of my preferred methodologies in programming.


I will try to add more of those rules , theorems , laws...
next time...


Sincerely,
Amine Moulay Ramdane.


From: aminer on

I wrote:
>[6] If there is contention on a lock - a critical section ... -
> and inside the locked sections you are using the I/O - example
> logging a message to a file - this will lead the calling thread
> to block on the I/O and the operating system will deschedule
> the blocked thread until the I/O completes, thus this situation
> will lead to more context switching, and therefore to an increased
> service time , and longer service times, in this case, means
> more lock contention, and more lock contention means a bad
> scalability.


You can alleviate problem [6] by using for example a lock-free queue
- lock-free ParallelQueue or... - with multiple consumers pushing
the messages and one worker thread doing the job - logging the
messages to a file - ...


Sincerely,
Amine Moulay Ramdane.





From: aminer on

I wrote:
> You can elevate problem [6] by using for example a lock-free queue
> - lock-free ParallelQueue or... - , with mutiple consumers pushing

I meant multiple 'producers' pushing the messages - to the lock-free
queue - and one consumer/worker doing the job - logging the messages
to a file - ...


> the
> messages and one worker thread doing the job - logging the messages
> to a file - ...
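That fix can be sketched in Python, using the standard library's queue.Queue as a stand-in for a lock-free queue (the log-file name and all other names here are invented for the example):

```python
# Sketch of the fix for rule [6]: producer threads never do file
# I/O themselves; they only push messages onto a queue, and a
# single worker thread drains the queue and does all the I/O.

import queue
import threading

log_queue = queue.Queue()
STOP = object()                  # sentinel to shut the worker down

def logger_worker(path):
    """The single consumer: drains the queue and writes the file."""
    with open(path, "w") as f:
        while True:
            msg = log_queue.get()
            if msg is STOP:
                break
            f.write(msg + "\n")

def log(msg):
    """Called from any producer thread: no blocking file I/O here."""
    log_queue.put(msg)

worker = threading.Thread(target=logger_worker, args=("app.log",))
worker.start()

producers = [threading.Thread(target=log, args=(f"message {i}",))
             for i in range(4)]
for t in producers:
    t.start()
for t in producers:
    t.join()

log_queue.put(STOP)              # all producers done: stop the worker
worker.join()
```

The producers only pay the cost of an enqueue; the slow file I/O happens on one thread, outside any lock the producers care about.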


