From: Richard Heathfield on 14 Mar 2010 04:52

mike wrote:
> In article <yKWdnfd7BvIrFgjWnZ2dnUVZ8txi4p2d(a)bt.com>,
> rjh(a)see.sig.invalid says...
>> Noob wrote:
>>> Richard Heathfield wrote:
>> <snip>
>>
>>>> We calculate this failure probability F by
>>>> raising (1-p) to the power of the number of match attempts:
>>>>
>>>> F = (1-p)^2 = 0.99609375^2 = 0.9922027587890625
>>> I don't think so. The two attempts are not independent.
>> It's a fair cop. Perhaps you could explain how to perform the
>> calculation correctly?
>>
> Unfortunately, the calculation depends on the specific text*. So for
> your binary version of Hamlet (or was it Othello) there is no simple
> answer.
>
> * to prove this, imagine that the works of Shakespeare can be expressed
> as a two character binary string (maybe the Condensed Books version)

Readers Digest, eat your heart out. :-)

> and the random monkey has typed three symbols. If the condensed string
> is '00' then there is a 5/8 chance that it will not appear in a random
> three character string, but if the string is '01' then there is a 1/2
> chance it will not be found in a random three character string.

ITYM 6/8 rather than 5/8? But anyway, yes, fair enough. Let me, then,
ask the question a different way:

My calculation, albeit flawed, might reasonably be said to give a
ballpark figure. Is it possible to come up with a reasonably simple
calculation that gives at least an order of magnitude reduction in the
error level?

-- 
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www. +rjh@
"Usenet is a strange place" - dmr 29 July 1999
Sig line vacant - apply within
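[mike's dependence argument is easy to check by brute force. The sketch below is not from the thread; the helper name `avoid_count` is mine. It simply enumerates all three-character binary strings and counts the ones that miss each pattern.]

```python
from itertools import product

def avoid_count(pattern, length):
    """Count binary strings of the given length that do NOT
    contain `pattern` as a substring (brute-force enumeration)."""
    strings = ("".join(bits) for bits in product("01", repeat=length))
    return sum(1 for s in strings if pattern not in s)

print(avoid_count("00", 3))  # 5 -> a 5/8 chance of missing '00'
print(avoid_count("01", 3))  # 4 -> a 4/8 = 1/2 chance of missing '01'
```

[This confirms the 5/8 and 1/2 figures in mike's footnote: the two patterns have different miss probabilities in the same number of characters, so overlapping match attempts are indeed not independent.]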
From: Ashton K on 14 Mar 2010 17:36

In comp.programming jellybean stonerfish <stonerfish(a)geocities.com> wrote:
> On Wed, 10 Mar 2010 02:29:05 -0800, Nick Keighley wrote:
>
>> On 8 Mar, 14:44, jellybean stonerfish <stonerf...(a)geocities.com> wrote:
>>> On Mon, 08 Mar 2010 01:56:54 -0800, Nick Keighley wrote:
>>
>>> > shakespere didn't generate his plays by random means.
>>>
>>> It was random that there even was a Shakespeare.
>>
>> only if you use an odd definition of "random"
>
> I used a pseudo-random definition of random.
>

I'll respond to this as I do in real life.

..... Jerk.

--Ashton
From: mike on 14 Mar 2010 18:58

In article <EL2dnXUKpJB6PgHWnZ2dnUVZ7qydnZ2d(a)bt.com>,
rjh(a)see.sig.invalid says...
> mike wrote:
> > In article <yKWdnfd7BvIrFgjWnZ2dnUVZ8txi4p2d(a)bt.com>,
> > rjh(a)see.sig.invalid says...
> >> Noob wrote:
> >>> Richard Heathfield wrote:
> >> <snip>
> >>
> >>>> We calculate this failure probability F by
> >>>> raising (1-p) to the power of the number of match attempts:
> >>>>
> >>>> F = (1-p)^2 = 0.99609375^2 = 0.9922027587890625
> >>> I don't think so. The two attempts are not independent.
> >> It's a fair cop. Perhaps you could explain how to perform the
> >> calculation correctly?
> >>
> > Unfortunately, the calculation depends on the specific text*. So for
> > your binary version of Hamlet (or was it Othello) there is no simple
> > answer.
> >
> > * to prove this, imagine that the works of Shakespeare can be expressed
> > as a two character binary string (maybe the Condensed Books version)
>
> Readers Digest, eat your heart out. :-)
>
> > and the random monkey has typed three symbols. If the condensed string
> > is '00' then there is a 5/8 chance that it will not appear in a random
> > three character string, but if the string is '01' then there is a 1/2
> > chance it will not be found in a random three character string.
>
> ITYM 6/8 rather than 5/8? But anyway, yes, fair enough. Let me, then,
> ask the question a different way:

Presumably you missed one of '000' or '001' or '010'.

>
> My calculation, albeit flawed, might reasonably be said to give a
> ballpark figure. Is it possible to come up with a reasonably simple
> calculation that gives at least an order of magnitude reduction in the
> error level?
>

Imagine you are searching for the substring "to be or not to be" and at
some point you reach a substring "...zzzto be or not to"; you are
obviously a bit more optimistic of finding the substring in the next few
characters than you would be if your latest characters were "...zzzto".
In the first case you have 15 out of 18 characters correct, whereas in
the second case you only have 2 out of 18.

You can use a Markov-chain type calculation to determine the probability
of switching between p out of n characters correct and q out of n
correct. As a simple example, looking at the two letter binary
Shakespeare you can build a matrix showing the chance of moving from p
correct to q correct.

E.g. searching for '00'

M = [1/2 1/2 0  ]    (with fixed-width fonts)
    [1/2 0   1/2]
    [0   0   1  ]

where M(p,q) (p = row, q = column) = the probability of transitioning
between p correct and q correct.

So if you have so far found the string '0' then you have p=1 letters
correct, and row 1 of the matrix (the matrix is zero indexed so it
starts at (0,0) and goes to (2,2)) shows that you have 1/2 chance of
ending up in column 0 (i.e. you get a '1' next, so your string is now
'01' and you have to start again from 0 correct), and a 1/2 chance of
ending up in column 2 (i.e. you get a '0', so now your string is '00'
with 2 correct letters). Note that the 1 in position (2,2) just means
that if you already have 2 characters correct (row 2) then you have
probability 1 (100%) of transitioning to 2 characters correct
(column 2). Hope this all makes sense.

Now, the probability of going from some p correct to some q correct in n
turns (i.e. in reading the next n characters of your search string) is
just M^n. Starting from the beginning, with 0 characters correct, we can
just look at the value of (M^n)(0,2) to find the probability of getting
2 characters correct (somewhere within the n search characters).
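[The matrix mike describes can be built mechanically for any search string. The sketch below is not from the thread (the name `transition_matrix` is mine): for each number of characters correct, and each possible next character, the new state is the longest suffix of what you have now read that is still a prefix of the pattern — the same fallback idea used in the Knuth-Morris-Pratt failure function.]

```python
def transition_matrix(pattern, alphabet="01"):
    """Markov transition matrix over 'characters correct' states
    0..len(pattern), with len(pattern) as the absorbing 'found' state."""
    m = len(pattern)
    M = [[0.0] * (m + 1) for _ in range(m + 1)]
    p_char = 1.0 / len(alphabet)
    for p in range(m):                  # non-absorbing states
        for c in alphabet:
            s = pattern[:p] + c         # what we've effectively read
            # longest suffix of s that is a prefix of the pattern
            q = next(k for k in range(len(s), -1, -1)
                     if pattern.startswith(s[len(s) - k:]))
            M[p][q] += p_char
    M[m][m] = 1.0                       # 'found' is absorbing
    return M

print(transition_matrix("00"))
# [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.0, 1.0]]
```

[For '00' this reproduces mike's M exactly. For a self-overlapping pattern like '001000' it gives M[5][3] = 1/2 — matching the p=5 to q=3 fallback mike mentions later in the thread.]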
So with M as above,

M^1 = 0.500 0.500 0.000
      0.500 0.000 0.500
      0.000 0.000 1.000

M^2 = 0.500 0.250 0.250
      0.250 0.250 0.500
      0.000 0.000 1.000

M^3 = 0.375 0.250 0.375
      0.250 0.125 0.625
      0.000 0.000 1.000

M^4 = 0.313 0.188 0.500
      0.188 0.125 0.688
      0.000 0.000 1.000

M^8 = 0.133 0.082 0.785
      0.082 0.051 0.867
      0.000 0.000 1.000

etc., and you now know that in a 1, 2, 3, 4, 8 character random binary
string you have probability 0.000, 0.250, 0.375, 0.500, 0.785 of
'finding' the substring '00'.

You can do the same thing with the transition matrix for '01' which, if
I am not mistaken, will be

M = [1/2 1/2 0  ]
    [0   1/2 1/2]
    [0   0   1  ]

or extend the same idea to longer and/or non binary strings.

Cheers,

Mike
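[These powers are easy to reproduce; a minimal check in Python follows (the helper names are mine, not from the thread). Two details worth pinning down: the exact value of M^8(0,2) is 201/256 = 0.78515625, and for '01' row 1 of the matrix must be [0 1/2 1/2] — with one character correct, another '0' keeps you at one correct, while a '1' completes the string.]

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(M, n):
    # repeated multiplication is fine at this size
    R = [[float(i == j) for j in range(len(M))] for i in range(len(M))]
    for _ in range(n):
        R = matmul(R, M)
    return R

M00 = [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.0, 1.0]]  # pattern '00'
for n in (1, 2, 3, 4, 8):
    print(n, round(matpow(M00, n)[0][2], 3))   # 0.0, 0.25, 0.375, 0.5, 0.785

# Matrix for '01': from state 1 (matched '0'), another '0' keeps you
# at state 1 and a '1' completes the pattern.
M01 = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
print(round(matpow(M01, 3)[0][2], 3))  # 0.5, i.e. a 1/2 chance in 3 chars
```

[The last line agrees with the 1/2 figure for '01' in three characters quoted earlier in the thread.]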
From: Richard Heathfield on 14 Mar 2010 20:14

mike wrote:
> In article <EL2dnXUKpJB6PgHWnZ2dnUVZ7qydnZ2d(a)bt.com>,
> rjh(a)see.sig.invalid says...
>> mike wrote:

<snip>

>>> and the random monkey has typed three symbols. If the condensed string
>>> is '00' then there is a 5/8 chance that it will not appear in a random
>>> three character string, but if the string is '01' then there is a 1/2
>>> chance it will not be found in a random three character string.
>> ITYM 6/8 rather than 5/8? But anyway, yes, fair enough. Let me, then,
>> ask the question a different way:
>
> Presumably you missed one of '000' or '001' or '010'.

No, because '010' doesn't contain '00' as a substring. Actually, it was
'100' that I missed! But yes, I missed one, obviously.

>> My calculation, albeit flawed, might reasonably be said to give a
>> ballpark figure. Is it possible to come up with a reasonably simple
>> calculation that gives at least an order of magnitude reduction in the
>> error level?
>>
> Imagine you are searching for the substring "to be or not to be" and at
> some point you reach a substring "...zzzto be or not to", you are
> obviously a bit more optimistic of finding the substring in the next few
> characters than you would be if your latest characters were "...zzzto".

(a) a miss is as good as a mile; (b) I would actually start comparing at
the /end/ of the substring - cf Boyer-Moore; (c) cf Bob Newhart: "Hey,
Harry, I think Station 63 has something... I think this is famous or
something... 'To be or not to be, that is the gzornenplatz'".

> In the first case you have 15 out of 18 characters correct, whereas in
> the second case you only have 2 out of 18.
>
> You can use a Markov-chain type calculation to determine the probability
> of switching between p out of n characters correct and q out of n
> correct.

You lost me already. Given n characters, x of them are correct, with
0 <= x <= n. If p != q, then either p == x or q == x or (p != x && q != x).
You don't get to switch from p to q correct by some probability. I am
guessing that you meant to write something different, but (if so) I
can't glean your intent from what you've actually written. But I'll keep
reading in case it becomes clearer...

> As a simple example, looking at the two letter binary Shakespeare you
> can build a matrix showing the chance of moving from p correct to q
> correct.
>
> E.g. searching for '00'
>
> M = [1/2 1/2 0  ]    (with fixed-width fonts)
>     [1/2 0   1/2]
>     [0   0   1  ]
>
> where M(p,q) (p = row, q = column) = the probability of transitioning
> between p correct and q correct.

Okay, I'm with you. q = p+delta, and you're calculating the probability
that you don't "go off track", so to speak, on the way from one to the
other. But if I've understood you correctly and if you're right (and I
lack the mathematical sophistication to allow me to determine whether or
not I have and whether or not you are), we run into a problem.
Shakespeare is around 40 Megabits; how big is the transition matrix
going to be? If I understand you correctly, it will be 40,000,000 by
40,000,000 entries. Just storing a matrix that size is going to be a
huge problem, and multiplying two MxM matrices requires MxM
multiplication operations. We're going to be talking bignums to keep any
degree of precision, and at the bignum level multiplication is O(xy)
where x and y are the digits in the two multiplicands. So, for one
transition, we have to perform 1,600,000,000,000,000 O(xy) operations.
Then, for 40,000,000 transitions, we have to raise that whole mess to
the 40,000,000th power. Using your method, the whole question changes to
whether your suggested program will run to completion before the monkeys
either hit the jackpot or give up in disgust. Whether it does in fact
yield the required order of magnitude error reduction seems, on the face
of it, to be a side issue.

-- 
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www.
+rjh@
"Usenet is a strange place" - dmr 29 July 1999
Sig line vacant - apply within
From: mike on 14 Mar 2010 21:02
In article <Pt6dnb2Q14hw5gDWnZ2dnUVZ8rKdnZ2d(a)bt.com>,
rjh(a)see.sig.invalid says...
> mike wrote:
> > In article <EL2dnXUKpJB6PgHWnZ2dnUVZ7qydnZ2d(a)bt.com>,
> > rjh(a)see.sig.invalid says...
> >> mike wrote:
>
> <snip>
>
> >>> and the random monkey has typed three symbols. If the condensed string
> >>> is '00' then there is a 5/8 chance that it will not appear in a random
> >>> three character string, but if the string is '01' then there is a 1/2
> >>> chance it will not be found in a random three character string.
> >> ITYM 6/8 rather than 5/8? But anyway, yes, fair enough. Let me, then,
> >> ask the question a different way:
> >
> > Presumably you missed one of '000' or '001' or '010'.
>
> No, because '010' doesn't contain '00' as a substring. Actually, it was
> '100' that I missed! But yes, I missed one, obviously.
>
> >> My calculation, albeit flawed, might reasonably be said to give a
> >> ballpark figure. Is it possible to come up with a reasonably simple
> >> calculation that gives at least an order of magnitude reduction in the
> >> error level?
> >>
[ snipped bit ]
> You lost me already. Given n characters, x of them are correct, with 0
> <= x <= n. If p != q, then either p == x or q == x or (p != x && q !=
> x). You don't get to switch from p to q correct by some probability. I
> am guessing that you meant to write something different, but (if so) I
> can't glean your intent from what you've actually written. But I'll keep
> reading in case it becomes clearer...

Maybe I should have mentioned that for _this_particular_ Markov process,
q <= p+1. For example, if you are searching for '000000' then either
q = p+1 or q = 0. But if you are searching for '001000' then while p=3
could only transition to q=4 or q=0, p=5 could transition to q=6 (and
you have found the string) or to q=3.

>
> > As a simple example, looking at the two letter binary Shakespeare you
> > can build a matrix showing the chance of moving from p correct to q
> > correct.
> >
> > E.g. searching for '00'
> >
> > M = [1/2 1/2 0  ]    (with fixed-width fonts)
> >     [1/2 0   1/2]
> >     [0   0   1  ]
> >
> > where M(p,q) (p = row, q = column) = the probability of transitioning
> > between p correct and q correct.

So I lost you and then found you again.

> Okay, I'm with you. q = p+delta, and you're calculating the probability
> that you don't "go off track", so to speak, on the way from one to the
> other. But if I've understood you correctly and if you're right (and I
> lack the mathematical sophistication to allow me to determine whether or
> not I have and whether or not you are), we run into a problem.
> Shakespeare is around 40 Megabits; how big is the transition matrix
> going to be? If I understand you correctly, it will be 40,000,000 by
> 40,000,000 entries. Just storing a matrix that size is going to be a
> huge problem, and multiplying two MxM matrices requires MxM
> multiplication operations. We're going to be talking bignums to keep any
> degree of precision, and at the bignum level multiplication is O(xy)
> where x and y are the digits in the two multiplicands. So, for one
> transition, we have to perform 1,600,000,000,000,000 O(xy) operations.
> Then, for 40,000,000 transitions, we have to raise that whole mess to
> the 40,000,000th power. Using your method, the whole question changes to
> whether your suggested program will run to completion before the monkeys
> either hit the jackpot or give up in disgust. Whether it does in fact
> yield the required order of magnitude error reduction seems, on the face
> of it, to be a side issue.
>

Well you asked for a reasonably simple calculation that could
significantly reduce the error in your estimate. While my explanation
may not have been simple (I wasn't going to waste a lot of time editing
my post) the calculation is very simple (what could be simpler than
matrix multiplication?) and reduces the error to zero. I never claimed
that it was going to be practical*...
Comparisons with slightly longer binary strings show that your estimate
could be very badly out. For example your calculation would predict that
the probability of not finding a 4 digit binary string in 128 random
digits would be 0.0003136 = (15/16)^(128-3), whereas the actual
probability could be as low as 0.0001806 (for string '0110') or as high
as 0.009712 (for string '0000').

* It is probably worth noting that the initial transition matrix is very
sparse. I would guess that for the complete works (in 7-bit ascii and
assuming 40 M characters), it would consist of a 40,000,001 by
40,000,001 element matrix with the top 40,000,000 entries of the first
column roughly equal to 127/128, the diagonal (0,1), (1,2), (2,3) ...
(39,999,999, 40,000,000) set to 1/128, and element
(40,000,000, 40,000,000) set to 1. All the rest of the matrix would be
0, so think of the memory you could save (for the first few
multiplications at least).

Mike
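[The sparsity footnote is also the practical answer to the cost objection earlier in the thread: rather than powering the full matrix, you can push a single probability vector over the "characters correct" states through the chain one character at a time, touching only the nonzero transitions. A sketch (Python; the name `prob_found` and the inline fallback computation are mine, not from the thread) that also reproduces the 4-digit figures quoted above:]

```python
def prob_found(pattern, n, alphabet="01"):
    """Probability that `pattern` occurs somewhere in n random characters,
    computed by iterating a state-distribution vector: O(n) sparse steps
    instead of dense matrix powers."""
    m = len(pattern)

    def step(p, c):
        # new match length after reading c with p characters matched:
        # longest suffix of (matched prefix + c) that is a pattern prefix
        s = pattern[:p] + c
        return next(k for k in range(len(s), -1, -1)
                    if pattern.startswith(s[len(s) - k:]))

    dist = [1.0] + [0.0] * m            # start with 0 characters correct
    p_char = 1.0 / len(alphabet)
    for _ in range(n):
        new = [0.0] * (m + 1)
        new[m] = dist[m]                # 'found' state is absorbing
        for p in range(m):
            for c in alphabet:
                new[step(p, c)] += dist[p] * p_char
        dist = new
    return dist[m]

for pat in ("0000", "0110"):
    print(pat, 1.0 - prob_found(pat, 128))   # ~0.0097 and ~0.00018
```

[The two probabilities of *not* finding the string bracket the naive (15/16)^125 ≈ 0.0003136 estimate, which is exactly mike's point about the estimate being badly out in either direction.]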