From: Mok-Kong Shen on 5 Apr 2010 06:51

To my knowledge, true randomness is today commonly collected from the natural physical environment, i.e. from radioactive decay, electromagnetic noise from the atmosphere or from electronic circuits, etc., exceptions being some attempts to exploit irregularities in a user's keystroke timing. It is generally acknowledged, however, that natural language texts have fairly substantial entropy. So it is natural to consider systematically collecting true randomness from this highly abundant source. This has certainly occasionally been done in practice, if I don't err, by subjecting such texts to compression or, better, to some well-established hashing algorithm.

I would appreciate some discussion of the feasibility of employing comparatively simpler (primitive) ways of collecting randomness from natural language texts, because these would have the advantage of being readily implementable from scratch by oneself, without any external aid. I surmise that quite a number of good possibilities exist. I'll thus start by mentioning two methods, which seem to be viable from my current humble knowledge. (To avoid eventual misunderstanding: no novelty is hereby implied.)

The first is to employ a polyalphabetic substitution matrix of classical crypto (with each column being a pseudo-random permutation of the alphabet) and to process two independently chosen text streams in such a way that the characters of one stream are considered the 'key' (the other stream is the 'plaintext' as usual, and the 'ciphertext' is the desired result). If one wants to have the result in bits, one could code the characters in a 5-bit code (one may optionally also utilize some other symbols in the source, e.g. punctuation, on that occasion).

The second is, assuming that one employs a 5-bit code and hence has the source material in binary form, to do some non-linear combination in computer words.
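The first method above can be sketched as follows. This is a minimal illustration, not code from the post; the function names, the fixed seed, and the restriction to uppercase A-Z are my own assumptions for the sake of a runnable example.

```python
import random
import string

ALPHABET = string.ascii_uppercase  # assuming a plain 26-letter alphabet

def build_matrix(seed=12345):
    """Polyalphabetic substitution matrix: one pseudo-randomly permuted
    column of the alphabet per possible key letter (seed is arbitrary)."""
    rng = random.Random(seed)
    matrix = {}
    for k in ALPHABET:
        column = list(ALPHABET)
        rng.shuffle(column)
        matrix[k] = column
    return matrix

def combine(key_stream, text_stream, matrix):
    """Treat one text stream as the 'key' and the other as the
    'plaintext'; the substitution output is the extracted randomness.
    Characters outside the alphabet are simply skipped here."""
    out = []
    for k, p in zip(key_stream, text_stream):
        if k in matrix and p in ALPHABET:
            out.append(matrix[k][ALPHABET.index(p)])
    return ''.join(out)

def to_5bit(text):
    """Optional 5-bit coding of the result (A=00000 .. Z=11001)."""
    return ''.join(format(ALPHABET.index(c), '05b') for c in text)
```

For example, `combine("THEQUICK", "BROWNFOX", build_matrix())` yields an eight-letter result stream, which `to_5bit` turns into 40 bits.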
It may be mentioned in this connection that I previously suggested one such combination, namely Z = X*Y ^ X + Y (where ^ denotes xor), which should be fairly efficient computationally. Of course, such schemes could be employed recursively. That is, given e.g. four input streams, one could first do a pairwise combination, resulting in two streams, and then process these again to obtain one single stream as the desired result.

It may also be noted that, if two communication partners agree on how to choose the source texts, e.g. which parts of a particular day's newspaper or which new materials on the internet (which are abundant) are to be used, then they could use the true randomness thus obtained not only for such purposes as IVs but also for proper encryption processing operations. (Computing a "MAC" of the true randomness obtained could be a way to ensure that both have indeed collected the same material.)

Thanks. M. K. Shen
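The word combiner and its recursive pairwise use can be sketched like this. Under both C and Python operator precedence, Z = X*Y ^ X + Y parses as (X*Y) ^ (X+Y); the 32-bit word size is an assumption, since the post does not fix one.

```python
WORD_MASK = 0xFFFFFFFF  # assuming 32-bit computer words

def combine_words(x, y):
    """Z = (X*Y) ^ (X+Y), truncated to the word size."""
    return ((x * y) ^ (x + y)) & WORD_MASK

def reduce_streams(streams):
    """Recursive pairwise combination: e.g. four input streams are
    combined into two, then into one single output stream."""
    while len(streams) > 1:
        streams = [
            [combine_words(a, b) for a, b in zip(s1, s2)]
            for s1, s2 in zip(streams[::2], streams[1::2])
        ]
    return streams[0]
```

For instance, `combine_words(3, 5)` gives (15) ^ (8) = 7, and `reduce_streams` applied to four word streams halves their number on each pass until one remains.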
From: Earl_Colby_Pottinger on 5 Apr 2010 16:10

Text is a VERY VERY non-random source.
From: unruh on 5 Apr 2010 21:17

On 2010-04-05, Earl_Colby_Pottinger <earlcolby.pottinger(a)sympatico.ca> wrote:
> Text is a VERY VERY non-random source.

Well, "very very" is perhaps an overstatement. Certainly it has a fair amount of redundancy, but an estimate of 2-2.5 bits of randomness per character is often quoted (rather than the 8 bits/byte, or the 6 bits/character assuming only the ASCII printable characters or so). Thus if you squeeze the text down by a factor of 3 or so, you should get pretty good randomness (i.e., use MD5 on each 50 characters or so, and use the 128-bit output as your random source).
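The suggestion in this post, hashing each 50-character block with MD5 and using the 128-bit digests, can be sketched as follows (the function name and block size default are illustrative; whether a digest really carries 128 bits of entropy is exactly what the follow-up posts dispute):

```python
import hashlib

def text_to_random_bytes(text, block_chars=50):
    """MD5 each full 50-character block of the source text and
    concatenate the resulting 128-bit (16-byte) digests."""
    out = bytearray()
    for i in range(0, len(text) - block_chars + 1, block_chars):
        block = text[i:i + block_chars]
        out += hashlib.md5(block.encode()).digest()
    return bytes(out)
```

A 100-character input thus yields two 16-byte digests, i.e. 32 bytes of output; any trailing partial block is discarded in this sketch.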
From: Earl_Colby_Pottinger on 5 Apr 2010 22:03

On Apr 5, 8:17 pm, unruh <un...(a)wormhole.physics.ubc.ca> wrote:
> [...] estimates of 2-2.5 bits of randomness per character is often
> quoted [...] use MD5 on each 50 characters or so, and use the
> 128 bit output as your random source.

That does not sound right to me; I thought English text, when guessed by humans, has a far lower bit rate than that. Or am I misunderstanding how well MD5 will hash the input? It seems to me that there are far fewer than 2^128 possible ways to arrange 50 characters of text that will be valid English (with trailing and leading fragments). Are there any other languages that are far denser than English?
From: robertwessel2 on 5 Apr 2010 22:25
On Apr 5, 9:03 pm, Earl_Colby_Pottinger <earlcolby.pottin...(a)sympatico.ca> wrote:
> That does not sound right to me, I thought English text when guessed
> by humans have a far lower bit rate than that. [...]

It's actually more along the lines of 0.6-1.5 bits/character, depending on who did the estimate (Shannon measured 0.6-1.3).
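The objection in this exchange can be quantified with back-of-the-envelope arithmetic: even at the highest estimate quoted in the thread, 50 characters of English carry fewer than 128 bits of entropy, so each MD5 digest would overstate the randomness collected. The figures below are only the thread's own estimates, not measurements.

```python
# At the entropy estimates quoted in the thread, how many bits do
# 50 characters carry, and how many characters would 128 bits need?
for bits_per_char in (0.6, 1.3, 2.5):
    in_block = 50 * bits_per_char            # entropy in a 50-char block
    chars_needed = 128 / bits_per_char       # chars for a full 128 bits
    print(f"{bits_per_char} bits/char: 50 chars -> {in_block:.0f} bits; "
          f"128 bits needs {chars_needed:.0f} chars")
```

Even at 2.5 bits/char a 50-character block holds only about 125 bits, and at Shannon's lower figures roughly 100-215 characters per digest would be needed.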