From: Mok-Kong Shen on 5 Apr 2010 06:51

To my knowledge, true randomness is today commonly collected from the natural physical environment, i.e. from radioactive decay, electromagnetic noise from the atmosphere or from electronic circuits, etc., exceptions being some attempts to exploit irregularities in a user's keystroke timing. It is generally acknowledged, however, that natural language texts have fairly substantial entropy. So it is natural to consider systematically collecting true randomness from this highly abundant source. This has certainly occasionally been done in practice, if I don't err, by subjecting such texts to compression or, better, to some well-established hashing algorithm.

I would appreciate some discussion of the feasibility of employing comparatively simpler (primitive) ways of collecting randomness from natural language texts, because these would have the advantage of being readily implementable from scratch by oneself, without any external aid. I surmise that quite a number of good possibilities exist. I'll thus start by mentioning two methods, which seem to be viable from my current humble knowledge. (To avoid eventual misunderstanding: no novelty is hereby implied.)

The first is to employ a polyalphabetic substitution matrix of classical crypto (with each column being a pseudo-random permutation of the alphabet) and to process two independently chosen text streams in such a way that the characters of one stream are considered the 'key' (the other stream is the 'plaintext' as usual, and the 'ciphertext' is the desired result). If one wants to have the result in bits, one could code the characters in a 5-bit code (one may optionally also utilize some other symbols in the source, e.g. punctuation, on that occasion).

The second is, assuming that one employs a 5-bit code and hence has the source material in binary form, to do some non-linear combination in computer words.
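The first method above can be sketched as follows. This is a minimal illustration, not code from the post; the function names, the fixed seed, and the restriction to uppercase A-Z are my own assumptions for the sake of a runnable example.

```python
import random
import string

ALPHABET = string.ascii_uppercase  # assuming a plain 26-letter alphabet

def build_matrix(seed=12345):
    """Polyalphabetic substitution matrix: one pseudo-randomly permuted
    column of the alphabet per possible key letter (seed is arbitrary)."""
    rng = random.Random(seed)
    matrix = {}
    for k in ALPHABET:
        column = list(ALPHABET)
        rng.shuffle(column)
        matrix[k] = column
    return matrix

def combine(key_stream, text_stream, matrix):
    """Treat one text stream as the 'key' and the other as the
    'plaintext'; the substitution output is the extracted randomness.
    Characters outside the alphabet are simply skipped here."""
    out = []
    for k, p in zip(key_stream, text_stream):
        if k in matrix and p in ALPHABET:
            out.append(matrix[k][ALPHABET.index(p)])
    return ''.join(out)

def to_5bit(text):
    """Optional 5-bit coding of the result (A=00000 .. Z=11001)."""
    return ''.join(format(ALPHABET.index(c), '05b') for c in text)
```

For example, `combine("THEQUICK", "BROWNFOX", build_matrix())` yields an eight-letter result stream, which `to_5bit` turns into 40 bits.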
It may be mentioned in this connection that I previously suggested one such combination, namely Z = X*Y ^ X + Y (where ^ denotes xor), which should be fairly efficient computationally. Of course, such schemes could be employed recursively. That is, given e.g. four input streams, one could first do a pairwise combination, resulting in two streams, and then process these again to obtain one single stream as the desired result.

It may also be noted that, if two communication partners agree on how to choose the source texts, e.g. which parts of a particular day's newspaper or which new materials on the internet (which are abundant) are to be used, then they could use the true randomness thus obtained not only for such purposes as IVs but also for proper encryption processing operations. (Computing a "MAC" of the true randomness obtained could be a way to ensure that both have indeed collected the same material.)

Thanks. M. K. Shen
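The word combiner and its recursive pairwise use can be sketched like this. Under both C and Python operator precedence, Z = X*Y ^ X + Y parses as (X*Y) ^ (X+Y); the 32-bit word size is an assumption, since the post does not fix one.

```python
WORD_MASK = 0xFFFFFFFF  # assuming 32-bit computer words

def combine_words(x, y):
    """Z = (X*Y) ^ (X+Y), truncated to the word size."""
    return ((x * y) ^ (x + y)) & WORD_MASK

def reduce_streams(streams):
    """Recursive pairwise combination: e.g. four input streams are
    combined into two, then into one single output stream."""
    while len(streams) > 1:
        streams = [
            [combine_words(a, b) for a, b in zip(s1, s2)]
            for s1, s2 in zip(streams[::2], streams[1::2])
        ]
    return streams[0]
```

For instance, `combine_words(3, 5)` gives (15) ^ (8) = 7, and `reduce_streams` applied to four word streams halves their number on each pass until one remains.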
From: Earl_Colby_Pottinger on 5 Apr 2010 16:10

Text is a VERY VERY non-random source.
From: unruh on 5 Apr 2010 21:17

On 2010-04-05, Earl_Colby_Pottinger <earlcolby.pottinger(a)sympatico.ca> wrote:
> Text is a VERY VERY non-random source.

Well, "very very" is perhaps an overstatement. Certainly it has a fair amount of redundancy, but an estimate of 2-2.5 bits of randomness per character is often quoted (rather than the 8 bits/byte, or the 6 bits/character assuming only the ASCII printable characters or so). Thus if you squeeze the text down by a factor of 3 or so, you should get pretty good randomness (i.e., use MD5 on each 50 characters or so, and use the 128-bit output as your random source).
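The suggestion in this post, hashing each 50-character block with MD5 and using the 128-bit digests, can be sketched as follows (the function name and block size default are illustrative; whether a digest really carries 128 bits of entropy is exactly what the follow-up posts dispute):

```python
import hashlib

def text_to_random_bytes(text, block_chars=50):
    """MD5 each full 50-character block of the source text and
    concatenate the resulting 128-bit (16-byte) digests."""
    out = bytearray()
    for i in range(0, len(text) - block_chars + 1, block_chars):
        block = text[i:i + block_chars]
        out += hashlib.md5(block.encode()).digest()
    return bytes(out)
```

A 100-character input thus yields two 16-byte digests, i.e. 32 bytes of output; any trailing partial block is discarded in this sketch.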
From: Earl_Colby_Pottinger on 5 Apr 2010 22:03

On Apr 5, 8:17 pm, unruh <un...(a)wormhole.physics.ubc.ca> wrote:
> [...] estimates of 2-2.5 bits of randomness per character is often
> quoted [...] use MD5 on each 50 characters or so, and use the
> 128 bit output as your random source.

That does not sound right to me; I thought English text, when guessed by humans, has a far lower bit rate than that. Or am I misunderstanding how well MD5 will hash the input? It seems to me that there are far fewer than 2^128 possible ways to arrange 50 characters of text that will be valid English (with trailing and leading fragments). Are there any other languages that are far denser than English?
From: robertwessel2 on 5 Apr 2010 22:25
On Apr 5, 9:03 pm, Earl_Colby_Pottinger <earlcolby.pottin...(a)sympatico.ca> wrote:
> That does not sound right to me, I thought English text when guessed
> by humans have a far lower bit rate than that. [...]

It's actually more along the lines of 0.6-1.5 bits/character, depending on who did the estimate (Shannon measured 0.6-1.3).
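The objection in this exchange can be quantified with back-of-the-envelope arithmetic: even at the highest estimate quoted in the thread, 50 characters of English carry fewer than 128 bits of entropy, so each MD5 digest would overstate the randomness collected. The figures below are only the thread's own estimates, not measurements.

```python
# At the entropy estimates quoted in the thread, how many bits do
# 50 characters carry, and how many characters would 128 bits need?
for bits_per_char in (0.6, 1.3, 2.5):
    in_block = 50 * bits_per_char            # entropy in a 50-char block
    chars_needed = 128 / bits_per_char       # chars for a full 128 bits
    print(f"{bits_per_char} bits/char: 50 chars -> {in_block:.0f} bits; "
          f"128 bits needs {chars_needed:.0f} chars")
```

Even at 2.5 bits/char a 50-character block holds only about 125 bits, and at Shannon's lower figures roughly 100-215 characters per digest would be needed.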