From: J.D. on 5 Apr 2010 23:52

On Apr 5, 10:03 pm, Earl_Colby_Pottinger <earlcolby.pottin...(a)sympatico.ca> wrote:
>
> Are there any other languages that are far denser than English?

As far as I know, all natural languages have a fair amount of redundancy, both in their grammatical structure and, more importantly, in the ratio of morphemes to phonemes (i.e. there are enormously more possible combinations of sounds, even within the constraints of English phonology, than there are English words; serdly foon, shayep?).

This redundancy is readily apparent in our ability to understand s###ences eve# #ver ver# #oisy ch##nels -- and I have never heard of a natural language that lacks similar redundancy (diachronic sound change alone should prevent such a language from ever arising naturally). There are invented languages expressly designed to minimize redundancy (supposedly to maximize transmission rate), but generally all of these projects turn out to be unlearnable -- as in, even their creators cannot actually use them conversationally.
From: unruh on 6 Apr 2010 01:57

On 2010-04-06, robertwessel2(a)yahoo.com <robertwessel2(a)yahoo.com> wrote:
> On Apr 5, 9:03 pm, Earl_Colby_Pottinger <earlcolby.pottin...(a)sympatico.ca> wrote:
>> On Apr 5, 8:17 pm, unruh <un...(a)wormhole.physics.ubc.ca> wrote:
>>> On 2010-04-05, Earl_Colby_Pottinger <earlcolby.pottin...(a)sympatico.ca> wrote:
>>>
>>>> Text is a VERY VERY non-random source.
>>>
>>> Well, very very is perhaps an overstatement. Certainly it has a fair
>>> amount of redundancy, but estimates of 2-2.5 bits of randomness per
>>> character are often quoted (rather than the 8 bits/byte, or the 6
>>> bits/character assuming only the ASCII printable characters or so).
>>> Thus if you squeeze the text down by a factor of 3 or so, you should get
>>> pretty good randomness (i.e., use MD5 on each 50 characters or so, and use
>>> the 128-bit output as your random source).
>>
>> That does not sound right to me; I thought English text, when guessed
>> by humans, has a far lower bit rate than that. Or am I misunderstanding
>> how well MD5 will hash the input? It seems to me that there are far
>> fewer than 2 to the power of 128 possible ways to arrange 50 characters
>> of text that will be valid English (with trailing and leading fragments).
>
> It's actually more along the lines of .6-1.5 bits/character, depending
> on who did the estimate (Shannon measured .6-1.3).

Fine, make it 150 characters.
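The distillation unruh describes can be sketched in a few lines of Python. This is a hypothetical illustration only: MD5 appears here because the thread mentions it, and the 150-character chunk size follows his revised estimate of roughly 1 bit of entropy per character.

```python
import hashlib

def distill(text, chunk_chars=150):
    """Hash successive fixed-size chunks of text down to 128-bit blocks.

    At roughly 1 bit of entropy per character (the lower estimates
    quoted in this thread), ~150 characters should cover a 128-bit
    output. MD5 is used only because the thread suggests it; any
    modern hash would do the conditioning at least as well.
    """
    blocks = []
    for i in range(0, len(text) - chunk_chars + 1, chunk_chars):
        chunk = text[i:i + chunk_chars].encode("utf-8")
        blocks.append(hashlib.md5(chunk).digest())  # 16 bytes per chunk
    return b"".join(blocks)
```

Note that this only conditions the input; it cannot add entropy the text lacks, which is exactly the objection raised in the rest of the thread.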
From: robertwessel2 on 6 Apr 2010 02:23

On Apr 6, 12:57 am, unruh <un...(a)wormhole.physics.ubc.ca> wrote:
> [snip quoted discussion of per-character entropy estimates]
>
> Fine, make it 150 characters.

But still, it's not very random in the cryptographic sense. Let's say there are on the order of a trillion English-equivalent words published per day. If I know you got your entropy from some N-byte sequence in Tuesday's collection of 1T words, that's at best about 40 bits' worth. And if I can monitor the traffic to your PC, you couldn't retrieve more than about 1 TB of source material per day even if you fully utilized a 100 Mb/s link.
FWIW, total Usenet traffic is about 20 million messages, and 20 GB, per day. If you wanted to use an external stream as an entropy source, that's a highly available (and fairly high-rate) source, though it does of course contain a substantial binary component. Cooking down each day's New York Times (or a Usenet feed) is probably a perfectly acceptable source of entropy for a simulation, but I would harbor severe doubts about its value if you need cryptographically secure random bits.
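The "about 40 bits" figure above is just a base-2 logarithm: if an attacker knows the seed came from one of roughly a trillion published words, guessing it costs at most a search over that space. A quick check:

```python
import math

# Entropy of a uniform choice among ~1 trillion possibilities:
search_space = 1e12
bits = math.log2(search_space)
print(round(bits, 2))  # about 39.86, i.e. "about 40 bits"
```

The same arithmetic explains why monitoring the victim's link matters: knowing which subset of the day's text was fetched shrinks the search space, and the bits shrink with it.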
From: David Eather on 6 Apr 2010 06:56

On 6/04/2010 12:25 PM, robertwessel2(a)yahoo.com wrote:
> [snip quoted discussion of per-character entropy estimates]
>
> It's actually more along the lines of .6-1.5 bits/character, depending
> on who did the estimate (Shannon measured .6-1.3).

I think Shannon also pointed out that the amount of entropy per character depends on the length of the text, the entropy dropping as the text gets longer.
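A crude way to see the redundancy under discussion is to run English-like text through a general-purpose compressor. This is a toy illustration only: the repeated sample exaggerates the effect, and zlib is far from the optimal predictor Shannon's human-guessing experiments approximate.

```python
import zlib

# Toy sample; a real measurement would use a large, non-repetitive corpus.
sample = ("the quick brown fox jumps over the lazy dog. " * 50).encode("utf-8")
compressed = zlib.compress(sample, level=9)
ratio = len(compressed) / len(sample)
# A ratio well below 1.0 reflects the redundancy of English-like text.
# Shannon's human-prediction estimates (0.6-1.3 bits/char) imply even
# more redundancy than a dictionary compressor like zlib can find.
```

The compressed size per character gives an upper bound on entropy per character, which is one reason the thread's estimates vary so widely.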
From: Maaartin on 6 Apr 2010 07:31
On Apr 6, 12:56 pm, David Eather <eat...(a)tpg.com.au> wrote:
> I think Shannon also pointed out that the amount of entropy also depends
> on the length of text - the entropy dropping as the text gets longer.

This should all be no problem, as there's a lot of text available, so you can hash a couple of kilobytes down to 128 bits. But a publicly known text can be no source of entropy, can it? Hashing the title page of a fixed internet newspaper would be easy enough, but for what purpose could I use the result? Surely not as a secret key, since the attacker knows it. Maybe as a nonce, but for a nonce a counter could be better (e.g., Salsa20 needs just a unique nonce, and for CBC an encrypted counter works, right?). Please give me an example where using an internet text as an entropy source is advantageous.
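Maaartin's counter-based alternative can be sketched as follows (a hypothetical helper, not from any library; the 64-bit nonce width matches Salsa20's, which the post mentions):

```python
import struct

class NonceCounter:
    """Produce unique 64-bit little-endian nonces from a counter.

    A nonce only has to be unique, not secret or unpredictable,
    which is why a plain counter suffices here and hashing public
    text would add nothing.
    """
    def __init__(self, start=0):
        self._n = start

    def next(self):
        nonce = struct.pack("<Q", self._n)  # 8 bytes, little-endian
        self._n += 1
        return nonce
```

The state must persist across restarts (or the start value must be fresh each session) to preserve uniqueness, but no secrecy is required at any point.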