From: MrD on 23 Jul 2010 03:50

Fiziwig wrote:
>
> Total Words 1,075,609
> Total Letters 2,476,824
> Average letters per word 2.30
>
> This is not that much better than my hand made code that worked out
> at 3.35 letters per word.

30% smaller is nothing to be sneezed at.

> As a code for a living language there also needs to be room for
> expansion within the coding scheme, without breaking the already
> coded words, which would mean salting the original word file with a
> lot of nulls that could later be replaced by real words as that
> became necessary.

Have you said how many *distinct* words there are in your corpus?
Sorry if you've already said, I couldn't find it.

I believe Dr. Seuss deliberately chose a lexicon of just 300 words;
that one can get by in English with about 1,000 words; most native
English speakers manage all their lives with a lexicon of about 3,000
words, and that Winston Churchill commanded a lexicon of about 12,000
("12,000 strong", as adacrypt might say).

How do you handle stemming? E.g. is each form[1] of a verb treated as
a distinct word? I can't see how else you'd do it (without some
complicated grammar rules). So without a stemmer, I suppose the
standard lexicon might grow to about 15,000 and the Churchill lexicon
might grow to about 60,000. The Seuss lexicon probably wouldn't grow
that much, because in addition to his miniature lexicon, Seuss also
restricted himself to a pretty rudimentary grammar (everything is
present-tense indicative, etc.)

Actually I don't have the first clue how much a lexicon shrinks/grows
by as a result of stemming; those numbers are unadulterated guesswork.
But people who work on text-retrieval tools would know the facts.
--
MrD.

[1] MrD is stuck for the right word; he tried "declension", but knew
that wasn't what he meant. MrD thinks Churchill would have known the
right word.
From: Paulo Marques on 23 Jul 2010 08:25

Fiziwig wrote:
> [...]
> Having never programmed the Huffman Algorithm, I approached the
> project with some trepidation. As it turned out, it was a piece of
> cake. I used C++. My ONLY bugs were a couple of simple typos and one
> uninitialized node pointer in the tree. Altogether it took me about an
> hour to have it running, and another hour of fiddling with the format
> of the output till I was happy with it. The only adaptation I had to
> make to the basic Huffman algorithm was use an alphabet of 26 letters
> instead of an alphabet of 2 letters. So instead of having two child
> nodes, each parent node had a single list under it with 26 linked
> entries.
>
> If you are curious, the whole completed tree is available at
> http://fiziwig.com/temp_huff.txt

Nice work!

I'm afraid you still have a bug, though :(

When building N-ary Huffman trees you have to make sure the number of
elements fills up an N-branched tree _completely_.

This means that for a 26-ary tree, the number of starting elements
needs to be "25 * N + 26" for some integer N. If it isn't, you need to
add dummy zero-frequency elements to the end of the list before you
start building the tree.

For instance, if you have 12340 elements -> (12340 - 26) mod 25 = 14,
so you need to add (25 - 14) = 11 dummy elements before starting to
collapse the tree. I'm doing this all from memory, so I hope I'm
getting the math right...

That is why, in the end, the algorithm is only using letters from A to
P as the first letter and not going up to Z. So the end result, after
inserting the dummy elements, should be even better than what you have
now.

> [...]
> Note that with the genuine Huffman algorithm the sentence had 4 words
> that used one-letter codes. My hand made codebook had no single-letter
> codes.

That is one thing I was curious about: whether the optimized version
would find words with such a large frequency that they deserved their
own single symbol. Thanks for satisfying my curiosity ;)

> If I encoded the entire 1 million+ word Brown Corpus my results would
> be
>
> Total Words 1,075,609
> Total Letters 2,476,824
> Average letters per word 2.30
>
> This is not that much better than my hand made code that worked out at
> 3.35 letters per word.

The total "original letters including spaces" (i.e. roughly the total
size of the corpus) would be a nice statistic too, so that we could
appreciate the compression factor of both methods.

Anyway, as MrD pointed out, 30% is not that bad. I could even bend the
statistics the other way around and say your original method increased
the size by 46% ;)

> [...]
> I also noticed that the Huffman code reached deeper into the alphabet
> to do its coding. The highest code is PZZZ leaving around 176,000
> codes available for expansion. In my hand-made code book the highest
> code I used was GYVF, with about 335,000 available for expansion. Both
> are perfectly adequate.

I think that if you want expansion capability you shouldn't leave it
to chance. You can add "expansion symbols" explicitly to the end of the
table, with some appropriate "expected frequency" for them.

The number of symbols available for expansion is another thing: if you
don't expect any expansion, but want to "be prepared" for it, you might
even reserve just a single word for expansion and then, when that word
appears, it means the next N characters describe the expansion. With
N = 4 you would already be able to get 26^4 = 456976 expansion words,
while wasting only a single symbol in the table.
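As a rough illustration of that escape scheme (the names and the
reserved "ZZZZ" code word below are invented for the example, not taken
from either codebook), the fixed-width extension index could be packed
and unpacked like this:

#include <string>

// Illustrative sketch only: reserve one table entry, ESCAPE_CODE, and
// spell out-of-table words as a fixed-width base-26 index after it.
const std::string ESCAPE_CODE = "ZZZZ";    // assumed reserved code word

std::string encode_expansion(int index)    // 0 <= index < 26^4 = 456976
{
    std::string digits(4, 'A');
    for (int i = 3; i >= 0; --i) {
        digits[i] = 'A' + index % 26;      // least significant letter last
        index /= 26;
    }
    return ESCAPE_CODE + digits;
}

int decode_expansion(const std::string& digits)  // the 4 letters after the escape
{
    int index = 0;
    for (char c : digits)
        index = index * 26 + (c - 'A');
    return index;
}

Any otherwise unused code word would serve as the escape; the decoder
just has to treat the next four letters literally instead of looking
them up in the table.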
> I have a hunch I could factor out morphemes from the word list and cut
> it in half or better. That way words like "unfortunately" would be
> coded as separate morphemes: "un + fortunate + ly" Of course that
> would mean a minimum of 4 letters for that word (assuming at least 2
> for "fortunate"), while the present Huffman code for "unfortunately"
> is "AUN". So I'm not sure morpheme factoring would help compression,
> although it would certainly help dictionary size.
>
> I do like the genuine Huffman tree, though. It's clean and efficient.
> Thanks for your help and suggestions.

You're welcome.

> I think next I'll play with alternating consonants and vowels so every
> code word is pronounceable. :)

It will certainly be bigger, though ;)

--
Paulo Marques - www.grupopie.com

"...so she told me it was either her or the ham radio, over."
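The mod-25 padding rule from Paulo's post is easy to check with a few
lines of code. The sketch below is only an illustration with invented
names, not code from either poster's program; it computes how many
zero-frequency dummies a 26-ary table needs so that the element count
is congruent to 1 mod 25:

#include <cstdio>

// Number of zero-frequency dummy symbols to append so that an n-ary
// Huffman tree comes out complete (element count congruent to 1 mod n-1).
int dummies_needed(int n_symbols, int arity)
{
    int r = n_symbols % (arity - 1);            // want a remainder of 1
    return (arity - 1 + 1 - r) % (arity - 1);   // smallest non-negative fix-up
}

int main()
{
    printf("%d\n", dummies_needed(12340, 26));  // Paulo's example: prints 11
    printf("%d\n", dummies_needed(9315, 26));   // a 9315-word list also needs
                                                // 11 (9315 + 11 = 25*373 + 1)
    return 0;
}

Padding pushes the short merge down among the zero-frequency dummies at
the bottom of the tree instead of leaving the root with unused branches,
which is why the unpadded tree never got past P as a first letter.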
From: Greg Rose on 23 Jul 2010 12:07

In article <xs-dndaPSY2iF9TRnZ2dnUVZ8nudnZ2d(a)novis.pt>,
Paulo Marques <pmarques(a)grupopie.com> wrote:
>The total "original letters including spaces" (i.e. roughly the total
>size of the corpus) would be a nice statistic too, so that we could
>appreciate the compression factor of both methods.

Slightly out of context, I know, but I think it is worth mentioning
that with Huffman coding you can also leave out the spaces. As you
traverse the tree, arriving at a leaf determines the end of the word.
That would make pronounceability a bit tougher, though.

Greg.
--
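Greg's observation follows from the codes being a prefix-free set: no
code word is a prefix of another, so the decoder always knows where one
word ends and the next begins. A minimal sketch (invented types, not
Fiziwig's program) of a decoder that walks a 26-ary trie of the codes
and emits a word each time it reaches a leaf:

#include <map>
#include <string>
#include <vector>

struct TrieNode {
    std::map<char, TrieNode*> child;     // one branch per code letter
    const std::string* word = nullptr;   // non-null only at a leaf
};

std::vector<std::string> decode(const std::string& stream, TrieNode* root)
{
    std::vector<std::string> out;
    TrieNode* node = root;
    for (char c : stream) {              // no spaces anywhere in the stream
        node = node->child.at(c);        // follow one letter of the code
        if (node->word) {                // leaf reached: one whole word decoded
            out.push_back(*node->word);
            node = root;                 // restart for the next code word
        }
    }
    return out;
}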
From: Fiziwig on 23 Jul 2010 12:19

On Jul 23, 12:50 am, MrD <mrdemean...(a)jackpot.invalid> wrote:
>
> Have you said how many *distinct* words there are in your corpus? Sorry
> if you've already said, I couldn't find it.

No, I haven't worked that out. It would be nice to compress out the
trivial duplicates like plurals. What's the sense of having both
"apple" and "apples" as separate entries?

> I believe Dr. Seuss deliberately chose a lexicon of just 300 words; that
> one can get by in English with about 1,000 words; most native English
> speakers manage all their lives with a lexicon of about 3,000 words, and
> that Winston Churchill commanded a lexicon of about 12,000 ("12,000
> strong", as adacrypt might say).

Just this morning I found a nifty list of the 2,000 "most essential"
English words, and it even includes frequency data!
http://jbauman.com/gsl.html

> How do you handle stemming? E.g. is each form[1] of a verb treated as
> a distinct word? I can't see how else you'd do it (without some
> complicated grammar rules). So without a stemmer, I suppose the standard
> lexicon might grow to about 15,000 and the Churchill lexicon might grow
> to about 60,000. The Seuss lexicon probably wouldn't grow that much,
> because in addition to his miniature lexicon, Seuss also restricted
> himself to a pretty rudimentary grammar (everything is present-tense
> indicative, etc.)

I haven't addressed that yet. First I need to enumerate all the various
ways that one word can be derived from another. (quick, quickly,
quicker, quickest, quickness, quickosity, quickability, ... :)

> Actually I don't have the first clue how much a lexicon shrinks/grows by
> as a result of stemming; those numbers are unadulterated guesswork. But
> people who work on text-retrieval tools would know the facts.

I don't know either. It would be an interesting topic to explore. But
then, now we're really far afield from cryptography.

--gary
From: Fiziwig on 23 Jul 2010 13:23
On Jul 23, 5:25 am, Paulo Marques <pmarq...(a)grupopie.com> wrote:
>
> I'm afraid you still have a bug, though :(
>
> When building N-ary Huffman trees you have to make sure the number of
> elements fills up an N-branched tree _completely_.
>
> This means that for a 26-ary tree, the number of starting elements
> needs to be "25 * N + 26" for some integer N. If it isn't, you need to
> add dummy zero-frequency elements to the end of the list before you
> start building the tree.

I looked it up. The number of starting elements needs to be congruent
to 1 mod n-1, so it has to be of the form 25X + 1. I had 9315 words, so
I added 11 DUMMY words with 0 probability to make 9326.

The results were strange:

Total Words 1075617
Total Letters 1246124
Average letters per word 1.16

The program claims an average of 1.16 letters per word, but that seemed
suspiciously low to me, so I looked at the numbers. The top 14 words get
1-letter codes, and those top 14 words make up 26% of a typical English
text. The next 225 words get 2-letter codes, and the whole top 239 words
make up 62% of a typical English text. So with 26% * 1 + 36% * 2, and
the remaining 38% being mostly 3-letter codes, that works out to about
2.12 letters per word (approximate, since I didn't account for those few
4-letter codes), so I suspect there's something wrong with the
calculation done by the program. Although I don't see what could be
wrong: it's simply the total number of code letters used divided by the
total of all word frequencies, and my original corpus was slightly more
than a million words, so that number looks right. I'll double check it.

Even so, 2.12 is about 8% better than the previous 2.30.

Thanks for spotting my oversight.

-gary
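The sanity check described above is easy to reproduce. The sketch below
is hypothetical (the Entry type and the table contents are assumed; it
is not the program under discussion). It recomputes the average as total
code letters weighted by frequency, divided by total frequency, and also
reprints the banded estimate:

#include <cstdio>
#include <string>
#include <vector>

// Hypothetical table entry: a code word (e.g. "AUN") and its corpus frequency.
struct Entry { std::string code; long freq; };

// Average letters per word = sum(code length * frequency) / sum(frequency).
double average_letters_per_word(const std::vector<Entry>& table)
{
    double letters = 0, words = 0;
    for (const Entry& e : table) {
        letters += (double)e.code.size() * e.freq;
        words += e.freq;
    }
    return letters / words;
}

int main()
{
    // Back-of-the-envelope version of the same sum, using the bands quoted
    // above: 26% one-letter codes, 36% two-letter codes, and the remaining
    // 38% treated as three-letter codes.
    printf("%.2f\n", 0.26 * 1 + 0.36 * 2 + 0.38 * 3);   // prints 2.12
    return 0;
}

Run over the real 9,326-entry table, the same sum would settle whether
1.16 or 2.12 is the right figure.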