From: Paulo Marques on 22 Jul 2010 08:09

Fiziwig wrote:
> [...]
> I shall have to rebuild the dictionary using the Huffman algorithm and
> see what it looks like.

After tens of posts in sci.crypt trying to help people who didn't want any
help and took every constructive criticism as an insult, this was a real
breath of fresh air. Thank you, Gary!

If you need help building the Huffman-optimized codebook, please send me
the frequency table you're using and I'll try to help you out. I'm curious
what kind of improvement in the average letters-per-word value the pure
Huffman coding will give us.

Thanks again,

--
Paulo Marques - www.grupopie.com

"All generalizations are false."
From: Fiziwig on 22 Jul 2010 09:17

On Jul 22, 5:09 am, Paulo Marques <pmarq...(a)grupopie.com> wrote:
> Fiziwig wrote:
> > [...]
> > I shall have to rebuild the dictionary using the Huffman algorithm and
> > see what it looks like.
>
> After tens of posts in sci.crypt trying to help people who didn't want any
> help and took every constructive criticism as an insult, this was a real
> breath of fresh air. Thank you, Gary!
>
> If you need help building the Huffman-optimized codebook, please send me
> the frequency table you're using and I'll try to help you out. I'm curious
> what kind of improvement in the average letters-per-word value the pure
> Huffman coding will give us.
>
> Thanks again,

You're welcome.

My frequency data comes from the count of occurrences of each word in the
1-million-word Brown Corpus. It contains about 14,000 words, after removing
bogus words caused by typos in the original source material.

I'm curious myself to see the results. I'm coding up a C++ program to
accomplish the task. I started to do it by hand, but that turned out to be
too big a job with 14,000 words!

As an aside, it might be fun to let each ply of the tree alternate between
consonant-headed subtrees and vowel-headed subtrees, so that every code word
would alternate a vowel with a consonant and be pronounceable (UGABUGA).
Then it could be a spoken code as well. Or maybe that's more properly called
a language, in which case we've left cryptography and entered linguistics. ;-)

--gary
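To make the "UGABUGA" aside concrete, here is a toy C++ sketch of what
alternating-alphabet code words look like. It is not Gary's program and is
not tied to the Huffman tree at all: it simply turns a code-word index into a
pronounceable string by alternating between a consonant alphabet and a vowel
alphabet, digit by digit. The particular letter sets are an arbitrary choice
made for illustration.

    #include <iostream>
    #include <string>

    // Map an index to a pronounceable code word by alternating between a
    // 20-letter consonant alphabet and a 6-letter vowel alphabet (mixed radix).
    std::string pronounceable(unsigned n) {
        const std::string cons   = "BCDFGHJKLMNPQRSTVWXZ";  // 20 consonants
        const std::string vowels = "AEIOUY";                 // 6 vowels
        std::string out;
        bool useCons = true;                                 // start with a consonant
        do {
            const std::string& alpha = useCons ? cons : vowels;
            out += alpha[n % alpha.size()];
            n /= static_cast<unsigned>(alpha.size());
            useCons = !useCons;
        } while (n > 0);
        return out;
    }

    int main() {
        // Print the first few code words: B, C, D, ... then BE, CE, ... etc.
        for (unsigned i = 0; i < 10; ++i)
            std::cout << i << " -> " << pronounceable(i) << "\n";
        return 0;
    }

In the scheme Gary actually describes, the alternation would instead happen
by tree depth (consonant-headed plies alternating with vowel-headed plies),
but the output has the same flavor: every code word alternates consonants
and vowels and so can be spoken aloud.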
From: rossum on 22 Jul 2010 11:36

On Wed, 21 Jul 2010 16:25:38 -0700 (PDT), WTShaw <lurens1(a)gmail.com> wrote:
>On Jul 21, 3:59 pm, rossum <rossu...(a)coldmail.com> wrote:
>> On Tue, 20 Jul 2010 13:52:30 -0700 (PDT), Fiziwig <fizi...(a)gmail.com>
>> wrote:
>>
>> >I wanted to keep it as simple as possible for a human being using a
>> >hard copy code book and pencil and paper.
>>
>> Playfair.
>>
>> rossum
>
>Playfair, not so great. Yes, hard by hand, but vulnerable by machine.

Yes, it is breakable by machine, but it is easy to implement with just
pencil and paper. If I wanted an easy-to-implement computer cypher I would
pick RC4, which is simple enough to be coded from memory.

rossum
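For readers who have not seen it, here is a minimal C++ sketch of RC4 (the
key-scheduling step followed by keystream generation, XORed over the
message), included only to illustrate rossum's point that it is small enough
to code from memory. It assumes a non-empty key, and of course RC4 has known
weaknesses and is not recommended for real use.

    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    // RC4: returns data XORed with the RC4 keystream for the given key.
    std::vector<uint8_t> rc4(const std::string& key, const std::vector<uint8_t>& data) {
        uint8_t S[256];
        for (int i = 0; i < 256; ++i) S[i] = static_cast<uint8_t>(i);

        // Key-scheduling algorithm (KSA): permute S under control of the key.
        for (int i = 0, j = 0; i < 256; ++i) {
            j = (j + S[i] + static_cast<uint8_t>(key[i % key.size()])) & 0xFF;
            std::swap(S[i], S[j]);
        }

        // Pseudo-random generation algorithm (PRGA): one keystream byte per
        // input byte, XORed over the input.
        std::vector<uint8_t> out(data.size());
        int i = 0, j = 0;
        for (size_t k = 0; k < data.size(); ++k) {
            i = (i + 1) & 0xFF;
            j = (j + S[i]) & 0xFF;
            std::swap(S[i], S[j]);
            out[k] = data[k] ^ S[(S[i] + S[j]) & 0xFF];
        }
        return out;
    }

Because the keystream is simply XORed in, calling the same function again
with the same key on the ciphertext recovers the plaintext.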
From: John Nagle on 22 Jul 2010 14:40

On 7/20/2010 10:22 AM, Fiziwig wrote:
> Just for the fun of it (I'm retired with nothing better to do with my
> time :) I've created a new code (not cipher) based on a variation of
> Huffman coding but using the Roman alphabet instead of binary bits to
> encode English words instead of bytes of data. The code words are
> variable length, but self-segregating so decoding is unambiguous in
> all cases.

Some special-purpose handheld devices used to compress text with a codebook.
This allowed low-end devices to store an entire Bible or encyclopedia.

I once encountered a handheld "Bible" device with an off-by-one error in the
codebook. Long words were being replaced by different words that were
alphabetically near the correct word. Very strange.

John Nagle
From: Fiziwig on 22 Jul 2010 18:38
On Jul 22, 5:09 am, Paulo Marques <pmarq...(a)grupopie.com> wrote:
> If you need help building the Huffman-optimized codebook, please send me
> the frequency table you're using and I'll try to help you out. I'm curious
> what kind of improvement in the average letters-per-word value the pure
> Huffman coding will give us.

Having never programmed the Huffman algorithm, I approached the project with
some trepidation. As it turned out, it was a piece of cake. I used C++. My
ONLY bugs were a couple of simple typos and one uninitialized node pointer in
the tree. Altogether it took me about an hour to have it running, and another
hour of fiddling with the format of the output till I was happy with it.

The only adaptation I had to make to the basic Huffman algorithm was to use
an alphabet of 26 letters instead of an alphabet of 2 letters. So instead of
having two child nodes, each parent node had a single list under it with 26
linked entries.

If you are curious, the whole completed tree is available at
http://fiziwig.com/temp_huff.txt

Here's the sentence I used as a sample on my web page:

Here is a sample of the code used to encrypt this very sentence.

IH CL CG OYM CD CC WUG LA CF GXZU DE HO UXS  <-- Original hand-made code = 31 letters
AR PO E HXD I M CSF PXE F CSF ON CP DDI      <-- Genuine Huffman algorithm = 27 letters = 13% less space

(Note: My word list did not originally include "encrypt", so I cheated and
used "code" again. I had added it by hand to the first code book at the end
of the already-used codes. That's not so easy to do with a Huffman-coded
book.)

Note that with the genuine Huffman algorithm the sentence had 4 words that
used one-letter codes. My hand-made codebook had no single-letter codes.

If I encoded the entire 1-million-plus-word Brown Corpus, my results would be:

Total words:              1,075,609
Total letters:            2,476,824
Average letters per word: 2.30

This is not that much better than my hand-made code, which worked out at
3.35 letters per word.

As a code for a living language there also needs to be room for expansion
within the coding scheme, without breaking the already-coded words. That
would mean salting the original word file with a lot of nulls that could
later be replaced by real words as that became necessary. I'm not sure how
that would affect the efficiency of the coding. My hand-made system left
expansion gaps (for example, all codes ending in "T", regardless of length,
were unassigned and set aside for future expansion).

I also noticed that the Huffman code reached deeper into the alphabet to do
its coding. The highest code is PZZZ, leaving around 176,000 codes available
for expansion. In my hand-made code book the highest code I used was GYVF,
with about 335,000 available for expansion. Both are perfectly adequate.

I have a hunch I could factor out morphemes from the word list and cut it in
half or better. That way words like "unfortunately" would be coded as
separate morphemes: "un" + "fortunate" + "ly". Of course that would mean a
minimum of 4 letters for that word (assuming at least 2 for "fortunate"),
while the present Huffman code for "unfortunately" is "AUN". So I'm not sure
morpheme factoring would help compression, although it would certainly help
dictionary size.

I do like the genuine Huffman tree, though. It's clean and efficient.

Thanks for your help and suggestions. I think next I'll play with
alternating consonants and vowels so every code word is pronounceable. :)

--gary
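For anyone who wants to reproduce the experiment, the following is a minimal
C++ sketch of the 26-ary Huffman construction Gary describes: repeatedly
merge the 26 least-frequent nodes under a new parent, then read codes off the
tree one letter per level. It is not Gary's actual program; the input file
name "freq.txt" and its "count word" line format are assumptions made for
the example.

    #include <fstream>
    #include <iostream>
    #include <queue>
    #include <string>
    #include <vector>

    struct Node {
        long long freq;
        std::string word;            // empty for internal nodes and dummies
        std::vector<Node*> kids;     // up to 26 children, one per letter A..Z
    };

    // Min-heap ordering by frequency.
    struct ByFreq {
        bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
    };

    // Walk the finished tree, emitting one letter (A..Z) per level of depth.
    void emit(const Node* n, const std::string& prefix) {
        if (n->kids.empty()) {
            if (!n->word.empty())                        // skip zero-frequency padding
                std::cout << prefix << " " << n->word << "\n";
            return;
        }
        for (size_t i = 0; i < n->kids.size(); ++i)      // letter order within a parent is arbitrary
            emit(n->kids[i], prefix + char('A' + i));
    }

    int main() {
        std::priority_queue<Node*, std::vector<Node*>, ByFreq> pq;

        std::ifstream in("freq.txt");                    // hypothetical input: "<count> <word>" per line
        long long count;
        std::string word;
        size_t leaves = 0;
        while (in >> count >> word) {
            pq.push(new Node{count, word, {}});
            ++leaves;
        }

        // For an optimal 26-ary code, pad with zero-frequency dummies so that
        // (leaves - 1) is a multiple of 25; otherwise the final merge is short
        // and the shortest codes are wasted.
        while (leaves < 2 || (leaves - 1) % 25 != 0) {
            pq.push(new Node{0, "", {}});
            ++leaves;
        }

        // Repeatedly merge the 26 least-frequent nodes under a new parent node.
        while (pq.size() > 1) {
            Node* parent = new Node{0, "", {}};
            for (int i = 0; i < 26 && !pq.empty(); ++i) {
                Node* child = pq.top(); pq.pop();
                parent->freq += child->freq;
                parent->kids.push_back(child);
            }
            pq.push(parent);
        }

        emit(pq.top(), "");                              // print "<code> <word>" pairs
        return 0;
    }

This is the same greedy idea as binary Huffman coding, just merging 26 nodes
at a time; the dummy-padding step is the one detail that is easy to miss when
generalizing from the 2-letter case.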