From: Mok-Kong Shen on 29 Jun 2010 09:39

Let's assume a 6-bit printable coding alphabet Q of 64 symbols, e.g. { a-z, A-Z, 0-9, +, - }, and adopt the following convention for grouping of codewords:

Group 1: two symbols, 1st symbol in Q\{0-9}, 2nd in Q.
Group 2: three symbols, 1st symbol in {1-9}, 2nd and 3rd in Q.
Group 3: four symbols, 1st symbol 0, 2nd symbol in Q\{0}, 3rd and 4th in Q.

The cardinalities of these three groups are 3456, 36864 and 258048 respectively, totalling 298368. Since each symbol takes 6 bits, codewords of the three groups cost 12, 18 and 24 bits respectively. Considering that Basic English has 850 core words and that a frequency count of the books of Project Gutenberg involves some 40000 words (see http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists), the above coding scheme should be amply sufficient for a dictionary coding of English words in most common practical applications. For efficiency, the most frequently used words should of course be assigned to group 1 and the comparatively less frequent ones to group 2, with group 3 containing the words that are only seldom used. Note that we have reserved the initial "00" to provide an adequate escape mechanism for verbatim coding of any exceptional words that may be required.

Very roughly, I estimate that one could in this way code with an average of about 14 bits per word. How would this compare with ASCII coding followed by compression?

Thanks.

M. K. Shen
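P.S. To make the grouping concrete, below is a minimal sketch in Python. Only the alphabet Q and the three codeword groups come from the scheme above; the ranked word list, the sample text and all identifiers are hypothetical placeholders.

    # Minimal sketch of the proposed dictionary coding. The word list and
    # sample text are hypothetical; only Q and the groups are from the scheme.
    import itertools
    import string
    import zlib

    # The 64-symbol coding alphabet Q = {a-z, A-Z, 0-9, +, -}; 6 bits/symbol.
    Q = string.ascii_lowercase + string.ascii_uppercase + string.digits + "+-"
    NON_DIGITS = string.ascii_lowercase + string.ascii_uppercase + "+-"  # Q\{0-9}
    assert len(Q) == 64 and len(NON_DIGITS) == 54

    def codewords():
        """Yield codewords in order: group 1, then group 2, then group 3."""
        # Group 1: 54 * 64 = 3456 two-symbol codewords (12 bits each).
        for t in itertools.product(NON_DIGITS, Q):
            yield "".join(t)
        # Group 2: 9 * 64 * 64 = 36864 three-symbol codewords (18 bits each).
        for t in itertools.product("123456789", Q, Q):
            yield "".join(t)
        # Group 3: 63 * 64 * 64 = 258048 four-symbol codewords (24 bits each).
        # The 2nd symbol excludes '0', so the prefix "00" stays free as the
        # escape for verbatim coding of exceptional words.
        for t in itertools.product("0", Q.replace("0", ""), Q, Q):
            yield "".join(t)

    def build_codebook(ranked_words):
        """Map words, most frequent first, onto successively longer codewords."""
        return dict(zip(ranked_words, codewords()))

    # Hypothetical usage: `ranked_words` would come from a frequency list such
    # as the Wiktionary Project Gutenberg counts; dummy words stand in here.
    ranked_words = ["the", "of", "and"]
    codebook = build_codebook(ranked_words)

    text = "the and of the the and"
    words = text.split()

    # Cost of the dictionary coding: 6 bits per codeword symbol.
    dict_bits = sum(6 * len(codebook[w]) for w in words)
    print("dictionary coding:", dict_bits / len(words), "bits/word")

    # Rough comparison point: ASCII text fed to a general-purpose compressor.
    ascii_bits = 8 * len(zlib.compress(text.encode("ascii")))
    print("zlib on ASCII:   ", ascii_bits / len(words), "bits/word")

Note that on such a tiny sample the zlib figure is dominated by the compressor's fixed header overhead; a meaningful comparison would need a realistic corpus and a full frequency-ranked dictionary.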