From: Andrew Fabbro on 12 Jan 2010 00:34

I'm trying to devise a programmatic method to identify plaintext. One approach I'd like to try is to check candidate plaintext against tetragraphs that are extremely rare. For example, if XBMQ appears in the plaintext, then I will consider it non-English and move on to the next possible key. This method may not be perfect but I suspect it will work for my purposes.

The question is...where/how to get such a list of rare tetragraphs? I have not been able to google anything. There are 456,976 possible tetragraphs.

I built one from the Moby word lists, but it misses some important things...for example, the plaintext ATTACKATDAWN (I often don't know where the word boundaries are) contains "KATD", which does not appear in any of Moby's mwords or any dictionary word. Apparently, I'll need to process tetragraphs that cross word boundaries...I'm not sure if that invalidates the approach.

Hmm. My next thought was to download a hundred plain text books from Project Gutenberg, string all the letters together, and process the resultant 4-character substrings...?
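A minimal sketch of that Gutenberg approach in Python (the file name corpus.txt and the function name are placeholders, not anything given in the thread): keep letters only, so that tetragraphs crossing word boundaries such as KATD are counted too, then list every tetragraph that never occurs in the corpus.

    from collections import Counter
    from itertools import product
    import string

    def tetragraph_counts(path):
        """Count every 4-letter substring of the corpus, ignoring word boundaries."""
        with open(path, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        # Keep letters only, so boundary-crossing tetragraphs like KATD in
        # ATTACKATDAWN are counted along with in-word ones.
        letters = "".join(c for c in text.upper() if c in string.ascii_uppercase)
        return Counter(letters[i:i + 4] for i in range(len(letters) - 3))

    counts = tetragraph_counts("corpus.txt")  # placeholder: concatenated Gutenberg books
    all_tets = ("".join(t) for t in product(string.ascii_uppercase, repeat=4))
    rare = [t for t in all_tets if counts[t] == 0]
    print(f"{len(rare)} of 456976 tetragraphs never appear in the corpus")

Even a very large corpus will leave some legitimate tetragraphs unseen, which is the weakness the replies below point out; treating an unseen tetragraph as merely improbable rather than impossible avoids a hard false negative.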
From: Maaartin on 12 Jan 2010 10:25

On Jan 12, 6:34 am, Andrew Fabbro <andrew.fab...(a)gmail.com> wrote:
> I'm trying to devise a programmatic method to identify plaintext. One
> approach I'd like to try is to check candidate plaintext against
> tetragraphs that are extremely rare. For example, if XBMQ appears in
> the plaintext, then I will consider it non-English and move on to the
> next possible key.

Sure, but IMHO you should try harder. The time spent on plaintext recognition should be about the same order of magnitude as the time spent on decryption. Without having tried it myself, I'd say you could look at all letters, di-, tri- and tetragrams in a decrypted piece of text in less time than the decryption takes, thus minimizing the risk of failure.

For example, the text "if XBMQ appears in the plaintext" is valid plaintext, isn't it? If you reject it outright, you risk a false negative because of an acronym you don't know. Something like giving positive points for probable n-grams and negative points for improbable ones should work better.

> This method may not be perfect but I suspect it will work for my
> purposes.
>
> The question is...where/how to get such a list of rare tetragraphs? I
> have not been able to google anything. There are 456,976 possible
> tetragraphs.
>
> I built one from the Moby word lists, but it misses some important
> things...for example, the plaintext ATTACKATDAWN (I often don't know
> where the word boundaries are) contains "KATD", which does not appear
> in any of Moby's mwords or any dictionary word. Apparently, I'll need
> to process tetragraphs that cross word boundaries...I'm not sure if
> that invalidates the approach.

You need to build a table of all trigrams that end a word and combine it with all single letters that begin a word, and so on. This is not perfect, as it ignores the frequency distribution of whole words, but IMHO it's good enough.

> Hmm. My next thought was to download a hundred plain text books from
> Project Gutenberg, string all the letters together, and process the
> resultant 4-character substrings...?

IMHO even in such a large text there are many possible tetragrams missing.
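A minimal sketch of that scoring idea, assuming counts is a tetragraph Counter built from a reference corpus as in the sketch above; the function name and the floor smoothing value are illustrative, not anything from the thread.

    import math

    def tetragram_score(candidate, counts, total=None, floor=0.01):
        """Sum of log-probabilities of all tetragrams in the candidate.
        Higher means more English-like; unseen tetragrams get a small floor
        count instead of probability zero, so one odd acronym only lowers
        the score rather than rejecting the text outright."""
        if total is None:
            total = sum(counts.values())
        text = "".join(c for c in candidate.upper() if c.isalpha())
        score = 0.0
        for i in range(len(text) - 3):
            score += math.log((counts.get(text[i:i + 4], 0) + floor) / total)
        return score

    # Usage idea: rank all trial decryptions and keep the best-scoring one,
    # or pick a threshold calibrated on known English versus random letters.
    # best = max(candidates, key=lambda c: tetragram_score(c, counts))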
From: tms on 13 Jan 2010 17:18

On Jan 12, 12:34 am, Andrew Fabbro <andrew.fab...(a)gmail.com> wrote:
> I'm trying to devise a programmatic method to identify plaintext.

There is published work on this subject. For instance, Ravi Ganesan and Alan T. Sherman, "Statistical Techniques for Language Recognition: An Introduction and Guide for Cryptanalysts", Cryptologia 17(4), 321–366. Try Google Scholar.

> One approach I'd like to try is to check candidate plaintext against
> tetragraphs that are extremely rare. For example, if XBMQ appears in
> the plaintext, then I will consider it non-English and move on to the
> next possible key.

Suppose XBMQ is an acronym, or a foreign word, or nulls added to confuse analysis?
From: David Eather on 13 Jan 2010 17:54

tms wrote:
> On Jan 12, 12:34 am, Andrew Fabbro <andrew.fab...(a)gmail.com> wrote:
>> I'm trying to devise a programmatic method to identify plaintext.
>
> There is published work on this subject. For instance, Ravi Ganesan
> and Alan T. Sherman, "Statistical Techniques for Language Recognition:
> An Introduction and Guide for Cryptanalysts", Cryptologia 17(4),
> 321–366. Try Google Scholar.
>
>> One approach I'd like to try is to check candidate plaintext against
>> tetragraphs that are extremely rare. For example, if XBMQ appears in
>> the plaintext, then I will consider it non-English and move on to the
>> next possible key.
>
> Suppose XBMQ is an acronym, or a foreign word, or nulls added to
> confuse analysis?

Sinkov, "Elementary Cryptanalysis", is built entirely on the concept and application of statistical techniques for language recognition.