From: Mac McTeague on 16 May 2010 13:36

First let me say that I am not a computer science student, so I am not trying to get help with my homework. I am an IT professional who supports Unix, Linux, and some Windows, mostly at the enterprise server level. While I am between jobs, I decided to come up with some projects to help improve my scripting skills. While working on the one I will describe momentarily, I came to suspect that my scripting approach may be inadequate for the task. I am just curious how people with more scripting and programming knowledge would approach the problem. What tools would they use? How would they make it more efficient and correct?

The idea was to create a script that would automatically extract from text the terms that are likely to be unknown to the average (college-educated) reader, that is, terms that might need to be added to a glossary or vocabulary list. So I got a list of the 10,000 most common words in English. Then I got a list of the 20,000 most common words appearing in all Project Gutenberg texts. I got rid of the duplicates and came up with a list of about 19,000 words. Since that is a fairly large list, I am going to assume that if a word appears in a text file but is not on the list, many people may not know it, or it may be technically specific. Therefore it might need to be put into a glossary for explanation.

Now I can certainly write a basic loop that checks whether each word in a text file is also in the big word list, and if not, prints it to a separate file. But here is the problem: a typical book might have 50,000 or more words, and the list I have is almost 20,000 words. That would mean a billion string comparisons to extract a list of candidate glossary words. That seems rather a bit much for grep to handle, which was my original intention.

So I am just curious what approach actual programmers, which I am not, might take in addressing a problem of this type. Perhaps I should turn the big word list into a database, or perhaps some sort of hash comparison using Perl would be good. I just don't think that my original idea of doing it as a shell script is sufficient, considering the number of comparisons that need to be made; it might take all day to run the script. So what would be a more effective approach? Thanks for any opinions and advice.
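[Editorial note: grep may be less hopeless than the billion-comparison arithmetic suggests. With -F, GNU grep reads the entire pattern file into a single multi-pattern matcher rather than comparing every book word against every list word in turn. A minimal sketch, using tiny stand-in files in place of the real 19,000-word list:]

```shell
# Tiny stand-ins for the real word lists (one word per line)
printf '%s\n' the cat sat on mat > known.txt
printf '%s\n' the zymurgy cat ocelot mat > book_words.txt

# -F: treat patterns as fixed strings, -x: match whole lines,
# -f: read patterns from a file, -v: print book words matching
# no known word, i.e. the glossary candidates.
grep -vFxf known.txt book_words.txt
```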
From: Moi on 16 May 2010 13:54

On Sun, 16 May 2010 10:36:29 -0700, Mac McTeague wrote:
[snip]
> Now I can certainly write a basic loop that looks to see if a word in
> a text file is also in the big word list, and if not, prints it to a
> separate file. But here is the problem: a typical book might have
> 50,000 or more words, and the list I have is almost 20,000 words. So
> that would mean a billion string comparisons in order to extract a
> list of potential words to be added to the glossary. That seems rather
> a bit much for grep to handle, which was my original intention.
>
> So I am just curious what approach actual programmers, which I am not,
> might take in addressing a problem of this type. Perhaps I should turn
> the big word list into a database, or perhaps some sort of hash
> comparison using Perl would be good. I just don't think that my
> original idea of doing it as a shell script is sufficient considering
> the number of comparisons that need to be made. It might take all day
> to run the script. So what would be a more effective approach? Thanks
> for any opinions and advice.

Hashing is the way to go. A Patricia tree (trie) could be another choice, or skip lists. Hashing N items into N slots (using chaining), you'll need about 1.5 comparisons per item, so a well-tuned implementation will effectively run at the speed of the disk I/O. I don't know what Perl does; it might add some overhead when it resizes its hash table, invokes GC, etc.

HTH,
AvK
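[Editorial note: the hashing approach Moi recommends can be tried without leaving the shell, since awk's associative arrays are hash tables. The word list is loaded once into memory, and then each book word costs one hash probe instead of a scan of the whole list. A minimal sketch with stand-in files:]

```shell
# Tiny stand-ins for the real word lists (one word per line)
printf '%s\n' the cat sat on mat > known.txt
printf '%s\n' the zymurgy cat sat on the ocelot mat > book_words.txt

# While reading the first file (NR==FNR), store each word as a key in
# the hash "known". For the second file, print any word whose key is
# absent: one hash lookup per word, two passes total.
awk 'NR==FNR { known[$1]; next } !($1 in known)' known.txt book_words.txt
```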
From: Ben Bacarisse on 16 May 2010 23:06

Mac McTeague <mcteague409(a)gmail.com> writes:
<snip>
> The idea was to create a script that would automatically extract from
> text terms that were likely to be unknown to the average (college
> educated) reader. That is, terms that might need to be added to a
> glossary or vocabulary list. So I got a list of the 10,000 most
> common words in English. Then I got a list of the 20,000 most common
> words appearing in all Project Gutenberg texts. I got rid of the
> duplicates and came up with a list of about 19,000 words. Since that
> is a fairly large list, I am going to assume that if a word appears in
> a text file but is not on the list, many people may not know it, or it
> may be technically specific. Therefore it might need to be put into a
> glossary for explanation.

It can certainly be done with a script. In fact, it is a one-liner. It is fun to work these things out, so I won't show you my solution unless you ask. Use man to read up on tr, sort, uniq and comm.

--
Ben.
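[Editorial note: for readers who would rather not work it out themselves, one possible assembly of the tools Ben names — tr to split the text into words, sort -u to deduplicate, and comm to subtract the known-word list — is sketched below with small stand-in inputs:]

```shell
# Stand-ins for the real book and the real 19,000-word list
cat > book.txt <<'EOF'
The cat sat on the mat, pondering zymurgy
while an ocelot watched.
EOF
printf '%s\n' the cat sat on mat while an watched > known.txt

# 1. tr -cs turns every run of non-letters into a newline (one word per
#    line); a second tr lower-cases so "The" and "the" compare equal.
# 2. sort -u leaves each side sorted with duplicates removed.
# 3. comm -23 prints lines unique to the first file: words in the book
#    that are absent from the known-word list.
tr -cs 'A-Za-z' '\n' < book.txt | tr 'A-Z' 'a-z' | sort -u > book_words.txt
sort -u known.txt > known_sorted.txt
comm -23 book_words.txt known_sorted.txt
```

Because sort and comm stream sorted data, the cost is dominated by the sorts, not by pairwise string comparisons.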
From: Keith Thompson on 18 May 2010 21:29

Mac McTeague <mcteague409(a)gmail.com> writes:
[...]
> The idea was to create a script that would automatically extract from
> text terms that were likely to be unknown to the average (college
> educated) reader. That is, terms that might need to be added to a
> glossary or vocabulary list. So I got a list of the 10,000 most
> common words in English. Then I got a list of the 20,000 most common
> words appearing in all Project Gutenberg texts. I got rid of the
> duplicates and came up with a list of about 19,000 words. Since that
> is a fairly large list, I am going to assume that if a word appears in
> a text file but is not on the list, many people may not know it, or it
> may be technically specific. Therefore it might need to be put into a
> glossary for explanation.
[...]

What you're doing is almost identical to the Unix "spell" command. I've read an old paper describing its extremely clever design, but a couple of minutes of Googling haven't turned it up.

--
Keith Thompson (The_Other_Keith) kst-u(a)mib.org <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"