From: John Posner on 28 Jan 2010 12:23

On 1/28/2010 10:50 AM, evilweasel wrote:
> I will make my question a little more clearer. I have close to 60,000
> lines of the data similar to the one I posted. There are various
> numbers next to the sequence (this is basically the number of times
> the sequence has been found in a particular sample). So, I would need
> to ignore the ones containing '0' and write all other sequences
> (excluding the number, since it is trivial) in a new text file, in the
> following format:
>
>> seq59902
> TTTTTTTATAAAATATATAGT
>
>> seq59903
> TTTTTTTATTTCTTGGCGTTGT
>
>> seq59904
> TTTTTTTGGTTGCCCTGCGTGG
>
>> seq59905
> TTTTTTTGTTTATTTTTGGG
>
> The number next to 'seq' is the line number of the sequence. When I
> run the above program, what I expect is an output file that is similar
> to the above output but with the ones containing '0' ignored. But, I
> am getting all the sequences printed in the file.
>
> Kindly excuse the 'newbieness' of the program. :) I am hoping to
> improve in the next few months. Thanks to all those who replied. I
> really appreciate it. :)

Your program is a good first try. It contains a newbie error (looking for the number 0 instead of the string "0"). But more importantly, you're doing too much work yourself, rather than letting Python do the heavy lifting for you. These practices and tools make life a lot easier:

* As others have noted, don't accumulate output in a list. Just write data to the output file line-by-line.

* You don't need to initialize every variable at the beginning of the program. But there's no harm in it.

* Use the enumerate() function to provide a line counter:

    for counter, line in enumerate(file1):

  This eliminates the need to accumulate output data in a list, then use the index variable "j" as the line counter.

* Use string formatting. Each chunk of output is a two-line string, with the line-counter and the DNA sequence as variables:

    outformat = """seq%05d
    %s
    """

    ... later, inside your loop ...

    resultsfile.write(outformat % (counter, sequence))

HTH,
John
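
The tips above can be put together in a minimal sketch. It assumes each input line holds a sequence and a single count separated by whitespace, which the thread does not show explicitly; the function name `write_nonzero` and the sample data are made up for illustration.

```python
import io

# Two-line output chunk: zero-padded line counter, then the sequence.
OUTFORMAT = "seq%05d\n%s\n"


def write_nonzero(infile, outfile):
    """Write every sequence whose count is not "0"; return how many were written.

    Assumes each line looks like "SEQUENCE COUNT" (an assumption about the data).
    """
    written = 0
    for counter, line in enumerate(infile):
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        sequence, count = fields
        # Fields read from a file are strings, so compare with "0", not 0.
        if count == "0":
            continue
        # Write each chunk immediately instead of collecting it in a list.
        outfile.write(OUTFORMAT % (counter, sequence))
        written += 1
    return written


# In-memory files stand in for the real input and output files.
sample = io.StringIO(
    "TTTTTTTATAAAATATATAGT 1\n"
    "TTTTTTTATTTCTTGGCGTTGT 0\n"
    "TTTTTTTGGTTGCCCTGCGTGG 3\n"
)
result = io.StringIO()
written = write_nonzero(sample, result)
print(result.getvalue(), end="")
```

Note that enumerate() counts from 0 by default; passing a start value of 1 as its second argument would number the lines from 1 instead.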
From: Jean-Michel Pichavant on 28 Jan 2010 12:49

evilweasel wrote:
> I will make my question a little more clearer. I have close to 60,000
> lines of the data similar to the one I posted. There are various
> numbers next to the sequence (this is basically the number of times
> the sequence has been found in a particular sample). So, I would need
> to ignore the ones containing '0' and write all other sequences
> (excluding the number, since it is trivial) in a new text file, in the
> following format:
>
>> seq59902
> TTTTTTTATAAAATATATAGT
>
>> seq59903
> TTTTTTTATTTCTTGGCGTTGT
>
>> seq59904
> TTTTTTTGGTTGCCCTGCGTGG
>
>> seq59905
> TTTTTTTGTTTATTTTTGGG
>
> The number next to 'seq' is the line number of the sequence. When I
> run the above program, what I expect is an output file that is similar
> to the above output but with the ones containing '0' ignored. But, I
> am getting all the sequences printed in the file.
>
> Kindly excuse the 'newbieness' of the program. :) I am hoping to
> improve in the next few months. Thanks to all those who replied. I
> really appreciate it. :)

Using regexp may increase readability (if you are familiar with it). What about

    import re
    import sys

    output = open("sequences1.txt", 'w')

    for index, line in enumerate(open(sys.argv[1], 'r')):
        # Match a DNA sequence followed by whitespace and a count starting with 1
        match = re.match(r'(?P<sequence>[GATC]+)\s+1', line)
        if match:
            output.write('seq%s\n%s\n' % (index, match.group('sequence')))

Jean-Michel
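
The pattern above only accepts a count whose first digit is 1, so a line with a count of 3 would be skipped. A possible variant, sketched under the assumption that each line is a sequence followed by a single integer count, captures the count and filters on its value. The names `LINE_RE` and `nonzero_sequences` are hypothetical.

```python
import re

# A run of G/A/T/C, whitespace, then an integer count up to the end of the line.
LINE_RE = re.compile(r'(?P<sequence>[GATC]+)\s+(?P<count>\d+)\s*$')


def nonzero_sequences(lines):
    """Yield (line index, sequence) for each line whose count is not zero."""
    for index, line in enumerate(lines):
        match = LINE_RE.match(line)
        if match and int(match.group('count')) != 0:
            yield index, match.group('sequence')


# Made-up sample lines in the assumed "SEQUENCE COUNT" layout.
sample = [
    "TTTTTTTATAAAATATATAGT 1\n",
    "TTTTTTTATTTCTTGGCGTTGT 0\n",
    "TTTTTTTGGTTGCCCTGCGTGG 12\n",
]
found = list(nonzero_sequences(sample))
for index, sequence in found:
    print('seq%s\n%s' % (index, sequence))
```

Converting the captured count with int() means values such as "00" are also treated as zero, which a plain string comparison with "0" would miss.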
From: D'Arcy J.M. Cain on 28 Jan 2010 13:03

On Thu, 28 Jan 2010 18:49:02 +0100
Jean-Michel Pichavant <jeanmichel(a)sequans.com> wrote:
> Using regexp may increase readability (if you are familiar with it).

If you have a problem and you think that regular expressions are the solution then now you have two problems. Regex is really overkill for the OP's problem and it certainly doesn't improve readability.

--
D'Arcy J.M. Cain <darcy(a)druid.net>         |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP)   |  what's for dinner.
From: Jean-Michel Pichavant on 28 Jan 2010 13:21

D'Arcy J.M. Cain wrote:
> On Thu, 28 Jan 2010 18:49:02 +0100
> Jean-Michel Pichavant <jeanmichel(a)sequans.com> wrote:
>
>> Using regexp may increase readability (if you are familiar with it).
>
> If you have a problem and you think that regular expressions are the
> solution then now you have two problems. Regex is really overkill for
> the OP's problem and it certainly doesn't improve readability.

It depends on the reader's ability to understand a *simple* regexp. It is also strange to get such an answer after taking so many precautions, so let me quote myself:

"Using regexp *may* increase readability (*if* you are *familiar* with it)."

I honestly find it quite readable in the sample code I provided, and it spares all the if-len-startswith-strip logic, but if the OP does not agree, that is fine with me. But there's no need to be so certain that I'm completely wrong.

JM
From: Steven Howe on 28 Jan 2010 13:28

On 01/28/2010 09:49 AM, Jean-Michel Pichavant wrote:
> evilweasel wrote:
>> I will make my question a little more clearer. I have close to 60,000
>> lines of the data similar to the one I posted. There are various
>> numbers next to the sequence (this is basically the number of times
>> the sequence has been found in a particular sample). So, I would need
>> to ignore the ones containing '0' and write all other sequences
>> (excluding the number, since it is trivial) in a new text file, in the
>> following format:
>>
>>> seq59902
>> TTTTTTTATAAAATATATAGT
>>
>>> seq59903
>> TTTTTTTATTTCTTGGCGTTGT
>>
>>> seq59904
>> TTTTTTTGGTTGCCCTGCGTGG
>>
>>> seq59905
>> TTTTTTTGTTTATTTTTGGG
>>
>> The number next to 'seq' is the line number of the sequence. When I
>> run the above program, what I expect is an output file that is similar
>> to the above output but with the ones containing '0' ignored. But, I
>> am getting all the sequences printed in the file.
>>
>> Kindly excuse the 'newbieness' of the program. :) I am hoping to
>> improve in the next few months. Thanks to all those who replied. I
>> really appreciate it. :)
> Using regexp may increase readability (if you are familiar with it).
> What about
>
>     import re
>     import sys
>
>     output = open("sequences1.txt", 'w')
>
>     for index, line in enumerate(open(sys.argv[1], 'r')):
>         # Match a DNA sequence followed by whitespace and a count starting with 1
>         match = re.match(r'(?P<sequence>[GATC]+)\s+1', line)
>         if match:
>             output.write('seq%s\n%s\n' % (index, match.group('sequence')))
>
> Jean-Michel

Finally! After reading 8 or 9 messages about finding a line ending with '1', someone suggests regex. It was my first thought.

Steven