From: D'Arcy J.M. Cain on 28 Jan 2010 10:44

On Thu, 28 Jan 2010 07:07:04 -0800 (PST)
evilweasel <karthikramaswamy88(a)gmail.com> wrote:
> I am a newbie to python, and I would be grateful if someone could

Welcome.

> point out the mistake in my program. Basically, I have a huge text
> file similar to the format below:

You don't say how it isn't working. As a first step you should read
http://catb.org/~esr/faqs/smart-questions.html.

> The text is nothing but DNA sequences, and there is a number next to
> it. What I will have to do is, ignore those lines that have 0 in it,

Your code doesn't completely ignore them. See below.

> and print all other lines (excluding the number) in a new text file
> (in a particular format called as FASTA format). This is the program I
> wrote for that:
>
> seq1 = []
> list1 = []
> lister = []
> listers = []
> listers1 = []
> a = []
> d = []
> i = 0
> j = 0
> num = 0

This seems like an awful lot of variables for such a simple task.

>
> file1 = open(sys.argv[1], 'r')
> for line in file1:

This is good. You aren't trying to load the whole file into memory at
once. If the file is huge as you say then that would have been bad.

I would have made one small optimization that saves one assignment and
one extra variable.

    for line in open(sys.argv[1], 'r'):

>     if not line.startswith('\n'):
>         seq1 = line.split()
>         if len(seq1) == 0:
>             continue

This is redundant and perhaps not even correct at the end of the file.
It assumes that the last line ends with a newline. Look at what
'\n'.split() gives you and see if you can't improve the above code.

Another small optimization - "if seq1" is better than "if len(seq1)".

>
>         a = seq1[0]
>         list1.append(a)

Aha! I may have found your bug. Are you mixing tabs and spaces?
Don't do that. Either always use spaces or always use tabs. My
suggestion is to use spaces and choose a short indent such as three or
even two but that's a religious issue.

>
>         d = seq1[1]
>         lister.append(d)

You can also do "a, d = seq1". Of course you must be sure that you
have two fields. Perhaps that's guaranteed for your input but a quick
sanity test wouldn't hurt here.

However, I don't understand all of the above. It may also be a source
of problems. You say the files are huge. Are you filling up memory
here? You did the smart thing reading the file but you lose it here.
In any case, see below.

> b = len(lister)
> for j in range(0, b):

Go look up zip()

>     if lister[j] == 0:

I think that you will find that lister[j] is "0", not 0.

>         listers.append(j)
>     else:
>         listers1.append(j)

Why are you collecting the input? Just toss the '0' ones and write the
other lines directly to the output.

Hope this helps with this script and in further understanding the power
and simplicity of Python. Good luck.

-- 
D'Arcy J.M. Cain <darcy(a)druid.net>         |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP)   |  what's for dinner.
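A minimal sketch of three points from the reply above: unpacking with "a, d = seq1", the sanity test on the number of fields, and the fact that split() returns strings, so a comparison with the integer 0 never succeeds. The helper name parse_record is hypothetical and does not come from the thread; the sample line is taken from the original post.

```python
def parse_record(line):
    """Return (sequence, count) for a two-field line, otherwise None.

    parse_record is a hypothetical helper, not code from the thread.
    """
    fields = line.split()      # "\n".split() == [], so blank lines fall out here
    if len(fields) != 2:       # the quick sanity test on the field count
        return None
    sequence, count = fields   # unpacking, as in "a, d = seq1"
    return sequence, count


sequence, count = parse_record("AAAAAGACTCGAGTGCGCGGA 0\n")
print(count == 0)          # False: count is the string "0", not the integer 0
print(count == "0")        # True
print(parse_record("\n"))  # None
```

Because the comparison with 0 is always false, every index in the original program lands in listers1, which would explain why all sequences end up in the output file.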
From: Krister Svanlund on 28 Jan 2010 10:39

On Thu, Jan 28, 2010 at 4:31 PM, Krister Svanlund
<krister.svanlund(a)gmail.com> wrote:
> On Thu, Jan 28, 2010 at 4:28 PM, Krister Svanlund
> <krister.svanlund(a)gmail.com> wrote:
>> On Thu, Jan 28, 2010 at 4:07 PM, evilweasel
>> <karthikramaswamy88(a)gmail.com> wrote:
>>> Hi folks,
>>>
>>> I am a newbie to python, and I would be grateful if someone could
>>> point out the mistake in my program. Basically, I have a huge text
>>> file similar to the format below:
>>>
>>> AAAAAGACTCGAGTGCGCGGA 0
>>> AAAAAGATAAGCTAATTAAGCTACTGG 0
>>> AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
>>> AAAAAGGGGGCTCACAGGGGAGGGGTAT 1
>>> AAAAAGGTCGCCTGACGGCTGC 0
>>>
>>> The text is nothing but DNA sequences, and there is a number next to
>>> it. What I will have to do is, ignore those lines that have 0 in it,
>>> and print all other lines (excluding the number) in a new text file
>>> (in a particular format called as FASTA format). This is the program I
>>> wrote for that:
>>>
>>> seq1 = []
>>> list1 = []
>>> lister = []
>>> listers = []
>>> listers1 = []
>>> a = []
>>> d = []
>>> i = 0
>>> j = 0
>>> num = 0
>>>
>>> file1 = open(sys.argv[1], 'r')
>>> for line in file1:
>>>     if not line.startswith('\n'):
>>>         seq1 = line.split()
>>>         if len(seq1) == 0:
>>>             continue
>>>
>>>         a = seq1[0]
>>>         list1.append(a)
>>>
>>>         d = seq1[1]
>>>         lister.append(d)
>>>
>>>
>>> b = len(lister)
>>> for j in range(0, b):
>>>     if lister[j] == 0:
>>>         listers.append(j)
>>>     else:
>>>         listers1.append(j)
>>>
>>>
>>> print listers1
>>> resultsfile = open("sequences1.txt", 'w')
>>> for i in listers1:
>>>     resultsfile.write('\n>seq' + str(i) + '\n' + list1[i] + '\n')
>>>
>>> But this isn't working. I am not able to find the bug in this. I would
>>> be thankful if someone could point it out. Thanks in advance!
>>>
>>> Cheers!
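
I'm trying this again:

    import sys

    newlines = []
    # Read the whole input file named by the first argument.
    with open(sys.argv[1], 'r') as f:
        text = f.read()
    for line in (l.strip() for l in text.splitlines()):
        if line:
            line_elem = line.split()
            # Keep only two-field lines whose count is '1'.
            if len(line_elem) == 2 and line_elem[1] == '1':
                newlines.append('seq'+line_elem[0])
    # Write the kept lines to the file named by the second argument.
    with open(sys.argv[2], 'w') as f:
        f.write('\n'.join(newlines))
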
From: evilweasel on 28 Jan 2010 10:50

I will make my question a little clearer. I have close to 60,000
lines of the data similar to the one I posted. There are various
numbers next to the sequence (this is basically the number of times
the sequence has been found in a particular sample). So, I would need
to ignore the ones containing '0' and write all other sequences
(excluding the number, since it is trivial) in a new text file, in the
following format:

>seq59902

TTTTTTTATAAAATATATAGT

>seq59903

TTTTTTTATTTCTTGGCGTTGT

>seq59904

TTTTTTTGGTTGCCCTGCGTGG

>seq59905

TTTTTTTGTTTATTTTTGGG

The number next to 'seq' is the line number of the sequence. When I
run the above program, what I expect is an output file that is similar
to the above output but with the ones containing '0' ignored. But, I
am getting all the sequences printed in the file.

Kindly excuse the 'newbieness' of the program. :) I am hoping to
improve in the next few months. Thanks to all those who replied. I
really appreciate it. :)

From: nn on 28 Jan 2010 11:13

On Jan 28, 10:50 am, evilweasel <karthikramaswam...(a)gmail.com> wrote:
> I will make my question a little clearer. I have close to 60,000
> lines of the data similar to the one I posted. There are various
> numbers next to the sequence (this is basically the number of times
> the sequence has been found in a particular sample). So, I would need
> to ignore the ones containing '0' and write all other sequences
> (excluding the number, since it is trivial) in a new text file, in the
> following format:
>
> >seq59902
>
> TTTTTTTATAAAATATATAGT
>
> >seq59903
>
> TTTTTTTATTTCTTGGCGTTGT
>
> >seq59904
>
> TTTTTTTGGTTGCCCTGCGTGG
>
> >seq59905
>
> TTTTTTTGTTTATTTTTGGG
>
> The number next to 'seq' is the line number of the sequence. When I
> run the above program, what I expect is an output file that is similar
> to the above output but with the ones containing '0' ignored. But, I
> am getting all the sequences printed in the file.
>
> Kindly excuse the 'newbieness' of the program. :) I am hoping to
> improve in the next few months. Thanks to all those who replied. I
> really appreciate it. :)

People have already given you some pointers to your problem. In the
end you will have to "tweak the details" because only you have access
to the data, not us.

Just as an example, here is another way to do what you are doing:

    with open('dnain.dat') as infile, open('dnaout.dat','w') as outfile:
        partgen=(line.split() for line in infile)
        dnagen=(str(i+1)+'\n'+part[0]+'\n'
                for i,part in enumerate(partgen)
                if len(part)>1 and part[1]!='0')
        outfile.writelines(dnagen)

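One detail of the generator version is easy to miss: enumerate() wraps the unfiltered stream, so each kept record carries the line number it had in the input file, which is what the original poster asked for. The sketch below, which is illustrative and not from the thread, contrasts that with filtering first and numbering afterwards. The list name lines and both result names are made up for the example; the three sample lines come from the original post.

```python
lines = [
    "AAAAAGACTCGAGTGCGCGGA 0",
    "AAAAAGATAAGCTAATTAAGCTACTGG 0",
    "AAAAAGATAAGCTAATTAAGCTACTGGGTT 1",
]

# Number first, filter second: numbers refer to lines of the input.
parts = (line.split() for line in lines)
numbered_then_filtered = [
    (i + 1, part[0])
    for i, part in enumerate(parts)
    if len(part) > 1 and part[1] != "0"
]

# Filter first, number second: numbering restarts at 1 for the kept records.
kept = (p for p in (line.split() for line in lines)
        if len(p) > 1 and p[1] != "0")
filtered_then_numbered = [(i + 1, part[0]) for i, part in enumerate(kept)]

print(numbered_then_filtered)   # [(3, 'AAAAAGATAAGCTAATTAAGCTACTGGGTT')]
print(filtered_then_numbered)   # [(1, 'AAAAAGATAAGCTAATTAAGCTACTGGGTT')]
```

Which numbering is wanted depends on the data; the requirement stated in the thread ("the line number of the sequence") matches the first form.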
From: Arnaud Delobelle on 28 Jan 2010 12:00

nn <pruebauno(a)latinmail.com> writes:

> On Jan 28, 10:50 am, evilweasel <karthikramaswam...(a)gmail.com> wrote:
>> I will make my question a little clearer. I have close to 60,000
>> lines of the data similar to the one I posted. There are various
>> numbers next to the sequence (this is basically the number of times
>> the sequence has been found in a particular sample). So, I would need
>> to ignore the ones containing '0' and write all other sequences
>> (excluding the number, since it is trivial) in a new text file, in the
>> following format:
>>
>> >seq59902
>>
>> TTTTTTTATAAAATATATAGT
>>
>> >seq59903
>>
>> TTTTTTTATTTCTTGGCGTTGT
>>
>> >seq59904
>>
>> TTTTTTTGGTTGCCCTGCGTGG
>>
>> >seq59905
>>
>> TTTTTTTGTTTATTTTTGGG
>>
>> The number next to 'seq' is the line number of the sequence. When I
>> run the above program, what I expect is an output file that is similar
>> to the above output but with the ones containing '0' ignored. But, I
>> am getting all the sequences printed in the file.
>>
>> Kindly excuse the 'newbieness' of the program. :) I am hoping to
>> improve in the next few months. Thanks to all those who replied. I
>> really appreciate it. :)
>
> People have already given you some pointers to your problem. In the
> end you will have to "tweak the details" because only you have access
> to the data, not us.
>
> Just as an example, here is another way to do what you are doing:
>
> with open('dnain.dat') as infile, open('dnaout.dat','w') as outfile:
>     partgen=(line.split() for line in infile)
>     dnagen=(str(i+1)+'\n'+part[0]+'\n'
>             for i,part in enumerate(partgen)
>             if len(part)>1 and part[1]!='0')
>     outfile.writelines(dnagen)

I think that generator expressions are overrated :) What's wrong with:

    with open('dnain.dat') as infile, open('dnaout.dat','w') as outfile:
        for i, line in enumerate(infile):
            parts = line.split()
            if len(parts) > 1 and parts[1] != '0':
                outfile.write(">seq%s\n%s\n" % (i+1, parts[0]))

(untested)

-- 
Arnaud
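
Arnaud marks the loop above as untested. As a quick check, the sketch below wraps the same loop in a function and runs it on the five sample lines from the original post, with in-memory io.StringIO objects standing in for dnain.dat and dnaout.dat. The function name write_fasta and the in-memory files are assumptions made for illustration, not part of the thread.

```python
import io


def write_fasta(infile, outfile):
    """Copy records whose count is not '0' to outfile as FASTA entries.

    write_fasta is a hypothetical wrapper around the loop from the post.
    """
    for i, line in enumerate(infile):
        parts = line.split()
        # Skip blank or one-field lines and records whose count is "0".
        if len(parts) > 1 and parts[1] != '0':
            outfile.write(">seq%s\n%s\n" % (i + 1, parts[0]))


# The five sample records from the original post.
sample = io.StringIO(
    "AAAAAGACTCGAGTGCGCGGA 0\n"
    "AAAAAGATAAGCTAATTAAGCTACTGG 0\n"
    "AAAAAGATAAGCTAATTAAGCTACTGGGTT 1\n"
    "AAAAAGGGGGCTCACAGGGGAGGGGTAT 1\n"
    "AAAAAGGTCGCCTGACGGCTGC 0\n"
)
result = io.StringIO()
write_fasta(sample, result)
print(result.getvalue())
```

On this sample only lines 3 and 4 survive, so the output holds the entries >seq3 and >seq4. Records are written as they are read, so memory use does not grow with the size of the input.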