From: Mensanator on 28 Jan 2010 13:38

On Jan 28, 12:28 pm, Steven Howe <howe.ste...(a)gmail.com> wrote:
> On 01/28/2010 09:49 AM, Jean-Michel Pichavant wrote:
>
> > evilweasel wrote:
> >> I will make my question a little more clearer. I have close to 60,000
> >> lines of the data similar to the one I posted. There are various
> >> numbers next to the sequence (this is basically the number of times
> >> the sequence has been found in a particular sample). So, I would need
> >> to ignore the ones containing '0' and write all other sequences
> >> (excluding the number, since it is trivial) in a new text file, in the
> >> following format:
> >
> >> >seq59902
> >> TTTTTTTATAAAATATATAGT
> >
> >> >seq59903
> >> TTTTTTTATTTCTTGGCGTTGT
> >
> >> >seq59904
> >> TTTTTTTGGTTGCCCTGCGTGG
> >
> >> >seq59905
> >> TTTTTTTGTTTATTTTTGGG
> >
> >> The number next to 'seq' is the line number of the sequence. When I
> >> run the above program, what I expect is an output file that is similar
> >> to the above output but with the ones containing '0' ignored. But, I
> >> am getting all the sequences printed in the file.
> >
> >> Kindly excuse the 'newbieness' of the program. :) I am hoping to
> >> improve in the next few months. Thanks to all those who replied. I
> >> really appreciate it. :)
> > Using regexp may increase readability (if you are familiar with it).
> > What about
> >
> > import re
> >
> > output = open("sequences1.txt", 'w')
> >
> > for index, line in enumerate(open(sys.argv[1], 'r')):
> >     match = re.match('(?P<sequence>[GATC]+)\s+1')
> >     if match:
> >         output.write('seq%s\n%s\n' % (index, match.group('sequence')))
> >
> > Jean-Michel
>
> Finally!
>
> After ready 8 or 9 messages about find a line ending with '1', someone
> suggests Regex.
> It was my first thought.

And as a first thought, it is, of course, wrong. You don't want lines
ending in '1'; you want ANY non-'0' amount. Likewise, you don't want to
exclude lines ending in '0', because you'll end up excluding counts of
10, 20, 30, etc. You need a regex that extracts ALL the numeric
characters at the end of the line, and then you exclude the lines where
that number evaluates to 0.

>
> Steven
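
For what it's worth, here is a rough sketch of that idea: capture every digit at the end of the line and compare the count as an integer. The names `LINE_RE` and `nonzero_sequences` are made up for illustration, and the counts of 10 and 00 in the sample are hypothetical rows added to exercise the edge cases. Note that, unlike the quoted snippet, the line is actually passed to `match()`.

```python
import re

# Capture the sequence and the whole trailing count, not just its last digit.
LINE_RE = re.compile(r'(?P<sequence>[GATC]+)\s+(?P<count>\d+)\s*$')

def nonzero_sequences(lines):
    """Yield (line_number, sequence) for lines whose count is not zero."""
    for number, line in enumerate(lines, start=1):
        match = LINE_RE.match(line)
        # int() makes '0' and '00' both count as zero, while '10' is kept.
        if match and int(match.group('count')) != 0:
            yield number, match.group('sequence')

sample = [
    "AAAAAGACTCGAGTGCGCGGA 0\n",
    "AAAAAGATAAGCTAATTAAGCTACTGGGTT 1\n",
    "AAAAAGGGGGCTCACAGGGGAGGGGTAT 10\n",   # hypothetical count
    "AAAAAGGTCGCCTGACGGCTGC 00\n",         # hypothetical count
]
result = list(nonzero_sequences(sample))
```

Lines that do not match the pattern at all are silently skipped here; whether that is the right behaviour depends on the real data.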
From: MRAB on 28 Jan 2010 14:59

Steven Howe wrote:
> On 01/28/2010 09:49 AM, Jean-Michel Pichavant wrote:
>> evilweasel wrote:
>>> I will make my question a little more clearer. I have close to 60,000
>>> lines of the data similar to the one I posted. There are various
>>> numbers next to the sequence (this is basically the number of times
>>> the sequence has been found in a particular sample). So, I would need
>>> to ignore the ones containing '0' and write all other sequences
>>> (excluding the number, since it is trivial) in a new text file, in the
>>> following format:
>>>
>>> >seq59902
>>> TTTTTTTATAAAATATATAGT
>>>
>>> >seq59903
>>> TTTTTTTATTTCTTGGCGTTGT
>>>
>>> >seq59904
>>> TTTTTTTGGTTGCCCTGCGTGG
>>>
>>> >seq59905
>>> TTTTTTTGTTTATTTTTGGG
>>>
>>> The number next to 'seq' is the line number of the sequence. When I
>>> run the above program, what I expect is an output file that is similar
>>> to the above output but with the ones containing '0' ignored. But, I
>>> am getting all the sequences printed in the file.
>>>
>>> Kindly excuse the 'newbieness' of the program. :) I am hoping to
>>> improve in the next few months. Thanks to all those who replied. I
>>> really appreciate it. :)
>> Using regexp may increase readability (if you are familiar with it).
>> What about
>>
>> import re
>>
>> output = open("sequences1.txt", 'w')
>>
>> for index, line in enumerate(open(sys.argv[1], 'r')):
>>     match = re.match('(?P<sequence>[GATC]+)\s+1')
>>     if match:
>>         output.write('seq%s\n%s\n' % (index, match.group('sequence')))
>>
>>
>> Jean-Michel
>
> Finally!
>
> After ready 8 or 9 messages about find a line ending with '1', someone
> suggests Regex.
> It was my first thought.
>
I'm a great fan of regexes, but I never thought of using them for this
because it doesn't look like a regex type of problem to me.
From: nn on 28 Jan 2010 16:22

Arnaud Delobelle wrote:
> nn <pruebauno(a)latinmail.com> writes:
>
> > On Jan 28, 10:50 am, evilweasel <karthikramaswam...(a)gmail.com> wrote:
> >> I will make my question a little more clearer. I have close to 60,000
> >> lines of the data similar to the one I posted. There are various
> >> numbers next to the sequence (this is basically the number of times
> >> the sequence has been found in a particular sample). So, I would need
> >> to ignore the ones containing '0' and write all other sequences
> >> (excluding the number, since it is trivial) in a new text file, in the
> >> following format:
> >>
> >> >seq59902
> >>
> >> TTTTTTTATAAAATATATAGT
> >>
> >> >seq59903
> >>
> >> TTTTTTTATTTCTTGGCGTTGT
> >>
> >> >seq59904
> >>
> >> TTTTTTTGGTTGCCCTGCGTGG
> >>
> >> >seq59905
> >>
> >> TTTTTTTGTTTATTTTTGGG
> >>
> >> The number next to 'seq' is the line number of the sequence. When I
> >> run the above program, what I expect is an output file that is similar
> >> to the above output but with the ones containing '0' ignored. But, I
> >> am getting all the sequences printed in the file.
> >>
> >> Kindly excuse the 'newbieness' of the program. :) I am hoping to
> >> improve in the next few months. Thanks to all those who replied. I
> >> really appreciate it. :)
> >
> > People have already given you some pointers to your problem. In the
> > end you will have to "tweak the details" because only you have access
> > to the data not us.
> >
> > Just as example here is another way to do what you are doing:
> >
> > with open('dnain.dat') as infile, open('dnaout.dat','w') as outfile:
> >     partgen=(line.split() for line in infile)
> >     dnagen=(str(i+1)+'\n'+part[0]+'\n'
> >             for i,part in enumerate(partgen)
> >             if len(part)>1 and part[1]!='0')
> >     outfile.writelines(dnagen)
>
> I think that generator expressions are overrated :) What's wrong with:
>
> with open('dnain.dat') as infile, open('dnaout.dat','w') as outfile:
>     for i, line in enumerate(infile):
>         parts = line.split()
>         if len(parts) > 1 and parts[1] != '0':
>             outfile.write(">seq%s\n%s\n" % (i+1, parts[0]))
>
> (untested)
>
> --
> Arnaud

Nothing really. After posting, I was thinking I should have posted a more
straightforward version like the one you wrote. Now there is! It
probably is more efficient too. I just have a tendency to think in
terms of pipes: "pipe this junk in here, then in here, get output".
Probably damage from too much Unix scripting.

Since I can't resist the urge to post crazy code, here goes the bonus
round (don't do this at work):

open('dnaout.dat','w').writelines(
    'seq%s\n%s\n'%(i+1,part[0])
    for i,part in enumerate(line.split() for line in open('dnain.dat'))
    if len(part)>1 and part[1]!='0')
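
For anyone who wants to try the plain-loop version without touching real files, here is a small sketch along the same lines. It is not exactly Arnaud's code: the function name `write_nonzero` is made up, the files are replaced by in-memory `io.StringIO` buffers, and the count is compared as an integer (an assumption about the data) so that something like '00' is also treated as zero.

```python
import io

def write_nonzero(infile, outfile):
    """Write '>seqN' / sequence pairs for each line with a non-zero count.

    Returns the number of sequences written.
    """
    written = 0
    for number, line in enumerate(infile, start=1):
        parts = line.split()
        # Skip blank or malformed lines, then compare the count numerically.
        if len(parts) > 1 and parts[1].isdigit() and int(parts[1]) != 0:
            outfile.write(">seq%s\n%s\n" % (number, parts[0]))
            written += 1
    return written

# In-memory stand-ins for 'dnain.dat' and 'dnaout.dat'.
source = io.StringIO(
    "AAAAAGACTCGAGTGCGCGGA 0\n"
    "AAAAAGATAAGCTAATTAAGCTACTGGGTT 1\n"
    "AAAAAGGTCGCCTGACGGCTGC 0\n"
)
target = io.StringIO()
count = write_nonzero(source, target)
```

With real files, the two buffers would simply be replaced by the objects from `open('dnain.dat')` and `open('dnaout.dat', 'w')` inside a `with` statement.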
From: Arnaud Delobelle on 28 Jan 2010 16:42

nn <pruebauno(a)latinmail.com> writes:
> After posting I was thinking I should have posted a more
> straightforward version like the one you wrote. Now there is! It
> probably is more efficient too. I just have a tendency to think in
> terms of pipes: "pipe this junk in here, then in here, get output".
> Probably damage from too much Unix scripting.

This is funny, I did think *exactly* this when I saw your code :)

--
Arnaud
From: Johann Spies on 29 Jan 2010 04:23

On Thu, Jan 28, 2010 at 07:07:04AM -0800, evilweasel wrote:
> Hi folks,
>
> I am a newbie to python, and I would be grateful if someone could
> point out the mistake in my program. Basically, I have a huge text
> file similar to the format below:
>
> AAAAAGACTCGAGTGCGCGGA 0
> AAAAAGATAAGCTAATTAAGCTACTGG 0
> AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
> AAAAAGGGGGCTCACAGGGGAGGGGTAT 1
> AAAAAGGTCGCCTGACGGCTGC 0

I know this is a Python list, but if you really want to get the job done
quickly, this is one method without writing Python code:

$ cat /tmp/y
AAAAAGACTCGAGTGCGCGGA 0
AAAAAGATAAGCTAATTAAGCTACTGG 0
AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
AAAAAGGGGGCTCACAGGGGAGGGGTAT 1
AAAAAGGTCGCCTGACGGCTGC 0
$ grep -v 0 /tmp/y > /tmp/z
$ cat /tmp/z
AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
AAAAAGGGGGCTCACAGGGGAGGGGTAT 1

Regards
Johann
--
Johann Spies          Telefoon: 021-808 4599
Informasietegnologie, Universiteit van Stellenbosch

"My son, if sinners entice thee, consent thou not." Proverbs 1:10
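
One caveat with `grep -v 0`: it drops every line that contains the character 0 anywhere, so a count such as 10 would be thrown away as well. The sketch below, in Python to stay with the list's language, contrasts the two behaviours. Both function names and the count of 10 in the sample are hypothetical, added only to show the difference.

```python
def drop_substring_zero(lines):
    """Mimic `grep -v 0`: drop any line containing the character '0'."""
    return [line for line in lines if '0' not in line]

def drop_zero_count(lines):
    """Keep only lines whose last field is a number other than zero.

    Lines without a numeric last field are dropped too.
    """
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) > 1 and fields[-1].isdigit() and int(fields[-1]) != 0:
            kept.append(line)
    return kept

sample = [
    "AAAAAGACTCGAGTGCGCGGA 0",
    "AAAAAGATAAGCTAATTAAGCTACTGGGTT 1",
    "AAAAAGGGGGCTCACAGGGGAGGGGTAT 10",   # hypothetical count
]
by_substring = drop_substring_zero(sample)   # loses the line with count 10
by_count = drop_zero_count(sample)           # keeps counts 1 and 10
```

On the five sample lines from the original post, where every count is 0 or 1, both approaches give the same result.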