From: Mensanator on
On Jan 28, 12:28 pm, Steven Howe <howe.ste...(a)gmail.com> wrote:
> On 01/28/2010 09:49 AM, Jean-Michel Pichavant wrote:
>
> > evilweasel wrote:
> >> I will make my question a little more clearer. I have close to 60,000
> >> lines of the data similar to the one I posted. There are various
> >> numbers next to the sequence (this is basically the number of times
> >> the sequence has been found in a particular sample). So, I would need
> >> to ignore the ones containing '0' and write all other sequences
> >> (excluding the number, since it is trivial) in a new text file, in the
> >> following format:
>
> >> >seq59902
> >> TTTTTTTATAAAATATATAGT
>
> >> >seq59903
> >> TTTTTTTATTTCTTGGCGTTGT
>
> >> >seq59904
> >> TTTTTTTGGTTGCCCTGCGTGG
>
> >> >seq59905
> >> TTTTTTTGTTTATTTTTGGG
>
> >> The number next to 'seq' is the line number of the sequence. When I
> >> run the above program, what I expect is an output file that is similar
> >> to the above output but with the ones containing '0' ignored. But, I
> >> am getting all the sequences printed in the file.
>
> >> Kindly excuse the 'newbieness' of the program. :) I am hoping to
> >> improve in the next few months. Thanks to all those who replied. I
> >> really appreciate it. :)
> > Using regexp may increase readability (if you are familiar with it).
> > What about
>
> > import re
> > import sys
>
> > output = open("sequences1.txt", 'w')
>
> > for index, line in enumerate(open(sys.argv[1], 'r')):
> >    match = re.match(r'(?P<sequence>[GATC]+)\s+1', line)
> >    if match:
> >        output.write('seq%s\n%s\n' % (index, match.group('sequence')))
>
> > Jean-Michel
>
> Finally!
>
> After reading 8 or 9 messages about finding a line ending with '1',
> someone suggests Regex.
> It was my first thought.

And as a first thought, it is, of course, wrong.

You don't want lines ending in '1', you want ANY non-'0' amount.

Likewise, you don't want to exclude lines ending in '0' because
you'll end up excluding counts of 10, 20, 30, etc.

You need a regex that extracts ALL the numeric characters at the end
of the line and excludes the lines where they evaluate to 0.
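
Something along these lines, as a rough sketch only. It assumes each
line is a run of G/A/T/C, some whitespace, then an integer count.
LINE_RE and nonzero_sequences are names made up for this example; they
are not from anyone's posted code.

```python
import re

# A sequence of G/A/T/C, whitespace, then ALL the trailing digits as the count.
LINE_RE = re.compile(r'^(?P<sequence>[GATC]+)\s+(?P<count>\d+)\s*$')

def nonzero_sequences(lines):
    """Yield (line_number, sequence) for lines whose count is not zero."""
    for number, line in enumerate(lines, start=1):
        match = LINE_RE.match(line)
        # int() treats '0' and '00' as zero, while '10' and '20' survive.
        if match and int(match.group('count')) != 0:
            yield number, match.group('sequence')

# Made-up sample lines in the format from the original question.
sample = [
    "AAAAAGACTCGAGTGCGCGGA 0\n",
    "AAAAAGATAAGCTAATTAAGCTACTGGGTT 1\n",
    "AAAAAGGTCGCCTGACGGCTGC 10\n",
]
for number, sequence in nonzero_sequences(sample):
    print(">seq%d\n%s" % (number, sequence))
```

Converting the captured digits with int() instead of comparing strings
is what keeps the counts 10, 20, 30 from being thrown away.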

>
> Steven

From: MRAB on
Steven Howe wrote:
> On 01/28/2010 09:49 AM, Jean-Michel Pichavant wrote:
>> evilweasel wrote:
>>> I will make my question a little more clearer. I have close to 60,000
>>> lines of the data similar to the one I posted. There are various
>>> numbers next to the sequence (this is basically the number of times
>>> the sequence has been found in a particular sample). So, I would need
>>> to ignore the ones containing '0' and write all other sequences
>>> (excluding the number, since it is trivial) in a new text file, in the
>>> following format:
>>>
>>> >seq59902
>>> TTTTTTTATAAAATATATAGT
>>>
>>> >seq59903
>>> TTTTTTTATTTCTTGGCGTTGT
>>>
>>> >seq59904
>>> TTTTTTTGGTTGCCCTGCGTGG
>>>
>>> >seq59905
>>> TTTTTTTGTTTATTTTTGGG
>>>
>>> The number next to 'seq' is the line number of the sequence. When I
>>> run the above program, what I expect is an output file that is similar
>>> to the above output but with the ones containing '0' ignored. But, I
>>> am getting all the sequences printed in the file.
>>>
>>> Kindly excuse the 'newbieness' of the program. :) I am hoping to
>>> improve in the next few months. Thanks to all those who replied. I
>>> really appreciate it. :)
>> Using regexp may increase readability (if you are familiar with it).
>> What about
>>
>> import re
>> import sys
>>
>> output = open("sequences1.txt", 'w')
>>
>> for index, line in enumerate(open(sys.argv[1], 'r')):
>>     match = re.match(r'(?P<sequence>[GATC]+)\s+1', line)
>>     if match:
>>         output.write('seq%s\n%s\n' % (index, match.group('sequence')))
>>
>>
>> Jean-Michel
>
> Finally!
>
> After reading 8 or 9 messages about finding a line ending with '1',
> someone suggests Regex.
> It was my first thought.
>
I'm a great fan of regexes, but I never thought of using them for this
because it doesn't look like a regex type of problem to me.

From: nn on

Arnaud Delobelle wrote:
> nn <pruebauno(a)latinmail.com> writes:
>
> > On Jan 28, 10:50 am, evilweasel <karthikramaswam...(a)gmail.com> wrote:
> >> I will make my question a little more clearer. I have close to 60,000
> >> lines of the data similar to the one I posted. There are various
> >> numbers next to the sequence (this is basically the number of times
> >> the sequence has been found in a particular sample). So, I would need
> >> to ignore the ones containing '0' and write all other sequences
> >> (excluding the number, since it is trivial) in a new text file, in the
> >> following format:
> >>
> >> >seq59902
> >>
> >> TTTTTTTATAAAATATATAGT
> >>
> >> >seq59903
> >>
> >> TTTTTTTATTTCTTGGCGTTGT
> >>
> >> >seq59904
> >>
> >> TTTTTTTGGTTGCCCTGCGTGG
> >>
> >> >seq59905
> >>
> >> TTTTTTTGTTTATTTTTGGG
> >>
> >> The number next to 'seq' is the line number of the sequence. When I
> >> run the above program, what I expect is an output file that is similar
> >> to the above output but with the ones containing '0' ignored. But, I
> >> am getting all the sequences printed in the file.
> >>
> >> Kindly excuse the 'newbieness' of the program. :) I am hoping to
> >> improve in the next few months. Thanks to all those who replied. I
> >> really appreciate it. :)
> >
> > People have already given you some pointers to your problem. In the
> > end you will have to "tweak the details" because only you have access
> > to the data not us.
> >
> > Just as example here is another way to do what you are doing:
> >
> > with open('dnain.dat') as infile, open('dnaout.dat','w') as outfile:
> >     partgen=(line.split() for line in infile)
> >     dnagen=(str(i+1)+'\n'+part[0]+'\n'
> >             for i,part in enumerate(partgen)
> >             if len(part)>1 and part[1]!='0')
> >     outfile.writelines(dnagen)
>
> I think that generator expressions are overrated :) What's wrong with:
>
> with open('dnain.dat') as infile, open('dnaout.dat','w') as outfile:
>     for i, line in enumerate(infile):
>         parts = line.split()
>         if len(parts) > 1 and parts[1] != '0':
>             outfile.write(">seq%s\n%s\n" % (i+1, parts[0]))
>
> (untested)
>
> --
> Arnaud

Nothing really,
After posting I was thinking I should have posted a more
straightforward version like the one you wrote. Now there is! It
probably is more efficient too. I just have a tendency to think in
terms of pipes: "pipe this junk in here, then in here, get output".
Probably damage from too much Unix scripting.

Since I can't resist the urge to post crazy code, here goes the bonus
round (don't do this at work):

open('dnaout.dat','w').writelines(
    'seq%s\n%s\n'%(i+1,part[0])
    for i,part in enumerate(line.split() for line in open('dnain.dat'))
    if len(part)>1 and part[1]!='0')

From: Arnaud Delobelle on
nn <pruebauno(a)latinmail.com> writes:
> After posting I was thinking I should have posted a more
> straightforward version like the one you wrote. Now there is! It
> probably is more efficient too. I just have a tendency to think in
> terms of pipes: "pipe this junk in here, then in here, get output".
> Probably damage from too much Unix scripting.

This is funny, I did think *exactly* this when I saw your code :)

--
Arnaud

From: Johann Spies on
On Thu, Jan 28, 2010 at 07:07:04AM -0800, evilweasel wrote:
> Hi folks,
>
> I am a newbie to python, and I would be grateful if someone could
> point out the mistake in my program. Basically, I have a huge text
> file similar to the format below:
>
> AAAAAGACTCGAGTGCGCGGA 0
> AAAAAGATAAGCTAATTAAGCTACTGG 0
> AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
> AAAAAGGGGGCTCACAGGGGAGGGGTAT 1
> AAAAAGGTCGCCTGACGGCTGC 0

I know this is a python list but if you really want to get the job
done quickly this is one method without writing python code:

$ cat /tmp/y
AAAAAGACTCGAGTGCGCGGA 0
AAAAAGATAAGCTAATTAAGCTACTGG 0
AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
AAAAAGGGGGCTCACAGGGGAGGGGTAT 1
AAAAAGGTCGCCTGACGGCTGC 0
$ grep -v ' 0$' /tmp/y > /tmp/z
$ cat /tmp/z
AAAAAGATAAGCTAATTAAGCTACTGGGTT 1
AAAAAGGGGGCTCACAGGGGAGGGGTAT 1

Regards
Johann