finding homopolymers in both directions [Python]

From: Lee Sander on 3 Aug 2010 13:34

Hi,
Suppose I have a string such as this
'aabccccccefggggghiiijkr'

I would like to print out all the positions that are flanked by a run
of symbols.
For example, I would like the output for the above input to be as
follows:

2 b 1 aa
2 b -1 cccccc
10 e -1 cccccc
11 f 1 ggggg
17 h 1 iii
17 h -1 ggggg

where the first column is the position of interest, the next column is
the entry at that position, and the third column is 1 if the fourth column
refers to a run that comes after the position and -1 if the run comes
before it.

I can do this easily in the forward direction (shown below), but it is
not clear to me how to do it backwards.

I would really appreciate it if someone could help with this problem.

I feel like a regex solution would be possible, but I am not very good
with regex.

The code for forward is as follows:

def homopolymericSites(Seq):
    Seq = Seq.upper()
    i = 0
    len_seq = len(Seq) - 1  # hack to prevent boundary condition
    while i < len_seq:
        bi = Seq[i]
        k = 1
        # go to the start of a homopolymer
        while 1:
            if i + k >= len_seq:
                break  # no more sequence left
            if bi == Seq[i + k]:
                k += 1
            else:
                break
        if k > 1:  # homopolymer length
            i = i + k
            id_of_chr_which_proceeds_homopolymer = Seq[i]  # note not i+1
            # +1 to convert it to 1-index notation
            pos_of_chr_which_proceeds_homopolymer = i + 1
            id_of_homopolymer = Seq[i - 1]
            length_of_homopolymer = k

            print("%s\t%s/%s\t%s" % (pos_of_chr_which_proceeds_homopolymer,
                                     id_of_chr_which_proceeds_homopolymer,
                                     id_of_homopolymer,
                                     length_of_homopolymer))
        else:
            i += 1

From: Peter Otten on 3 Aug 2010 14:31

Lee Sander wrote:

> Hi,
> Suppose I have a string such as this
> 'aabccccccefggggghiiijkr'
>
> I would like to print out all the positions that are flanked by a run
> of symbols.
> For example, I would like the output for the above input to be as
> follows:
>
> 2 b 1 aa
> 2 b -1 cccccc
> 10 e -1 cccccc
> 11 f 1 ggggg
> 17 h 1 iii
> 17 h -1 ggggg
>
> where the first column is the position of interest, the next column is
> the entry at that position, and the third column is 1 if the fourth column
> refers to a run that comes after the position and -1 if the run comes
> before it.

Trying to follow your spec, I came up with:

from itertools import groupby
from collections import namedtuple

Item = namedtuple("Item", "pos key size")

def compact(seq):
    # run-length encode seq: one Item per run of equal symbols
    pos = 0
    for key, group in groupby(seq):
        size = len(list(group))
        yield Item(pos, key, size)
        pos += size

def window(items):
    # yield (previous, current, next) triples; None marks a missing neighbour
    items = iter(items)
    prev = None
    cur = next(items)
    for nxt in items:
        yield prev, cur, nxt
        prev = cur
        cur = nxt
    yield prev, cur, None

items = compact("aabccccccefggggghiiijkr")

for prev, cur, nxt in window(items):
    if cur.size == 1:
        if prev is not None:
            if prev.size > 1:
                print(cur.pos, cur.key, -1, prev.key*prev.size)
        if nxt is not None:
            if nxt.size > 1:
                print(cur.pos, cur.key, 1, nxt.key*nxt.size)

However, this gives a slightly different output:

$ python homopolymers.py
2 b -1 aa
2 b 1 cccccc
9 e -1 cccccc
10 f 1 ggggg
16 h -1 ggggg
16 h 1 iii
20 j -1 iii
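
Since a regex solution was mentioned: the following is only a rough sketch of how one might look, not a tested replacement for the script above. The names RUN and flanked_sites are made up for illustration. It assumes 0-based positions and the same rules as the groupby version: only single characters are reported, 1 means the run follows the character, and -1 means the run precedes it.

```python
import re

# A character followed by one or more copies of itself, i.e. a run of length >= 2.
RUN = re.compile(r"(.)\1+")

def flanked_sites(seq):
    """Return sorted (pos, char, direction, run) tuples for single
    characters that sit directly next to a run.

    pos is 0-based; direction is 1 if the run follows the character
    and -1 if the run precedes it.
    """
    runs = list(RUN.finditer(seq))

    # Positions that belong to some run; these are never reported themselves.
    in_run = set()
    for m in runs:
        in_run.update(range(m.start(), m.end()))

    hits = []
    for m in runs:
        before, after = m.start() - 1, m.end()
        if before >= 0 and before not in in_run:
            hits.append((before, seq[before], 1, m.group()))
        if after < len(seq) and after not in in_run:
            hits.append((after, seq[after], -1, m.group()))
    return sorted(hits)

for pos, char, direction, run in flanked_sites("aabccccccefggggghiiijkr"):
    print(pos, char, direction, run)
```

Each match is looked at once and both of its neighbours are checked, so the forward and backward cases fall out of the same loop. Sorting the tuples puts the -1 entry before the 1 entry when a position is flanked on both sides.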

Peter
