From: ela on 10 Aug 2010 03:39

I'm new to database programming and just previously learnt to use loops to
look up and enrich information using the following code. However, when the
tables are large, I find this process is very slow. Then somebody told me I
can build a database for one of the files in real time, so there is no need
to read the file from the beginning till the end again and again. However,
Perl DBI has a lot of sophisticated functions, and in fact my tables are
only large but nothing special, linked by an ID. Is there any simple way to
achieve the same purpose? I just wish the ID could be indexed so that every
time I access the record it goes through memory and not through I/O...

#!/usr/bin/perl

my ($listfile, $format, $accfile, $infofile) = @ARGV;
print '($listfile, $accfile, $infofile)'; <STDIN>;

print "Working on $listfile...\n";
$outname = $listfile . "_" . $infofile . ".xls";

open (OFP, ">$outname");

open(FP, $listfile);
$line = <FP>;
chomp $line;

if ($format ne "") {
    @fields = split(/\t/, $line);
    for ($i=0; $i<@fields; $i++) {
        ############## check fields ###############################
        if ( $fields[$i] =~ /accession/) {
            $acci = $i;
        }
    }
}

print OFP "$line\tgene info\n";

$nl = '\n';

while (<FP>) {
    $line = $_;
    if ($line eq "\n") {
        print OFP $line;
        next;
    }
    chomp $line;

    if ($format eq "") {
        @cells = split (/:/, $line);
        $tag = $cells[0];
    } else {
        @cells = split (/\t/, $line);
        $tag = $cells[$acci];
    }

    open(AFP, $accfile);

    while (<AFP>) {
        @cells = split (/\t/, $_);
        if ($cells[5] =~ /$tag/) {
            $des = $cells[1];
            last;
        }
    }
    close AFP;

    if ($found == 0) {
        print OFP "$line\tNo gene info available\n";
    }
}
close FP;
From: Jens Thoms Toerring on 10 Aug 2010 08:28

ela <ela(a)yantai.org> wrote:
> I'm new to database programming and just previously learnt to use loops
> to look up and enrich information using the following code. However,
> when the tables are large,

Which tables? Do you mean 'files'?

> I find this process is very slow. Then somebody told me I can build a
> database for one of the files in real time, so there is no need to read
> the file from the beginning till the end again and again. However, Perl
> DBI has a lot of sophisticated functions, and in fact my tables are only
> large but nothing special, linked by an ID. Is there any simple way to
> achieve the same purpose? I just wish the ID could be indexed so that
> every time I access the record it goes through memory and not through
> I/O...

> #!/usr/bin/perl

Please, please use

    use strict;
    use warnings;

It will tell you about a lot of potential problems.

> my ($listfile, $format, $accfile, $infofile) = @ARGV;
> print '($listfile, $accfile, $infofile)'; <STDIN>;

What's that <STDIN> at the end of the line good for?

> print "Working on $listfile...\n";
> $outname = $listfile . "_" . $infofile . ".xls";
> open (OFP, ">$outname");

Better use the three-argument form of open and use normal (lexical)
variables for file handles, this isn't Perl 4 anymore...

    open my $ofp, '>', $outname or die "Can't open $outname for writing\n";

Also, checking that opening a file succeeded shouldn't be left out without
very good reasons...

> open(FP, $listfile);
> $line = <FP>;
> chomp $line;
> if ($format ne "") {
>     @fields = split(/\t/, $line);
>     for ($i=0; $i<@fields; $i++) {
>         ############## check fields ###############################
>         if ( $fields[$i] =~ /accession/) {

Are you aware that this will also match e.g. 'disaccession_123'?

>             $acci = $i;
>         }
>     }
> }
> print OFP "$line\tgene info\n";
> $nl = '\n';
> while (<FP>) {
>     $line = $_;

Why don't you read directly into '$line' instead of making an additional
copy?

>     if ($line eq "\n") {
>         print OFP $line;
>         next;
>     }
>     chomp $line;
>     if ($format eq "") {
>         @cells = split (/:/, $line);
>         $tag = $cells[0];
>     } else {
>         @cells = split (/\t/, $line);
>         $tag = $cells[$acci];
>     }
>     open(AFP, $accfile);
>     while (<AFP>) {
>         @cells = split (/\t/, $_);
>         if ($cells[5] =~ /$tag/) {
>             $des = $cells[1];
>             last;
>         }
>     }
>     close AFP;
>     if ($found == 0) {
>         print OFP "$line\tNo gene info available\n";
>     }

Huh? '$found' is nowhere else used in your program. With 'use warnings' you
would have gotten a warning that you use the value of an uninitialized
variable...

> }
> close FP;

Probably the most time-consuming part of your program is that for each line
of the file with the name '$listfile' you read in at least a certain portion
of '$accfile', again and again. To get around that you don't need a
database, you just have to read it in only once and store the relevant
information, e.g. in a hash.
If you would do something like

    open my $afp, '<', $accfile or die "Can't open $accfile for reading\n";
    my %ahash;
    while ( my $line = <$afp> ) {
        my @cells = split /\t/, $line;
        $ahash{ $cells[ 5 ] } = $cells[ 1 ];
    }
    close $afp;

somewhere at the beginning, then you would have all the information you use
from the '$accfile' file in the %ahash hash and there would be no need to
read the file again and again:

    while ( my $line = <$fp> ) {
        if ( $line eq "\n" ) {
            print $ofp "\n";
            next;
        }
        chomp $line;
        if ( $format eq "" ) {
            @cells = split /:/, $line;
            $tag = $cells[ 0 ];
        } else {
            @cells = split /\t/, $line;
            $tag = $cells[ $acci ];
        }
        $des = $ahash{ $tag } if exists $ahash{ $tag };
    }
    close $fp;

Putting things in a database won't do too much good here since, unless you
have an in-memory database, the database will also put the information on
disk and will have to retrieve it from there (but for sure a lot faster than
re-reading a file for a bit of information lots of times ;-)

The only case I can think of where using a database may be beneficial here
is when '$accfile' is extremely large and the '%ahash' would use up all the
memory you have. In that case putting things in a database (on disk then, of
course) for relatively fast lookup of the value for a key (i.e. what you
have in the '$tag' variable) might be a reasonable alternative.

Regards, Jens
--
\ Jens Thoms Toerring ___ jt(a)toerring.de
\__________________________ http://toerring.de
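A minimal sketch of that on-disk alternative, assuming the DB_File module
(and Berkeley DB) is available; the file names and the example accession are
made up for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    my ($accfile, $dbfile) = ('acc.txt', 'acc.db');   # hypothetical names

    # Tie a hash to a file on disk; keys and values are stored persistently.
    tie my %ahash, 'DB_File', $dbfile, O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "Cannot tie $dbfile: $!\n";

    # Build the index once (or whenever $accfile changes).
    open my $afp, '<', $accfile or die "Can't open $accfile for reading: $!\n";
    while ( my $line = <$afp> ) {
        chomp $line;
        my @cells = split /\t/, $line;
        $ahash{ $cells[5] } = $cells[1];
    }
    close $afp;

    # Later lookups (even in a separate run of the script) go to the
    # on-disk index instead of re-reading $accfile from the top:
    my $tag = 'NM_000546';                            # hypothetical accession
    print "$ahash{$tag}\n" if exists $ahash{$tag};

    untie %ahash;

Note that a tied DB_File hash, like the plain %ahash above, does an exact
key lookup, whereas the original script matched the accession column with a
regular expression.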
From: wolf on 10 Aug 2010 08:48

ela schrieb:
> I'm new to database programming and just previously learnt to use loops
> to look up and enrich information using the following code. However,
> when the tables are large, I find this process is very slow. Then
> somebody told me I can build a database for one of the files in real
> time, so there is no need to read the file from the beginning till the
> end again and again. However, Perl DBI has a lot of sophisticated
> functions, and in fact my tables are only large but nothing special,
> linked by an ID. Is there any simple way to achieve the same purpose? I
> just wish the ID could be indexed so that every time I access the record
> it goes through memory and not through I/O...
>
> #!/usr/bin/perl
>
> my ($listfile, $format, $accfile, $infofile) = @ARGV;
> print '($listfile, $accfile, $infofile)'; <STDIN>;
>
> print "Working on $listfile...\n";
> $outname = $listfile . "_" . $infofile . ".xls";
>
> open (OFP, ">$outname");
>
> open(FP, $listfile);
> $line = <FP>;
> chomp $line;
>
> if ($format ne "") {
>     @fields = split(/\t/, $line);
>     for ($i=0; $i<@fields; $i++) {
>         ############## check fields ###############################
>         if ( $fields[$i] =~ /accession/) {
>             $acci = $i;
>         }
>     }
> }
>
> print OFP "$line\tgene info\n";
>
> $nl = '\n';
>
> while (<FP>) {
>     $line = $_;
>     if ($line eq "\n") {
>         print OFP $line;
>         next;
>     }
>     chomp $line;
>
>     if ($format eq "") {
>         @cells = split (/:/, $line);
>         $tag = $cells[0];
>     } else {
>         @cells = split (/\t/, $line);
>         $tag = $cells[$acci];
>     }
>
>     open(AFP, $accfile);
>
>     while (<AFP>) {
>         @cells = split (/\t/, $_);
>         if ($cells[5] =~ /$tag/) {
>             $des = $cells[1];
>             last;
>         }
>     }
>     close AFP;
>
>     if ($found == 0) {
>         print OFP "$line\tNo gene info available\n";
>     }
> }
> close FP;

Hi ela,

without going too deeply into your code, let's just say that you should
always start your Perl scripts with

    #!/usr/bin/perl
    use warnings;
    use strict;

and if you can't make it run with these restrictions there is something
seriously flaky about the approach you are pursuing.

Apart from the Perl aspect, there are some serious information issues you
need to address. From what I can gather of your description, you are reading
in a file that contains some kind of gene information, and you want to index
that information so that retrieval is much faster than iterating
SEQUENTIALLY over the whole file (or series of files) every time you need an
answer. Is my assumption thus far right?

But to assess that, some real-life info on what you are actually trying to
do is needed :p How big is/are the files - that is, how big will that index
be? What is the actual index going to be, etc.? Only after that part becomes
clear is a solution possible. And you need to communicate that.

cheers, wolf
From: Jürgen Exner on 10 Aug 2010 09:17

"ela" <ela(a)yantai.org> wrote:
>I'm new to database programming and just previously learnt to use loops
>to look up and enrich information using the following code. However,
>when the tables are large, I find this process is very slow. Then
>somebody told me I can build a database for one of the files in real
>time, so there is no need to read the file from the beginning till the
>end again and again.

What I gathered from your code without going into details is that for each
line of FP you are opening, reading through, and closing AFP.

I/O operations are by far the slowest operations, and there is a trivial
solution that will probably speed up your program dramatically: instead of
reading AFP again and again and again, just read it into an array once at
the beginning of your program and then loop over that array instead of over
the file.

Only if AFP is too large for that (several GB) may you need to look for a
better algorithmic solution. That requires knowledge and experience, and a
database may or may not help, depending on what you actually are trying to
achieve.

jue
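A minimal sketch of that read-once approach, keeping the original matching
logic and variable names from ela's script (so $accfile and $tag are assumed
to be set up as before):

    # Read the accession file into an array once, at the start of the program.
    open my $afp, '<', $accfile or die "Can't open $accfile: $!\n";
    my @acc_lines = <$afp>;
    close $afp;

    # ... then, inside the per-line loop over $listfile, scan the in-memory
    # copy instead of re-opening and re-reading the file each time:
    my $des;
    for my $acc_line (@acc_lines) {
        my @cells = split /\t/, $acc_line;
        if ( $cells[5] =~ /$tag/ ) {
            $des = $cells[1];
            last;
        }
    }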
From: ccc31807 on 10 Aug 2010 09:43

On Aug 10, 3:39 am, "ela" <e...(a)yantai.org> wrote:
> I'm new to database programming and just previously learnt to use loops
> to look up and enrich information using the following code. However,
> when the tables are large, I find this process is very slow. Then
> somebody told me I can build a database for one of the files in real
> time, so there is no need to read the file from the beginning till the
> end again and again. However, Perl DBI has a lot of sophisticated
> functions, and in fact my tables are only large but nothing special,
> linked by an ID. Is there any simple way to achieve the same purpose? I
> just wish the ID could be indexed so that every time I access the record
> it goes through memory and not through I/O...

You have input, which you want to process and turn into output. Your input
consists of data contained in some kind of file. This is exactly the kind of
task that Perl excels at.

You have two choices: (1) you can use a database to store and query your
data, or (2) you can use your computer's memory to store and query your
data.

If you have a large amount of permanent data that you need to add to, delete
from, and change, your best strategy is to use a database. Read your data
file into your database. Most databases have external commands (i.e., not
SQL) for doing that, so it should be straightforward and easy -- note that
you do not use Perl for this, and probably shouldn't.

If you have a small to moderate amount of data, whether permanent or
temporary, that you don't need to add to, delete from, or modify, your best
strategy is to use your computer's memory to store and query your data.
Simply open the file, read each line, destructure each line into a key and
value, and stuff it into a hash. For example, suppose your data looks like
this:

    12345,George,Washington,First
    23456,John,Adams,Second
    34567,Thomas,Jefferson,Third
    45678,James,Madison,Fourth

You can do this:

    my %pres;
    open PRES, '<', 'data.csv' or die "$!";
    while (<PRES>) {
        chomp;
        my ($id, $first, $last, $place) = split /,/;
        $pres{$place} = "$id, $first, $last";
    }
    close PRES;

If you need a multilevel data structure, see the documentation, starting
maybe with lists of lists.

CC.
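A hedged sketch of the multilevel variant CC alludes to, using a hash of
array references so that one key can hold several records; it reuses the
sample data above, and the field names are illustrative:

    use strict;
    use warnings;

    my %by_last;
    open my $pres, '<', 'data.csv' or die "$!";
    while (<$pres>) {
        chomp;
        my ($id, $first, $last, $place) = split /,/;
        # Push a small record onto the list stored under the last name.
        push @{ $by_last{$last} },
            { id => $id, first => $first, place => $place };
    }
    close $pres;

    # Look up every record filed under 'Adams' (empty list if none).
    for my $rec ( @{ $by_last{'Adams'} || [] } ) {
        print "$rec->{id}: $rec->{first} ($rec->{place})\n";
    }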