From: Kyp on 31 Jan 2010 11:22

I have a dir with a large # of files that I need to perform operations
on, but only needing to access a subset of the files, i.e. the first
100 files.

Using glob is very slow, so I ran across iglob, which returns an
iterator, which seemed just like what I wanted. I could iterate over
the files that I wanted, not having to read the entire dir.

So the iglob was faster, but accessing the first file took about the
same time as glob.glob.

Here's some code to compare glob vs. iglob performance, it outputs
the time before/after a glob.iglob('*.*') files.next() sequence and a
glob.glob('*.*') sequence.

#!/usr/bin/env python

import glob,time

print '\nTest of glob.iglob'
print 'before iglob:', time.asctime()
files = glob.iglob('*.*')
print 'after iglob:',time.asctime()
print files.next()
print 'after files.next():', time.asctime()

print '\nTest of glob.glob'
print 'before glob:', time.asctime()
files = glob.glob('*.*')
print 'after glob:',time.asctime()

Here are the results:

Test of glob.iglob
before iglob: Sun Jan 31 11:09:08 2010
after iglob: Sun Jan 31 11:09:08 2010
foo.bar
after files.next(): Sun Jan 31 11:09:59 2010

Test of glob.glob
before glob: Sun Jan 31 11:09:59 2010
after glob: Sun Jan 31 11:10:51 2010

The results are about the same for the 2 approaches, both took about
51 seconds. Am I doing something wrong with iglob?

Is there a way to get the first X # of files from a dir with lots of
files, that does not take a long time to run?

thanx, mark
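[For reference, taking just the first 100 matches from a lazy iterator can be spelled with itertools.islice. The sketch below shows that usage pattern with a placeholder pattern; as the replies explain, glob.iglob still reads the whole directory before yielding its first name, so this alone does not remove the start-up delay.]

# Sketch: take the first 100 matches from glob.iglob via itertools.islice.
# Note: iglob still scans the entire directory before the first yield,
# so this does not avoid the start-up cost discussed in this thread.
import glob
import itertools

first_100 = list(itertools.islice(glob.iglob('*.*'), 100))
for name in first_100:
    print name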
From: Skip Montanaro on 31 Jan 2010 12:59

> So the iglob was faster, but accessing the first file took about the
> same time as glob.glob.

I'll wager most of the time required to access the first file is due to
filesystem overhead, not any inherent limitation in Python.

Skip Montanaro
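[One way to check that hypothesis is to time a raw os.listdir() against glob.glob() on the same directory: if the two take roughly as long, the time is going into reading the directory itself rather than into Python-level pattern matching. This is only a sketch and assumes the current directory is the one being tested.]

# Sketch: compare a bare directory read with a glob of the same directory,
# to see how much of the time the filesystem itself accounts for.
import glob
import os
import time

start = time.time()
names = os.listdir('.')        # raw directory read, no pattern matching
print 'os.listdir: %.1f s, %d entries' % (time.time() - start, len(names))

start = time.time()
matches = glob.glob('*.*')     # directory read plus fnmatch filtering
print 'glob.glob:  %.1f s, %d matches' % (time.time() - start, len(matches))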
From: John Bokma on 31 Jan 2010 13:06

Kyp <kyp(a)stsci.edu> writes:

> Is there a way to get the first X # of files from a dir with lots of
> files, that does not take a long time to run?

Assuming Linux: what does

  time ls thedir | head

give (with thedir the name of the actual dir)?

Also, how many files is "lots of" files?

--
John Bokma  j3b

Hacking & Hiking in Mexico - http://johnbokma.com/
http://castleamber.com/ - Perl & Python Development
From: Peter Otten on 31 Jan 2010 14:44

Kyp wrote:

> I have a dir with a large # of files that I need to perform operations
> on, but only needing to access a subset of the files, i.e. the first
> 100 files.
>
> Using glob is very slow, so I ran across iglob, which returns an
> iterator, which seemed just like what I wanted. I could iterate over
> the files that I wanted, not having to read the entire dir.
>
> So the iglob was faster, but accessing the first file took about the
> same time as glob.glob.
>
> Here's some code to compare glob vs. iglob performance, it outputs
> the time before/after a glob.iglob('*.*') files.next() sequence and a
> glob.glob('*.*') sequence.
>
> #!/usr/bin/env python
>
> import glob,time
> print '\nTest of glob.iglob'
> print 'before iglob:', time.asctime()
> files = glob.iglob('*.*')
> print 'after iglob:',time.asctime()
> print files.next()
> print 'after files.next():', time.asctime()
>
> print '\nTest of glob.glob'
> print 'before glob:', time.asctime()
> files = glob.glob('*.*')
> print 'after glob:',time.asctime()
>
> Here are the results:
>
> Test of glob.iglob
> before iglob: Sun Jan 31 11:09:08 2010
> after iglob: Sun Jan 31 11:09:08 2010
> foo.bar
> after files.next(): Sun Jan 31 11:09:59 2010
>
> Test of glob.glob
> before glob: Sun Jan 31 11:09:59 2010
> after glob: Sun Jan 31 11:10:51 2010
>
> The results are about the same for the 2 approaches, both took about
> 51 seconds. Am I doing something wrong with iglob?

No, but iglob() being lazy is pointless in your case because it uses
os.listdir() and fnmatch.filter() underneath, which both read the whole
directory before returning anything.

> Is there a way to get the first X # of files from a dir with lots of
> files, that does not take a long time to run?

Here's my attempt. It turned out to be more work than expected, so I cut
a few corners. It's Linux-only "works on my machine" code, but may give
you some hints on how to proceed.

from ctypes import *
import fnmatch
import glob
import os
import re
from itertools import ifilter, imap

class dirent(Structure):
    "works on my machine ;)"
    _fields_ = [
        ("d_ino", c_long),
        ("d_off", c_long),
        ("d_reclen", c_ushort),
        ("d_type", c_ubyte),
        ("d_name", c_char*256)]

direntp = POINTER(dirent)

LIBC = "libc.so.6"
cdll.LoadLibrary(LIBC)
libc = CDLL(LIBC)
libc.readdir.restype = direntp

def diriter(dir):
    "lazy partial replacement for os.listdir()"
    # errors? what errors?
    dirp = libc.opendir(dir)
    if not dirp:
        return
    try:
        while True:
            ep = libc.readdir(dirp)
            if not ep:
                break
            yield ep.contents.d_name
    finally:
        libc.closedir(dirp)

def filter(names, pattern):
    "lazy partial replacement for fnmatch.filter()"
    import posixpath
    pattern = os.path.normcase(pattern)
    r = fnmatch.translate(pattern)
    r = re.compile(r)
    if os.path is not posixpath:
        names = imap(os.path.normcase, names)
    return ifilter(r.match, names)

def globiter(path):
    "lazy partial replacement for glob.glob()"
    dir, filename = os.path.split(path)
    if glob.has_magic(dir):
        raise ValueError("wildcards in directory not supported")
    if not dir:
        dir = os.curdir  # opendir("") fails, so bare patterns like '*.*' need "."
    return filter(diriter(dir), filename)

if __name__ == "__main__":
    import sys
    [pattern] = sys.argv[1:]
    for name in globiter(pattern):
        print name

Peter
From: Benjamin Peterson on 31 Jan 2010 16:30

Kyp <kyp <at> stsci.edu> writes:

> So the iglob was faster, but accessing the first file took about the
> same time as glob.glob.

That would be because glob is implemented in terms of iglob.
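[For context, the Python 2.x glob module is roughly structured as in the sketch below, a paraphrase from memory rather than a verbatim copy of the stdlib source: glob() simply exhausts iglob(), and iglob() ends up calling os.listdir() plus fnmatch.filter() per directory, which is why the first result only appears once the whole directory has been read.]

# Rough paraphrase of the 2.x glob module's structure (details omitted).
import fnmatch
import os

def glob(pathname):
    # glob() is just iglob() collected into a list
    return list(iglob(pathname))

def iglob(pathname):
    dirname, basename = os.path.split(pathname)
    # ... special cases for non-wildcard names and wildcard dirs omitted ...
    for name in glob1(dirname, basename):
        yield os.path.join(dirname, name)

def glob1(dirname, pattern):
    if not dirname:
        dirname = os.curdir
    names = os.listdir(dirname)           # reads the entire directory eagerly
    return fnmatch.filter(names, pattern)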