From: John Nagle on 12 Aug 2010 16:40

(Repost with better indentation)

   I'm reading a URL which is a .gz file, and decompressing it.
This works, but it seems far too complex. Yet none of the "wrapping"
you might expect to work actually does. You can't wrap a GzipFile
around an HTTP connection, because GzipFile, reasonably enough, needs
random access, and tries to do "seek" and "tell". Nor is the output
descriptor from gzip general; it fails on "readline", but accepts
"read". (No good reason for that.) So I had to make a second copy.

				John Nagle

def readurl(url) :
    if url.endswith(".gz") :
        nd = urllib2.urlopen(url, timeout=TIMEOUTSECS)
        td1 = tempfile.TemporaryFile()              # compressed file
        td1.write(nd.read())                        # fetch and copy file
        nd.close()                                  # done with network
        td2 = tempfile.TemporaryFile()              # decompressed file
        td1.seek(0)                                 # rewind
        gd = gzip.GzipFile(fileobj=td1, mode="rb")  # wrap unzip
        td2.write(gd.read())                        # decompress file
        td1.close()                                 # done with compressed copy
        td2.seek(0)                                 # rewind
        return(td2)                # return file object for decompressed copy
    else :
        return(urllib2.urlopen(url, timeout=TIMEOUTSECS))
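[Editor's sketch: the buffer-then-decompress idea above, minus the temp files. This is a Python 3 rendering (the thread predates it), using io.BytesIO as the seekable intermediary GzipFile wants; the function name is illustrative, not from the post.]

```python
import gzip
import io

def decompress_response(raw_bytes):
    """Decompress an already-fetched gzip payload entirely in memory.

    Same two-copy structure as readurl(), but with in-memory buffers
    instead of tempfile.TemporaryFile(). Returns a rewound file-like
    object over the decompressed data.
    """
    buf = io.BytesIO(raw_bytes)                    # compressed copy, seekable
    with gzip.GzipFile(fileobj=buf, mode="rb") as gd:
        return io.BytesIO(gd.read())               # decompressed copy, rewound

# The returned BytesIO is a fully general file object -- readline works.
payload = gzip.compress(b"line one\nline two\n")   # stand-in for nd.read()
fo = decompress_response(payload)
print(fo.readline())
```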
From: Thomas Jollans on 12 Aug 2010 17:40

On Thursday 12 August 2010, it occurred to John Nagle to exclaim:
> (Repost with better indentation)

Good, good.

> def readurl(url) :
>     if url.endswith(".gz") :

The file name could be anything. You should be checking the response
Content-Type header -- that's what it's for.

>         nd = urllib2.urlopen(url, timeout=TIMEOUTSECS)
>         td1 = tempfile.TemporaryFile()              # compressed file

You can keep the whole thing in memory by using StringIO.

>         td1.write(nd.read())                        # fetch and copy file

You're reading the entire file into memory anyway ;-)

>         nd.close()                                  # done with network
>         td2 = tempfile.TemporaryFile()              # decompressed file

Okay, maybe there is something missing from GzipFile -- but still you
could use StringIO again, I expect.

> Nor is the output descriptor from gzip general; it fails
> on "readline", but accepts "read".

>>> from gzip import GzipFile
>>> GzipFile.readline
<unbound method GzipFile.readline>
>>> GzipFile.readlines
<unbound method GzipFile.readlines>
>>> GzipFile.__iter__
<unbound method GzipFile.__iter__>
>>>

What exactly is it that's failing, and how?

>         td1.seek(0)                                 # rewind
>         gd = gzip.GzipFile(fileobj=td1, mode="rb")  # wrap unzip
>         td2.write(gd.read())                        # decompress file
>         td1.close()                                 # done with compressed copy
>         td2.seek(0)                                 # rewind
>         return(td2)                # return file object for decompressed copy
>     else :
>         return(urllib2.urlopen(url, timeout=TIMEOUTSECS))
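[Editor's sketch of the header check Thomas suggests: decide from the Content-Type header, not the URL suffix, whether to decompress. `maybe_decompress` and its `headers` mapping are hypothetical stand-ins for the response object's header access; the MIME types shown are the common ones for gzip data.]

```python
import gzip
import io

def maybe_decompress(body, headers):
    """Decompress `body` only if the response headers say it is gzip data.

    `headers` is any mapping of header names to values (e.g. what an
    HTTP response object exposes). The URL's file name plays no part.
    """
    ctype = headers.get("Content-Type", "")
    if ctype in ("application/gzip", "application/x-gzip"):
        return gzip.GzipFile(fileobj=io.BytesIO(body), mode="rb").read()
    return body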
From: Aahz on 12 Aug 2010 23:24

In article <4c645c39$0$1595$742ec2ed(a)news.sonic.net>,
John Nagle <nagle(a)animats.com> wrote:
>
> I'm reading a URL which is a .gz file, and decompressing it. This
> works, but it seems far too complex. Yet none of the "wrapping"
> you might expect to work actually does. You can't wrap a GzipFile
> around an HTTP connection, because GzipFile, reasonably enough, needs
> random access, and tries to do "seek" and "tell". Nor is the output
> descriptor from gzip general; it fails on "readline", but accepts
> "read". (No good reason for that.) So I had to make a second copy.

Also consider using zlib directly.
--
Aahz (aahz(a)pythoncraft.com)           <*>         http://www.pythoncraft.com/
"...if I were on life-support, I'd rather have it run by a Gameboy than a
Windows box."  --Cliff Wells
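[Editor's sketch of the zlib route Aahz points at: a decompressobj consumes chunks as they arrive and never calls seek() or tell(), so it can sit directly on a socket-like stream -- no temp files, no second copy. The generator shape here is illustrative, not from the thread.]

```python
import zlib

def gunzip_stream(chunks):
    """Streaming gzip decompression using zlib directly.

    wbits = 16 + zlib.MAX_WBITS (i.e. 31) tells zlib to expect a gzip
    header and trailer. Each compressed chunk yields whatever
    decompressed bytes are available so far.
    """
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        yield d.decompress(chunk)
    yield d.flush()      # drain anything buffered at end of stream
```

Usage: feed it whatever iterable of byte chunks the HTTP layer hands back (for example, repeated `read(8192)` calls on the response) and join or process the decompressed pieces as they come out.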