From: Robbo on 19 Feb 2010 18:11

On Tue, 09 Feb 2010 20:52:13 +0000, Ikke wrote:

> Hi everybody,
>
> I'm stuck with a little problem, and I can't seem to find the right way
> to solve it.
>
> Basically, I need to process a few million files and gather some data
> from them. Processing them is not a problem, storing them is a big
> problem on the other hand.
>
> Stuffing all files in one and the same directory isn't an option: I
> don't know whether or not NTFS can handle millions of files in one
> directory, and even if it could, I've got a few directories with
> thousands of files in them, and reading from these is sloooow....
>
> What I need now is a way to store these files in an orderly fashion, but
> I don't have any metadata to go on. All files have the same extension,
> and all are named the same way (a number indicating the sequence, and
> the extension ".xml").
>
> I've thought about converting the number into a string (or a hex string),
> and using subparts of this string as the directory. For example, if the
> number was 15874532, then the full path would become c:\data\15\87\45\32
> or c:\data\00\F2\39\E4, with the filename being 15874532.xml . The
> second option would yield larger directory capacity, but would be less
> human readable.
>
> Does anyone have any other suggestions? I've thought about using a
> database, but I don't like the overhead that would cause, and it would
> mean adding an additional tool for someone who just wants to check a
> single file. Having them on a file system also means these files are
> instantly accessible.
>
> Thanks,
>
> Ikke

I have been programming with small xml files for some time now, so here
are a few things that I have learned.

Option 1 - use Oracle XML DB to store your xml data (it's tuned for
processing xml data). I think that the free version of Oracle has a
cut-down version of XML DB for you to try!

Option 2 - use mountable file system images to hold your xml files, and
have several of these 'mounts'. What this gives you is one large file
(say 100MB) which contains a lot of xml files (say 1,000). When you want
to access the files within it, you simply mount it (I admit, I am not
sure if you can do this on Windows!).

Option 3 - follows on from option 2: use zip or tar files to hold the
smaller xml files instead, and in a similar fashion use whichever API
suits you to access the data held within.

Undoubtedly, options 2 and 3 have some overhead, but you may find that
this is negligible when compared with the impact of exceeding what your
file system can comfortably handle.

Option 4 - use a file system that is designed to handle small files (e.g.
ZFS, the Zettabyte File System, designed by Sun Microsystems and released
as open source). Again I admit, I am not sure what file systems are
available for Windows, which may highlight the advantages of programming
on non-MS OSes!

I hope this is of some use to you!

Robbo
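
As a rough illustration of option 3, here is a minimal Python sketch that groups the numbered xml files into zip archives of about 1,000 files each, so the file system only ever sees a few thousand archives. Everything in it is an assumption made for the example rather than something from the thread: the names archive_path, store_xml and load_xml, the FILES_PER_ARCHIVE constant, and the eight-digit archive naming are all hypothetical. It only uses the standard zipfile module.

```python
import zipfile
from pathlib import Path

# Assumption: roughly 1,000 files per archive, echoing the "say 1,000" figure.
FILES_PER_ARCHIVE = 1000


def archive_path(root: Path, number: int) -> Path:
    """Return the zip archive that should hold <number>.xml (hypothetical layout)."""
    bucket = number // FILES_PER_ARCHIVE
    return root / f"{bucket:08d}.zip"


def store_xml(root: Path, number: int, xml_text: str) -> None:
    """Append <number>.xml to its archive, creating the archive if needed."""
    root.mkdir(parents=True, exist_ok=True)
    # Mode "a" appends to an existing archive or creates a new one.
    with zipfile.ZipFile(archive_path(root, number), "a", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f"{number}.xml", xml_text)


def load_xml(root: Path, number: int) -> str:
    """Read <number>.xml back out of its archive as text."""
    with zipfile.ZipFile(archive_path(root, number), "r") as zf:
        return zf.read(f"{number}.xml").decode("utf-8")


if __name__ == "__main__":
    data_root = Path("data")  # hypothetical storage directory
    store_xml(data_root, 15874532, "<doc>example</doc>")
    print(archive_path(data_root, 15874532))
    print(load_xml(data_root, 15874532))
```

With this layout a few million files end up in a few thousand archives in a single directory, and anyone who wants to check one file can still open the archive with an ordinary zip tool. Note that storing the same number twice appends a second entry to the archive instead of replacing the first, so a real version would need to decide how to handle updates.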