From: Robbo on
On Tue, 09 Feb 2010 20:52:13 +0000, Ikke wrote:

> Hi everybody,
>
> I'm stuck with a little problem, and I can't seem to find the right way
> to solve it.
>
> Basically, I need to process a few million files and gather some data
> from them. Processing them is not a problem, storing them is a big
> problem on the other hand.
>
> Stuffing all files in one and the same directory isn't an option: I
> don't know whether or not NTFS can handle millions of files in one
> directory, and even if it could, I've got a few directories with
> thousands of files in them, and reading from these is sloooow....
>
> What I need now is a way to store these files in an orderly fashion, but
> I don't have any metadata to go on. All files have the same extension,
> and all are named the same way (a number indicating the sequence, and
> the extension ".xml").
>
> I've thought about converting the number into a string (or a hex string),
> and using subparts of this string as the directory. For example, if the
> number was 15874532, then the full path would become c:\data\15\87\45\32
> or c:\data\00\F2\39\E4, with the filename being 15874532.xml . The
> second option would yield larger directory capacity, but would be less
> human readable.
>
> Does anyone have any other suggestions? I've thought about using a
> database, but I don't like the overhead that would cause, and it would
> mean adding an additional tool for someone who just wants to check a
> single file. Having them on a file system also means these files are
> instantly accessible.
>
> Thanks,
>
> Ikke
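
For reference, the path scheme described above could be sketched roughly
like this in Python. The function name sharded_path, the c:\data root and
the fixed width of 8 digits are assumptions made for illustration; they
are not something the question specifies beyond its two example paths.

```python
from pathlib import PureWindowsPath


def sharded_path(number, root="c:\\data", use_hex=False, width=8):
    """Build a nested directory path from a file's sequence number.

    The number is zero-padded to `width` digits (decimal or upper-case
    hex) and split into two-character chunks, one chunk per directory.
    """
    if use_hex:
        digits = format(number, "0{}X".format(width))
    else:
        digits = format(number, "0{}d".format(width))
    # Two characters per level: "15874532" -> ["15", "87", "45", "32"]
    parts = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    return PureWindowsPath(root, *parts, "{}.xml".format(number))


print(sharded_path(15874532))
print(sharded_path(15874532, use_hex=True))
```

With two hex characters per level each directory has at most 256
entries, against 100 for the decimal variant.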

I have been programming with small xml files for some time now, so here
are a few things that I have learned.

Option 1 - use Oracle XML DB to store your xml data (it's tuned for
processing xml data). I think that the free version of Oracle has a cut
down version of XML DB for you to try.

Option 2 - use mounted file systems to store several of your xml files,
and then have several of these 'mounts'. What this gives you is a large
file (say 100MB) which contains a lot of xml files (say 1,000). When you
want to access the files within this file, you simply mount it (I admit,
I am not sure if you can do this on Windows).

Option 3 - follows on from option 2: use zip or tar files to hold the
smaller xml files instead and, in a similar fashion, use whichever API
you prefer to access the data held within.

Undoubtedly, options 2 and 3 have some overhead, but you may find that
this is negligible when compared with the impact of exceeding your file
system's limits.
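
As a rough illustration of option 3, here is a minimal sketch using the
zipfile module from Python's standard library. The helper names pack_xml
and read_xml and the sample documents are made up for the example, and an
in-memory buffer stands in for an archive file on disk.

```python
import io
import zipfile


def pack_xml(archive, documents):
    """Write a {file name: xml text} mapping into a single zip archive."""
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, text in documents.items():
            zf.writestr(name, text)


def read_xml(archive, name):
    """Return the text of one member without unpacking the others."""
    with zipfile.ZipFile(archive, "r") as zf:
        return zf.read(name).decode("utf-8")


# Hypothetical sample data; a real archive would hold ~1,000 files.
buffer = io.BytesIO()
pack_xml(buffer, {
    "15874532.xml": "<doc id='15874532'/>",
    "15874533.xml": "<doc id='15874533'/>",
})
print(read_xml(buffer, "15874532.xml"))
```

Passing a path such as "batch_0001.zip" instead of the buffer would put
the archive on disk; anyone who wants to check a single file can still
open the archive with an ordinary zip tool.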

Option 4 - use a file system that is designed to handle small files (e.g.
ZFS, the Zettabyte File System designed by Sun Microsystems and released
as open source). Again, I admit I am not sure what file systems are
available for Windows, which may highlight the advantages of programming
on non-Microsoft operating systems.

I hope this is of some use to you.

Robbo