From: Ikke on
Hi everybody,

I'm stuck with a little problem, and I can't seem to find the right way
to solve it.

Basically, I need to process a few million files and gather some data
from them. Processing them is not a problem, storing them is a big
problem on the other hand.

Stuffing all the files into a single directory isn't an option: I don't
know whether NTFS can handle millions of files in one directory, and
even if it could, I've got a few directories with thousands of files
in them, and reading from these is sloooow....

What I need now is a way to store these files in an orderly fashion, but
I don't have any metadata to go on. All files have the same extension,
and all are named the same way (a number indicating the sequence, and the
extension ".xml").

I've thought about converting the number into a string (or a hex string),
and using subparts of this string as directory names. For example, if the
number was 15874532, then the full path would become c:\data\15\87\45\32
or c:\data\00\F2\39\E4, with the filename being 15874532.xml. The second
option would yield larger directory capacity, but would be less human
readable.
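
Roughly, that mapping would look something like this (a Python sketch;
the zero-padding to eight digits and the fixed root are just assumptions
for the example):

import os

def decimal_path(file_id, root=r"c:\data"):
    # 15874532 -> c:\data\15\87\45\32\15874532.xml
    digits = f"{file_id:08d}"                        # zero-pad to 8 digits
    parts = [digits[i:i+2] for i in range(0, 8, 2)]  # four 2-digit levels
    return os.path.join(root, *parts, f"{file_id}.xml")

def hex_path(file_id, root=r"c:\data"):
    # 15874532 -> c:\data\00\F2\39\E4\15874532.xml
    digits = f"{file_id:08X}"                        # zero-pad to 8 hex digits
    parts = [digits[i:i+2] for i in range(0, 8, 2)]
    return os.path.join(root, *parts, f"{file_id}.xml")

print(decimal_path(15874532))   # c:\data\15\87\45\32\15874532.xml
print(hex_path(15874532))       # c:\data\00\F2\39\E4\15874532.xml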

Does anyone have any other suggestions? I've thought about using a
database, but I don't like the overhead that would cause, and it would
mean adding an additional tool for someone who just wants to check a
single file. Having them on a file system also means these files are
instantly accessible.

Thanks,

Ikke
From: Andrew Poelstra on
On 2010-02-09, Ikke <ikke(a)hier.be> wrote:
<snip>
> Basically, I need to process a few million files and gather some data
> from them. Processing them is not a problem, storing them is a big
> problem on the other hand.
<snip>
> Does anyone have any other suggestions? I've thought about using a
> database, but I don't like the overhead that would cause, and it would
> mean adding an additional tool for someone who just wants to check a
> single file. Having them on a file system also means these files are
> instantly accessible.

If you've got millions of files, they aren't going to be
human-accessible anyway.

And any database software is going to be faster than NTFS, so
I wouldn't worry about overhead. Can you store the entire files
in the database instead of using the filesystem at all?

(Would sqlite work? It's very lightweight and very fast.)
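
Something along these lines would do it, using Python's built-in sqlite3
module (the table name and schema here are just made up for the example):

import sqlite3

conn = sqlite3.connect("files.db")
conn.execute("CREATE TABLE IF NOT EXISTS xml_files ("
             "id INTEGER PRIMARY KEY, body BLOB)")

def store(file_id, data):
    # INSERT OR REPLACE keeps re-runs idempotent
    conn.execute("INSERT OR REPLACE INTO xml_files (id, body) VALUES (?, ?)",
                 (file_id, data))
    conn.commit()

def load(file_id):
    row = conn.execute("SELECT body FROM xml_files WHERE id = ?",
                       (file_id,)).fetchone()
    return row[0] if row else None

store(15874532, b"<record>...</record>")
print(load(15874532))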

From: Patricia Shanahan on
Ikke wrote:
> Hi everybody,
>
> I'm stuck with a little problem, and I can't seem to find the right way
> to solve it.
....
> What I need now is a way to store these files in an orderly fashion, but
> I don't have any metadata to go on. All files have the same extension,
> and all are named the same way (a number indicating the sequence, and the
> extension ".xml").
....
>
> Does anyone have any other suggestions? I've thought about using a
> database, but I don't like the overhead that would cause, and it would
> mean adding an additional tool for someone who just wants to check a
> single file. Having them on a file system also means these files are
> instantly accessible.

Although you don't want to use a database, the problem cries out for the
sort of indexing data structures databases use.

For example, maybe the solution is a B-tree with each leaf a directory,
rather than a disk block. The B-tree, and an associated application,
would provide fast ways of finding a file by name, inserting and
removing files. A user could still use "find" or operating system
indexing to locate a file directly.
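
As a very rough sketch of the routing part (not a real B-tree, just a
sorted list of leaf boundaries searched with bisect; the boundary values
and paths are invented):

import bisect
import os

# First id stored in each "leaf" directory, kept sorted.
leaf_starts = [0, 250000, 500000, 750000]
leaf_dirs = [r"c:\data\leaf0", r"c:\data\leaf1",
             r"c:\data\leaf2", r"c:\data\leaf3"]

def leaf_for(file_id):
    # Rightmost boundary <= file_id picks the directory.
    i = bisect.bisect_right(leaf_starts, file_id) - 1
    return leaf_dirs[i]

def path_for(file_id):
    return os.path.join(leaf_for(file_id), f"{file_id}.xml")

print(path_for(15874532))   # lands in the last leaf with these boundaries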

You may be able to use a simpler structure if the set of files does not
change, but unchanging sets of a million actively used anythings are rare.

Patricia
From: Ikke on
Andrew Poelstra <apoelstra(a)localhost.localdomain> wrote in
news:slrnhn3jv1.akl.apoelstra(a)localhost.localdomain:

<snip>
> If you've got millions of files, they aren't going to be
> human-accessible anyway.

I don't see why not - when someone wants to check a file, he/she knows
the id of the file (as in the example: 15874532). So he/she simply
navigates to folder 15/87/45/32, and opens the file.

There are almost a million files on my home computer and I've never had
any problem accessing them, so it's possible that I don't quite understand
what you mean.

> And any database software is going to be faster than NTFS, so
> I wouldn't worry about overhead. Can you store the entire files
> in the database instead of using the filesystem at all?

At work, we've conducted some tests in the past, and databases are rarely
faster than a file system when it comes to uncached items.

As for the overhead, I meant the overhead that comes with writing the
software to insert/update/delete the files in the database (not that much
work), and building an interface for the users to work with files in the
database (which would require more work). Enabling the users to access
the files directly would be easier, hence the file system.

> (Would sqlite work? It's very lightweight and very fast.)

I've never tried sqlite; the only databases I've ever used are Oracle and
MySQL, neither of which is suited for this purpose. Oracle, for example,
requires too much processing power (the system these files will reside on
is a very old machine with a big hard disk).

Thanks,

Ikke
From: James Harris on
On 9 Feb, 20:52, Ikke <i...(a)hier.be> wrote:
<snip>
> I've thought about converting the number into a string (or a hex string),
> and using subparts of this string as directory names. For example, if the
> number was 15874532, then the full path would become c:\data\15\87\45\32
> or c:\data\00\F2\39\E4, with the filename being 15874532.xml. The second
> option would yield larger directory capacity, but would be less human
> readable.

Your own suggestion sounds most appropriate. With the decimal scheme you
might want to remove the last directory level, so rather than

c:\data\15\87\45\32\15874532.xml

you'd have

c:\data\15\87\45\15874532.xml

Then you'd have up to 100 files in each leaf folder, and up to 100
subfolders in each higher-level folder. I'm making some assumptions about
things you haven't told us, but given what you have said this sounds like
a good solution.
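
In code, roughly (a Python sketch; the zero-padding to eight digits is an
assumption):

import os

def three_level_path(file_id, root=r"c:\data"):
    # 15874532 -> c:\data\15\87\45\15874532.xml
    digits = f"{file_id:08d}"                        # zero-pad to 8 digits
    parts = [digits[0:2], digits[2:4], digits[4:6]]  # drop the last level
    return os.path.join(root, *parts, f"{file_id}.xml")

print(three_level_path(15874532))   # c:\data\15\87\45\15874532.xml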

James