From: Sjouke Burry on
Ikke wrote:
> Hi everybody,
>
> I'm stuck with a little problem, and I can't seem to find the right way
> to solve it.
>
> Basically, I need to process a few million files and gather some data
> from them. Processing them is not a problem, storing them is a big
> problem on the other hand.
>
> Stuffing all files in one and the same directory isn't an option: I don't
> know whether or not NTFS can handle millions of files in one directory,
> and even if it could, I've got a few directories with thousands of files
> in them, and reading from these is sloooow....
>
> What I need now is a way to store these files in an orderly fashion, but
> I don't have any metadata to go on. All files have the same extension,
> and all are named the same way (a number indicating the sequence, and the
> extension ".xml").
>
> I've thought about converting the number into a string (or a hex string),
> and using subparts of this string as the directory. For example, if the
> number was 15874532, then the full path would become c:\data\15\87\45\32
> or c:\data\00\F2\39\E4, with the filename being 15874532.xml. The second
> option would yield larger directory capacity, but would be less human
> readable.
>
> Does anyone have any other suggestions? I've thought about using a
> database, but I don't like the overhead that would cause, and it would
> mean adding an additional tool for someone who just wants to check a
> single file. Having them on a file system also means these files are
> instantly accessible.
>
> Thanks,
>
> Ikke
When I had to manage about 100 GB of data, I packed it with pkzip.
For processing I unpacked parts of it, and after processing I deleted
the unpacked files.
Because the files were ASCII text files they compressed by about 90%:
lots of space characters and repeating digit strings.
If you pack them in lots of 100-1000 files per zip file, the number
of files won't be a problem, since only one lot is unpacked at a time.
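Something like this would do the batching (a rough Python sketch, untested;
the directory names and the batch size are just placeholders):

import os
import zipfile

SRC_DIR = r"c:\data\incoming"   # where the .xml files currently live (made up)
OUT_DIR = r"c:\data\archives"   # where the zip files go (made up)
BATCH = 1000                    # files per zip file

names = sorted(os.listdir(SRC_DIR))
for i in range(0, len(names), BATCH):
    zip_path = os.path.join(OUT_DIR, "batch_%06d.zip" % (i // BATCH))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names[i:i + BATCH]:
            # store each file under its original name inside the archive
            zf.write(os.path.join(SRC_DIR, name), arcname=name)

To process one lot, extract it to a scratch directory with extractall(),
run the processing, and delete the extracted files again.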
From: Sjouke Burry on
Ikke wrote:
> <snip>
Another option that occurs to me is to store all the files in one big
meta file, where each file gets a header with its old name and
a line or byte count, and for redundancy a closing line with
some end-of-record label.
That leaves you with one file, or only a small number of them,
which a small program can easily convert back to the original files,
or process directly from the meta file.
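A sketch of what I mean (Python; the FILE/END record layout here is just
something I made up, not any standard format):

import os

def pack(filenames, container_path):
    # one header line per file: "FILE <name> <byte count>", then the raw
    # bytes, then a closing "END" line for redundancy
    with open(container_path, "wb") as out:
        for name in filenames:
            with open(name, "rb") as f:
                data = f.read()
            header = "FILE %s %d\n" % (os.path.basename(name), len(data))
            out.write(header.encode("ascii"))
            out.write(data)
            out.write(b"END\n")

def unpack(container_path, dest_dir):
    # recreate the original files; assumes the names contain no spaces
    with open(container_path, "rb") as f:
        while True:
            header = f.readline()
            if not header:
                break
            tag, name, size = header.decode("ascii").split()
            if tag != "FILE":
                raise ValueError("bad header: %r" % header)
            data = f.read(int(size))
            if f.readline() != b"END\n":
                raise ValueError("container corrupted near %s" % name)
            with open(os.path.join(dest_dir, name), "wb") as out:
                out.write(data)

The same loop in unpack() can just as easily process each file in memory
instead of writing it back out.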
From: Andrew Poelstra on
On 2010-02-09, Ikke <ikke(a)hier.be> wrote:
> Andrew Poelstra <apoelstra(a)localhost.localdomain> wrote in
> news:slrnhn3jv1.akl.apoelstra(a)localhost.localdomain:
>
><snip>
>> If you've got millions of files, they aren't going to be
>> human-accessible anyway.
>
> I don't see why not - when someone wants to check a file, he/she knows
> the id of the file (as in the example: 15874532). So he/she simply
> navigates to folder 15/87/45/32, and opens the file.
>

Perhaps whatever gave them that ID number would also be able to
load the file from the database, by copying it to a temporary
file or something.
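For instance, a tiny helper could dump one file to a temporary location and
hand it to the default .xml viewer (Python sketch; 'fetch' stands for whatever
lookup function the eventual store provides, so it is entirely hypothetical):

import os
import tempfile

def view(file_id, fetch):
    # fetch(file_id) is assumed to return the raw XML bytes for that id,
    # whether they come from a database, a container file, or elsewhere
    data = fetch(file_id)
    fd, tmp = tempfile.mkstemp(prefix="%d_" % file_id, suffix=".xml")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.startfile(tmp)   # Windows-only: opens the file with its default handler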

> There are almost a million files on my home computer, I've never had any
> problem accessing those, so it's possible that I don't quite understand
> what you mean.
>

I suspect that if you went into the system directories where
those hundreds of thousands of files live, you'd find that
directory listings start slowing down.

>> And any database software is going to be faster than NTFS, so
>> I wouldn't worry about overhead. Can you store the entire files
>> in the database instead of using the filesystem at all?
>
> At work, we've conducted some tests in the past, and databases are rarely
> faster than a file system when it comes to uncached items.
>

Hmm. Well, I can't really argue with evidence, but there are
good uses for filesystems and good uses for databases, and
this seems an awful lot like the latter.

> As for the overhead, I meant the overhead that comes with writing the
> software to insert/update/delete the files in the database (not that much
> work), and building an interface for the users to work with files in the
> database (which would require more work). Enabling the users to access
> the files directly would be easier, hence the file system.
>

Fair enough. But what kind of files are these that you've got
millions of them and nothing already in place to edit them? If
you've got an application to work with the files, that program
should be in charge of finding and loading them.

>> (Would sqlite work? It's very lightweight and very fast.)
>
> I've never tried sqlite, the only databases I've ever used are Oracle and
> MySQL, neither of which are suited for this purpose. Oracle for example
> requires too much processing power (the system these files will reside on
> is a very old machine with a big hard disk).
>

See
http://www.sqlite.org/whentouse.html
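For what it's worth, a blob store in sqlite is only a few lines. A minimal
sketch in Python with the stock sqlite3 module (the database path and table
name are made up):

import sqlite3

conn = sqlite3.connect(r"c:\data\files.db")
conn.execute("CREATE TABLE IF NOT EXISTS files (id INTEGER PRIMARY KEY, body BLOB)")

def store(file_id, xml_bytes):
    conn.execute("INSERT OR REPLACE INTO files (id, body) VALUES (?, ?)",
                 (file_id, xml_bytes))
    conn.commit()

def fetch(file_id):
    row = conn.execute("SELECT body FROM files WHERE id = ?",
                       (file_id,)).fetchone()
    return row[0] if row else None

# e.g.: store(15874532, open("15874532.xml", "rb").read())

A trivial command-line wrapper around fetch() would let a user pull out a
single file without learning any SQL.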

Good luck.

From: BGB / cr88192 on

"Ikke" <ikke(a)hier.be> wrote in message
news:Xns9D1ADE8234107ikkehierbe(a)69.16.176.253...
> <snip>
> I've thought about converting the number into a string (or a hex string),
> and using subparts of this string as the directory. For example, if the
> number was 15874532, then the full path would become c:\data\15\87\45\32
> or c:\data\00\F2\39\E4, with the filename being 15874532.xml. The second
> option would yield larger directory capacity, but would be less human
> readable.
>

Deep nesting is not likely to help much either:
the FS typically has some per-directory overhead, so a large number of
directories would strain the FS code as well.

Better may be to use shallower nesting, for example two levels of 10-bit
key hashes: 1024*1024 leaf directories, each holding up to around 1024 files.
That gives a capacity of around 1024*1024*1024 files (around 1 billion)
while keeping the average directory size under 1000 (it is unlikely for
all of the space to be used).

This uses only 2 directory levels; 3 levels may also make sense...
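The two-level version could look roughly like this (Python sketch; the "hash"
is just a bit-field split of the numeric id, and the base directory is made up):

import os

BASE = r"c:\data"

def path_for(file_id):
    # two levels of 1024 buckets each, taken from bits 20-29 and 10-19 of the id
    top = (file_id >> 20) & 0x3FF
    mid = (file_id >> 10) & 0x3FF
    return os.path.join(BASE, "%03X" % top, "%03X" % mid, "%d.xml" % file_id)

def store(file_id, data):
    p = path_for(file_id)
    d = os.path.dirname(p)
    if not os.path.isdir(d):
        os.makedirs(d)
    with open(p, "wb") as f:
        f.write(data)

# path_for(15874532) -> c:\data\00F\08E\15874532.xml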


As others have noted, deflating the data and packing it into one big
flat file may be a lot more efficient; one can then use a B-tree or
AVL tree for managing the data (AVL trees are much simpler, but B-trees
scale better to large numbers of entries). Using deflate for on-disk data
not only saves space but can actually make access faster, since the speed of
compressing/decompressing is often higher than the disk IO speed.
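A minimal version of that, with a plain in-memory dict standing in for the
B-tree/AVL index (Python sketch using zlib; a real version would also have to
persist the index somewhere):

import zlib

class PackedStore(object):
    # append-only flat file of deflated records, plus an index that maps
    # key -> (file offset, compressed length)
    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}

    def put(self, key, data):
        blob = zlib.compress(data)
        self.f.seek(0, 2)              # jump to the end of the file
        offset = self.f.tell()
        self.f.write(blob)
        self.index[key] = (offset, len(blob))

    def get(self, key):
        offset, length = self.index[key]
        self.f.seek(offset)
        return zlib.decompress(self.f.read(length))

# store = PackedStore(r"c:\data\packed.bin")
# store.put(15874532, open("15874532.xml", "rb").read())
# xml = store.get(15874532)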

Granted, if there is a lot of data, it may make sense to use multiple data
files for storage (or even multiple tree files).

B-trees are typically a little better than AVL trees for databases because
it is easier to work with a tree which is partly in RAM and partly on disk
(this matters more for large trees), whereas AVL trees typically work better
if one can read/write the whole tree at once (reading/writing an AVL tree as
individual nodes is likely to be less efficient).

AVL trees are, however, much easier to work with (keeping them sorted and
balanced, stepping over leaves, ...).

So there are tradeoffs here...


> Does anyone have any other suggestions? I've though about using a
> database, but I don't like the overhead that would cause, and it would
> mean adding an additional tool for someone who just wants to check a
> single file. Having them on a file system also means these files are
> instantly accessible.
>

Well, filesystem lookup time is likely to be the main issue here. However,
reading/writing files directly tends to be much faster than, for example,
listing them with Explorer, as Explorer itself seems to be terribly slow...

So one may find that a directory with 10000-20000 files works fairly
well when accessed by the app, but takes a very long time to open in
Explorer...

I suspect this is actually mostly the fault of Explorer, rather than the
NTFS driver...


It may be worth trying it and seeing, though...



From: Gene on
On Feb 9, 4:28 pm, Patricia Shanahan <p...(a)acm.org> wrote:
> Ikke wrote:
> > Hi everybody,
>
> > I'm stuck with a little problem, and I can't seem to find the right way
> > to solve it.
> ...
> > What I need now is a way to store these files in an orderly fashion, but
> > I don't have any metadata to go on. All files have the same extension,
> > and all are named the same way (a number indicating the sequence, and the
> > extension ".xml").
> ...
>
> > Does anyone have any other suggestions? I've thought about using a
> > database, but I don't like the overhead that would cause, and it would
> > mean adding an additional tool for someone who just wants to check a
> > single file. Having them on a file system also means these files are
> > instantly accessible.
>
> Although you don't want to use a database, the problem cries out for the
> sort of indexing data structures databases use.
>
> For example, maybe the solution is a B-tree with each leaf a directory,
> rather than a disk block. The B-tree, and an associated application,
> would provide fast ways of finding a file by name, inserting and
> removing files. A user could still use "find" or operating system
> indexing to locate a file directly.
>
> You may be able to use a simpler structure if the set of files does not
> change, but unchanging sets of a million actively used anythings are rare.
>

Isn't this re-inventing the wheel? Each NTFS folder is already a B+
tree. Has anyone actually done a test to verify the assertion that a
big folder behaves poorly in NTFS? I routinely deal with folders that
have 70,000 files on servers. File access there is very fast, assuming
you have enough RAM to prevent thrashing during multiple accesses, which
will also happen in a DBMS with cache starvation.

I _have_ seen NTFS performance go way south on big folders due to a very
well-known and widely used AV checker that monitors disk activity. Turning
off the AV checker brought performance instantly back.