From: BGB / cr88192 on

"Gene" <gene.ressler(a)gmail.com> wrote in message
news:5fec9639-3709-4e65-b4ab-31fb87ad1ceb(a)f5g2000vbm.googlegroups.com...
On Feb 9, 4:28 pm, Patricia Shanahan <p...(a)acm.org> wrote:
> Ikke wrote:
> > Hi everybody,
>
> > I'm stuck with a little problem, and I can't seem to find the right way
> > to solve it.
> ...
> > What I need now is a way to store these files in an orderly fashion, but
> > I don't have any metadata to go on. All files have the same extension,
> > and all are named the same way (a number indicating the sequence, and
> > the extension ".xml").
> ...
>
> > Does anyone have any other suggestions? I've thought about using a
> > database, but I don't like the overhead that would cause, and it would
> > mean adding an additional tool for someone who just wants to check a
> > single file. Having them on a file system also means these files are
> > instantly accessible.
>
> Although you don't want to use a database, the problem cries out for the
> sort of indexing data structures databases use.
>
> For example, maybe the solution is a B-tree with each leaf a directory,
> rather than a disk block. The B-tree, and an associated application,
> would provide fast ways of finding a file by name, inserting and
> removing files. A user could still use "find" or operating system
> indexing to locate a file directly.
>
> You may be able to use a simpler structure if the set of files does not
> change, but unchanging sets of a million actively used anythings are rare.
>

<--
Isn't this re-inventing the wheel? Each NTFS folder is already a B+
tree. Has anyone actually done a test to verify the assertion that a
big folder behaves poorly in NTFS? I routinely deal with folders that
have 70,000 files on servers. File access there is very fast, assuming
you have enough RAM to prevent thrashing during multiple accesses, which
will also happen in a DBMS with cache starvation.

I _have_ seen NTFS performance go way south on big folders due to a very
well-known and widely used AV checker that monitors disk activity. Turning
off the AV checker brought performance instantly back.
-->

as noted elsewhere in the thread, the problem may well be with Explorer, so
it makes sense to test raw access to the directories (and possibly ignore
the case of Explorer being slow...).
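
a quick, minimal sketch of the kind of raw-access test I mean (Python just
for brevity; the C:\data path and the 1000-file sample are made-up values):

# rough timing sketch: raw directory access, bypassing the shell entirely.
# point "root" at a real test directory before running.
import os
import random
import time

root = r"C:\data"                     # hypothetical big directory

t0 = time.time()
names = os.listdir(root)              # raw enumeration, no Explorer overhead
print("listed %d entries in %.3f s" % (len(names), time.time() - t0))

sample = random.sample(names, min(1000, len(names)))
t0 = time.time()
for name in sample:
    with open(os.path.join(root, name), "rb") as f:
        f.read(64)                    # touch each file, read a few bytes
print("opened %d files in %.3f s" % (len(sample), time.time() - t0))

if the raw numbers come back fine but Explorer still crawls, the problem is
the shell (or something hooked into it, like an AV scanner), not NTFS
itself...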


admittedly, Explorer can be damn slow at times, noting how absurdly slow its
ZIP file browsing is, for example...

OTOH, I wrote a ZIP lib which can do full read/write random access on a ZIP
file, and which scaled up fairly well in tests into the thousands-of-files
range...

granted, the ZIP format is itself not ideally suited to this use pattern (an
actual FS-like format would be better), but ZIP's worked for my uses...

internally, the lib used AVL trees for things like space management (and was
partly limited in terms of effective max sub-file size, due to the general
lack of file fragmentation in "common" ZIP and the need to keep entire files
buffered in memory while working on them...).

on load/save, it also converted the ZIP-style flat-listings into a more
conventional directory-based structure (on saving, it would then re-write
the central directory), ... the lib was also designed in such a way that
failing to commit the central directory was a recoverable operation (it
would re-scan the file and locate the contained files, ...).

note that this works because the design of the ZIP file format is actually a
bit more robust than most of the tools around give it credit for (most of
which simply fail if there is no central directory, ...).
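
a rough sketch of the sort of recovery scan I mean (not the actual lib code,
just an illustration; Python, and "archive.zip" is a placeholder name): walk
the file for local-header signatures and rebuild the listing from those...

# scan for local file header signatures ("PK\x03\x04") and rebuild a
# listing without touching the central directory.
# caveats: entries that use a data descriptor (flag bit 3) store zero
# sizes in the local header, and a signature can in principle also occur
# inside compressed data, so a real tool should sanity-check each hit.
import struct

def scan_local_headers(path):
    entries = []
    with open(path, "rb") as f:
        data = f.read()
    pos = 0
    while True:
        pos = data.find(b"PK\x03\x04", pos)
        if pos < 0:
            break
        (sig, ver, flags, method, mtime, mdate,
         crc, csize, usize, nlen, xlen) = struct.unpack_from("<4s5H3I2H", data, pos)
        name = data[pos + 30:pos + 30 + nlen].decode("cp437", "replace")
        entries.append((name, csize, usize, pos))
        # jump past header + name + extra + (claimed) compressed data
        pos += 30 + nlen + xlen + csize
    return entries

for name, csize, usize, offset in scan_local_headers("archive.zip"):
    print(name, csize, usize, offset)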



From: Rod Pemberton on

"James Harris" <james.harris.1(a)googlemail.com> wrote in message
news:0b3e5bd8-dec4-4745-bbc3-16b0571efe12(a)v25g2000yqk.googlegroups.com...
> On 9 Feb, 20:52, Ikke <i...(a)hier.be> wrote:
> > [snip]
>
> ... In the decimal you might
> want to remove the last directory level so rather than
>
> c:\data\15\87\45\32\15874532.xml
>
> you'd have
>
> c:\data\15\87\45\15874532.xml
>
> Then you'd have up to 100 files in one folder, and up to 100 folders
> in each higher folder. I'm making some assumptions on things you
> haven't told us but given what you have this sounds like a good
> solution.
>

He did mention hex. At two characters per dir name, that'd give 256 per
dir. Still reasonable.


RP


From: Rod Pemberton on
"Ikke" <ikke(a)hier.be> wrote in message
news:Xns9D1ADE8234107ikkehierbe(a)69.16.176.253...
>
> ... I need to process a few million files ...

That's when you buy a miniframe...

Although, it seems clusters, and especially clusters of Xbox 360s lately, are
popular.

What do you mean by "process" exactly? E.g., reports? computation?
archiving? indexing? searching? etc.

> Stuffing all files in one and the same directory isn't an option:

It'd be a bad choice too. A linear search for data would likely take
forever. Just separating the data by a single character can drastically
reduce the time needed to sort data. I.e., you're "indexing" the data by a
single character...

> What I need now is a way to store these files in an orderly fashion, but
> I don't have any metadata to go on. All files have the same extension,
> and all are named the same way (a number indicating the sequence, and the
> extension ".xml").
>
> I've thought about converting the number into a string (or a hex string),
> and using subparts of this string as the directory. For example, if the
> number was 15874532, then the full path would become c:\data\15\87\45\32
> or c:\data\00\F2\39\E4, with the filename being 15874532.xml . The second
> option would yield larger directory capacity, but would be less human
> readable.

You're attempting to use the filesystem's directory structure as a database
and an index. While that can be fast, I'm not sure it'd be optimal. You'd
definitely want a ramdisk (or solid-state disk drives) big enough to load
and store "a few million files", otherwise you may have hardware-related
throughput bottlenecks when reading/writing repeatedly to the filesystem.
I'd guess that a database would be a better choice. It should implement
efficient hashing, data indexing, data compression, etc.
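
As a sketch of what that might look like (SQLite is the lightweight option
mentioned elsewhere in this thread; the table and column names here are
invented):

# minimal sketch: stuff each numbered XML file into one SQLite table.
import sqlite3

con = sqlite3.connect("files.db")
con.execute("""CREATE TABLE IF NOT EXISTS xmlfiles (
                   seq  INTEGER PRIMARY KEY,   -- the number from the filename
                   body BLOB NOT NULL          -- the raw XML
               )""")

def put(seq, path):
    with open(path, "rb") as f:
        con.execute("INSERT OR REPLACE INTO xmlfiles VALUES (?, ?)",
                    (seq, f.read()))

def get(seq):
    row = con.execute("SELECT body FROM xmlfiles WHERE seq = ?",
                      (seq,)).fetchone()
    return row[0] if row else None

put(15874532, "15874532.xml")          # hypothetical input file
print(len(get(15874532)), "bytes back")
con.commit()

Committing once per batch of inserts rather than per file keeps the load
fast, and a single .db file sidesteps the millions-of-tiny-files problem
entirely.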

> Does anyone have any other suggestions? I've though[t] about using a
> database,

....

> but I don't like the overhead that would cause

What overhead? Are you saying a random access, indexed search has more time
overhead than a linear search of a file? (FYI, I'm having a hard time
believing the overhead of a database is that great.)

> ... it would
> mean adding an additional tool for someone who just wants to check a
> single file.

1) How do they (your users) locate the single file they need? The name you
presented has no meaning - at least to me. I.e., you'd need a database just
to tell them what file to look in for whatever they're looking for... If
there was date and time information encoded into the numeric name, then it'd
make sense that someone might be accessing the file directly by name. E.g.,
WK021220101615 - Week ending Feb. 12 2010 4:15pm. It's also wise to encode
the timezone. If you're recording dated and timed transactions, the prior
day's entries may appear to be off by an hour when shown in local time, if
you record timestamps as the typical integer seconds from GMT/UTC _and_ the
local zone uses Daylight Saving Time.

2) Some filesystems support links. The links could be named anything that
is meaningful to the user (or database), while the file name can be a
cryptic mess, buried in a directory nested many levels deep (see the sketch
after this list). Although, for faster access, I'd recommend keeping the
links as close to the top or system directory as possible (but not in the
top dir, for security reasons).

3) Do both. You've already got the data as files, yes? Just keep them.
But, also shove them into a database for faster random data access.
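
A sketch of the link idea from 2) (the paths and names are invented; note
that creating symlinks on NTFS usually needs elevated privileges, and a hard
link via os.link is an alternative when link and target share a volume):

# the real file lives deep in a numbered tree; a human-readable link
# points at it.
import os

target = r"C:\data\15\87\45\32\15874532.xml"    # cryptic "real" location
link = r"C:\reports\invoices-2010-02-12.xml"    # meaningful alias

os.makedirs(os.path.dirname(link), exist_ok=True)
os.symlink(target, link)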

> Having them on a file system also means these files are
> instantly accessible.

What? Didn't you just state that you found filesystem access exceptionally
"sloooow"?


Rod Pemberton


From: rossum on
On Tue, 09 Feb 2010 22:39:19 GMT, Ikke <ikke(a)hier.be> wrote:

>I've never tried sqlite, the only databases I've ever used are Oracle and
>MySQL, neither of which are suited for this purpose. Oracle for example
>requires too much processing power (the system these files will reside on
>is a very old machine with a big hard disk).
If you use a database then you are probably better off with several
smaller disks than one large disk. At the very least, the files should
be on one spindle and the indexes on another.

rossum

From: Moi on
On Tue, 09 Feb 2010 20:52:13 +0000, Ikke wrote:

> Hi everybody,
>
> I'm stuck with a little problem, and I can't seem to find the right way
> to solve it.
>
> Basically, I need to process a few million files and gather some data
> from them. Processing them is not a problem, storing them is a big
> problem on the other hand.
>
> Stuffing all files in one and the same directory isn't an option: I
> don't know whether or not NTFS can handle millions of files in one
> directory, and even if it could, I've got a few directories with
> thousands of files in them, and reading from these is sloooow....
>
> What I need now is a way to store these files in an orderly fashion, but
> I don't have any metadata to go on. All files have the same extension,
> and all are named the same way (a number indicating the sequence, and
> the extension ".xml").

Only people who store XML in files seem to have this kind of problem.

>
> I've thought about converting the number into a string (or a hex string),
> and using subparts of this string as the directory. For example, if the
> number was 15874532, then the full path would become c:\data\15\87\45\32
> or c:\data\00\F2\39\E4, with the filename being 15874532.xml . The
> second option would yield larger directory capacity, but would be less
> human readable.
>
> Does anyone have any other suggestions? I've thought about using a
> database, but I don't like the overhead that would cause, and it would
> mean adding an additional tool for someone who just wants to check a
> single file. Having them on a file system also means these files are
> instantly accessible.

The standard way is to use the first few characters of the filename as directory names.

e.g.
1/5/8/15874532.xml
or
15/87/45/15874532.xml
to have your directories more densely populated.

The advantage of this method is that the filename characters are guaranteed
to be valid directory-name characters as well.

You may have to change your filename convention to use leading zeros.
You may or may not use a fixed number of directory levels.
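
A minimal sketch of that scheme (Python; the two-digit width, three levels,
and eight-digit zero padding are just example choices):

# zero-pad the sequence number, then peel off fixed-width chunks as
# directory levels.
import os

def shard_path(root, seq, levels=3, width=2, total=8):
    s = str(seq).zfill(total)                          # e.g. "15874532"
    parts = [s[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(os.path.join(root, *parts), s + ".xml")

print(shard_path("data", 15874532))   # e.g. data/15/87/45/15874532.xml

With those choices, each directory holds at most 100 subdirectories, and
each leaf directory at most 100 files.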

HTH,
AvK