From: Ikke on 9 Feb 2010 15:52

Hi everybody,

I'm stuck with a little problem, and I can't seem to find the right way to solve it.

Basically, I need to process a few million files and gather some data from them. Processing them is not a problem; storing them, on the other hand, is a big problem.

Stuffing all the files in one and the same directory isn't an option: I don't know whether or not NTFS can handle millions of files in one directory, and even if it could, I've got a few directories with thousands of files in them, and reading from these is sloooow....

What I need now is a way to store these files in an orderly fashion, but I don't have any metadata to go on. All the files have the same extension, and all are named the same way (a number indicating the sequence, plus the extension ".xml").

I've thought about converting the number into a string (or a hex string), and using subparts of this string as the directory. For example, if the number was 15874532, then the full path would become c:\data\15\87\45\32 or c:\data\00\F2\39\E4, with the filename being 15874532.xml. The second option would yield larger directory capacity, but would be less human-readable.

Does anyone have any other suggestions? I've thought about using a database, but I don't like the overhead that would cause, and it would mean adding an additional tool for someone who just wants to check a single file. Having them on a file system also means these files are instantly accessible.

Thanks,
Ikke
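[The scheme described in this post can be sketched in a few lines. This is an illustration, not part of the original thread; the function name, the "data" base directory, and the fixed 8-character zero-padding are assumptions.]

```python
import os

def id_to_path(file_id, base="data", hex_mode=False):
    """Map a numeric file id to a nested path of two-character levels."""
    # Zero-pad to 8 characters so every id yields four directory segments.
    s = format(file_id, "08X" if hex_mode else "08d")
    parts = [s[i:i + 2] for i in range(0, len(s), 2)]
    # The last segment doubles as a directory; the file keeps its full id.
    return os.path.join(base, *parts, f"{file_id}.xml")

# id_to_path(15874532)                -> data/15/87/45/32/15874532.xml
# id_to_path(15874532, hex_mode=True) -> data/00/F2/39/E4/15874532.xml
```

Note that 15874532 is 0x00F239E4, which matches both example paths in the post: the decimal split gives at most 100 entries per directory level, the hex split at most 256.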
From: Andrew Poelstra on 9 Feb 2010 16:15

On 2010-02-09, Ikke <ikke(a)hier.be> wrote:
> Hi everybody,
>
> I'm stuck with a little problem, and I can't seem to find the right way
> to solve it.
<snip>
> Does anyone have any other suggestions? I've thought about using a
> database, but I don't like the overhead that would cause, and it would
> mean adding an additional tool for someone who just wants to check a
> single file. Having them on a file system also means these files are
> instantly accessible.

If you've got millions of files, they aren't going to be human-accessible anyway. And any database software is going to be faster than NTFS, so I wouldn't worry about overhead. Can you store the entire files in the database instead of using the filesystem at all?

(Would sqlite work? It's very lightweight and very fast.)
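[As an illustration of the sqlite suggestion: storing each XML file as a blob keyed by its id needs only one table. This sketch is not from the thread; the table layout and helper names are assumptions.]

```python
import sqlite3

def open_store(path="files.db"):
    # One table: the numeric id is the primary key, the XML body a blob.
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS files (id INTEGER PRIMARY KEY, body BLOB)")
    return conn

def put_file(conn, file_id, data):
    conn.execute("INSERT OR REPLACE INTO files (id, body) VALUES (?, ?)",
                 (file_id, data))
    conn.commit()

def get_file(conn, file_id):
    row = conn.execute("SELECT body FROM files WHERE id = ?",
                       (file_id,)).fetchone()
    return row[0] if row else None
```

Lookup by primary key is a single B-tree descent, so millions of rows are not a problem; the trade-off, as discussed below, is that users can no longer open a file directly from the filesystem.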
From: Patricia Shanahan on 9 Feb 2010 16:28

Ikke wrote:
> Hi everybody,
>
> I'm stuck with a little problem, and I can't seem to find the right way
> to solve it.
....
> What I need now is a way to store these files in an orderly fashion, but
> I don't have any metadata to go on. All the files have the same extension,
> and all are named the same way (a number indicating the sequence, plus the
> extension ".xml").
....
> Does anyone have any other suggestions? I've thought about using a
> database, but I don't like the overhead that would cause, and it would
> mean adding an additional tool for someone who just wants to check a
> single file. Having them on a file system also means these files are
> instantly accessible.

Although you don't want to use a database, the problem cries out for the sort of indexing data structures databases use. For example, maybe the solution is a B-tree with each leaf a directory, rather than a disk block. The B-tree, and an associated application, would provide fast ways of finding a file by name, and of inserting and removing files. A user could still use "find" or operating system indexing to locate a file directly.

You may be able to use a simpler structure if the set of files does not change, but unchanging sets of a million actively used anythings are rare.

Patricia
From: Ikke on 9 Feb 2010 17:39

Andrew Poelstra <apoelstra(a)localhost.localdomain> wrote in news:slrnhn3jv1.akl.apoelstra(a)localhost.localdomain:

<snip>

> If you've got millions of files, they aren't going to be
> human-accessible anyway.

I don't see why not - when someone wants to check a file, he/she knows the id of the file (as in the example: 15874532). So he/she simply navigates to folder 15/87/45/32 and opens the file. There are almost a million files on my home computer, and I've never had any problem accessing those, so it's possible that I don't quite understand what you mean.

> And any database software is going to be faster than NTFS, so
> I wouldn't worry about overhead. Can you store the entire files
> in the database instead of using the filesystem at all?

At work, we've conducted some tests in the past, and databases are rarely faster than a file system when it comes to uncached items. As for the overhead, I meant the overhead that comes with writing the software to insert/update/delete the files in the database (not that much work), and building an interface for the users to work with files in the database (which would require more work). Enabling the users to access the files directly would be easier, hence the file system.

> (Would sqlite work? It's very lightweight and very fast.)

I've never tried sqlite; the only databases I've ever used are Oracle and MySQL, neither of which is suited for this purpose. Oracle, for example, requires too much processing power (the system these files will reside on is a very old machine with a big hard disk).

Thanks,
Ikke
From: James Harris on 9 Feb 2010 17:58

On 9 Feb, 20:52, Ikke <i...(a)hier.be> wrote:
> Hi everybody,
>
> I'm stuck with a little problem, and I can't seem to find the right way
> to solve it.
<snip>
> I've thought about converting the number into a string (or a hex string),
> and using subparts of this string as the directory. For example, if the
> number was 15874532, then the full path would become c:\data\15\87\45\32
> or c:\data\00\F2\39\E4, with the filename being 15874532.xml. The second
> option would yield larger directory capacity, but would be less
> human-readable.
<snip>

Your own suggestion sounds most appropriate. In the decimal case you might want to remove the last directory level, so rather than

c:\data\15\87\45\32\15874532.xml

you'd have

c:\data\15\87\45\15874532.xml

Then you'd have up to 100 files in one folder, and up to 100 folders in each higher folder. I'm making some assumptions about things you haven't told us, but given what you have, this sounds like a good solution.

James
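[James's variant can be sketched the same way as the original scheme, just dropping the last directory level so the leaf directory holds the files whose ids share the first six digits - at most 100 per folder. Illustration only; the helper name and "data" base directory are assumptions, and os.path.join uses the platform's separator rather than the backslashes shown above.]

```python
import os

def id_to_path(file_id, base="data"):
    # Three two-digit directory levels instead of four: the last two
    # digits of the id distinguish up to 100 files per leaf directory.
    s = format(file_id, "08d")
    return os.path.join(base, s[0:2], s[2:4], s[4:6], f"{file_id}.xml")

# id_to_path(15874532) -> data/15/87/45/15874532.xml
```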