From: Nix on 21 Dec 2009 17:24

On 6 Dec 2009, Martin Gregorie verbalised:

> I run my own PostgreSQL-based mail archive, automatically fed from
> Postfix via the magic of the 'always_bcc' directive. Its benefits are:
> - spam isn't archived
> - effectively unlimited archival storage
> - fast searching and retrieving by any combination of address, subject,
>   date range and body text.
>
> I'm planning to make it available shortly: details can be found at
> www.libelle-systems.com

This looks very neat. I'm halfway through an INN backend to do this to
newsfeeds: maybe I can arrange to use a compatible schema...

From: Martin Gregorie on 21 Dec 2009 18:19

On Mon, 21 Dec 2009 22:24:20 +0000, Nix wrote:

> On 6 Dec 2009, Martin Gregorie verbalised:
>> I run my own PostgreSQL-based mail archive, automatically fed from
>> Postfix via the magic of the 'always_bcc' directive. Its benefits are:
>> - spam isn't archived
>> - effectively unlimited archival storage
>> - fast searching and retrieving by any combination of address, subject,
>>   date range and body text.
>>
>> I'm planning to make it available shortly: details can be found at
>> www.libelle-systems.com
>
> This looks very neat. I'm halfway through an INN backend to do this to
> newsfeeds: maybe I can arrange to use a compatible schema...

The schema should work as it stands, since by definition it can handle a
pure text (non-MIME) mail message, which is indexed by the address(es),
date sent and subject. Searches operate on these three terms plus the
plaintext part of the body, so searching and message retrieval should
also work with few if any changes.

Header parsing is handled by JavaMail. I know that it also handles NNTP
traffic, but I have no understanding of the detail of how it does that,
since I'm not currently planning to handle NNTP.

If you go the same way I have, the loader path will need modification and
so will the retrieval method: my MTA Bcc's all mail to a POP3 mailbox and
a cron job batch loads that into the MailArchive database. A search and
retrieve tool selects matching messages for inspection and retrieves
interesting ones by mailing them to the search user.

Both database operations are quick. Last night the loader scanned 117
messages in 10 seconds, loading 83 of them and discarding the rest - the
loader discards anything that SA has marked as spam and anything that
doesn't pass a set of configurable address filters, e.g. I don't archive
system messages such as logwatch, archive loader or backup reports. A
single forename body text search over all 73,600 messages in the
database took 36 seconds to pull all 30 matches out.
That's searching a PostgreSQL database on an 866 MHz single P3 box, with
the search tool on a 1.6 GHz CoreDuo on the other side of my 100 Mbit LAN.

-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |
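
The loader's discard rule described above (drop anything SpamAssassin has marked as spam, then drop anything that fails the address filters) might look roughly like the sketch below. This is not the MailArchive loader, which uses JavaMail; it is a Python illustration using only the standard library. The names `should_archive`, `load_batch` and `EXCLUDED_SENDERS`, and the example addresses, are hypothetical. It assumes spam is flagged with SpamAssassin's usual `X-Spam-Flag: YES` header and that the address filters amount to a list of excluded senders; the real filters are configurable and may work differently.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical filter list: senders whose mail is never archived
# (stand-ins for logwatch, archive loader and backup reports).
EXCLUDED_SENDERS = {"logwatch@example.org", "backup@example.org"}


def should_archive(raw_message, excluded=EXCLUDED_SENDERS):
    """Return True if the message should be loaded into the archive."""
    msg = message_from_string(raw_message)

    # Assumption: spam is identified by SpamAssassin's X-Spam-Flag header.
    if msg.get("X-Spam-Flag", "").strip().upper() == "YES":
        return False

    # Assumption: an address filter is a simple sender exclusion list.
    senders = {addr.lower()
               for _, addr in getaddresses(msg.get_all("From", []))}
    return not (senders & excluded)


def load_batch(raw_messages):
    """Split a batch into messages to keep and a count of discards."""
    kept = [m for m in raw_messages if should_archive(m)]
    return kept, len(raw_messages) - len(kept)
```

A cron-driven loader would call something like `load_batch` on the messages fetched from the POP3 mailbox and insert only the kept ones into the database.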