From: Taliesin Nuin on 13 Nov 2009 11:51

Hi,

I've been out of proper programming for a while (ended up doing mostly database design work) and am attempting to put together a system to model some complex networks (in fact, relationships between some business entities). I need to find the best way to model these networks, which are essentially hundreds to thousands of nodes with a potentially unlimited number of lines of different types between any of them (in practice, likely between 0 and 20 lines between any given two nodes, and the number of nodes connected to *typically* between 1 and 30).

Queries on this data will commonly take the form of picking a subject node and tracing all connected nodes out to a chosen depth where the connecting line is of a chosen type.

It's a little hard to describe this, but harder still to draw diagrams in ASCII. ;) Hopefully the scenario is clear from the above.

I'm trying to find the best way of modelling this. The model will be read much more often than it is written to, but it must be possible to add and remove nodes and edges in reasonable time nonetheless. And obviously it needs to be persistent.

The only solution I am aware of is to model this in a relational database as an adjacency list. I have serious concerns about the performance of this approach, but at least I understand it (well, I have a copy of Joe Celko's Trees and Hierarchies in SQL next to me ;) and it offers the persistence and interfaces that I need.

I have a vague awareness that there are such things as GraphML. My concern is that such methods can't deal with the data piecemeal (i.e. read part of the graph into memory from the file and work with it without having to instantiate the entire monster), which I can do with the database approach. The lines will have attached information, which also points toward the database approach, but this has to scale, and a relational database doesn't seem well suited to the task.
I will be caching the results of as many queries as possible, but I obviously can't cache them all, and they will need to be rebuilt after each update as well.

So that's the size of the task I've bitten off. I'm obviously not looking for implementation details, but any suggestions people can make about which technologies are appropriate or what the right approach is would be hugely useful. I am aware of how abstract this question is, but that's because unfortunately I may not know enough to ask the right questions, yet.

For preference, technologies accessible from C++ or Python are ideal for me, but I'll muddle through in whatever is necessary. I could code something up myself in theory, but I am sure this is a known problem that better programmers have found the best answers to before me.

Thanks very much for any replies.

Taliesin.
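The adjacency-list-in-a-relational-database approach described above can be sketched in a few lines. This is a minimal illustration only, assuming SQLite via Python's sqlite3 module; the table names (`node`, `edge`) and the sample data are my own invention, not anything from the thread. The depth-limited, type-filtered query is done as an iterative breadth-first expansion, one SELECT per hop:

```python
import sqlite3

# Illustrative adjacency-list schema: one row per node, one row per
# directed edge, with an index supporting "edges of type T out of node N".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE edge (
        src  INTEGER REFERENCES node(id),
        dst  INTEGER REFERENCES node(id),
        type TEXT
    );
    CREATE INDEX edge_src_type ON edge (src, type);
""")
conn.executemany("INSERT INTO node VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c"), (4, "d")])
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)",
                 [(1, 2, "owns"), (2, 3, "owns"), (2, 4, "supplies")])

def reachable(conn, start, edge_type, depth):
    """Breadth-first expansion from `start`, following only edges of
    `edge_type`, out to at most `depth` hops. Returns the set of
    visited node ids, including `start` itself."""
    seen = {start}
    frontier = [start]
    for _ in range(depth):
        if not frontier:
            break
        marks = ",".join("?" * len(frontier))
        rows = conn.execute(
            f"SELECT dst FROM edge WHERE type = ? AND src IN ({marks})",
            [edge_type, *frontier]).fetchall()
        frontier = [d for (d,) in rows if d not in seen]
        seen.update(frontier)
    return seen

print(sorted(reachable(conn, 1, "owns", 2)))
```

A recursive CTE (`WITH RECURSIVE`) could push the whole traversal into a single SQL statement on engines that support it; the per-hop loop above is just the most portable way to show the idea.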
From: Gene on 13 Nov 2009 22:48
On Nov 13, 11:51 am, Taliesin Nuin <tn...(a)bath.ac.uk> wrote:
> [original question snipped]
You didn't say if persistence needs to be continuous. If so, a relational DB is about the best you will do. Associate an adjacency list with each node (by indexed join) giving the list of edges (represented as node pairs).

You can get _much_ better speed if by persistence you mean that it's possible to freeze the graph by writing it to a file and rebuild it later. In this case the data structure is the same: nodes containing adjacency lists of edges, which are pairs of pointers to nodes. It's easy to write such a graph to disk in XML or some other convenient format by using a depth-first search algorithm to discover all the nodes and edges exactly once. Use unique node and edge IDs to facilitate this. One convenient idea is to keep nodes and edges in arrays and use the array index as the respective ID. Then reading the graph back in is absolutely straightforward.
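The array-index-as-ID scheme from the reply above can be sketched as follows. This is a minimal illustration under my own assumptions: the class name, JSON as the freeze format, and the sample data are invented for the example. Note that once IDs are array indices, freezing reduces to dumping the arrays in order; the depth-first search mentioned above is what you would need to enumerate nodes exactly once if they were linked only by raw pointers.

```python
import json

class Graph:
    """Nodes and edges live in parallel lists; an item's ID is simply
    its index in the list, so IDs survive a freeze/thaw round trip."""
    def __init__(self):
        self.nodes = []  # node id -> payload (any JSON-able data)
        self.adj = []    # node id -> list of outgoing edge ids
        self.edges = []  # edge id -> (src id, dst id, type)

    def add_node(self, payload):
        self.nodes.append(payload)
        self.adj.append([])
        return len(self.nodes) - 1

    def add_edge(self, src, dst, etype):
        self.edges.append((src, dst, etype))
        eid = len(self.edges) - 1
        self.adj[src].append(eid)
        return eid

    def freeze(self):
        # Because IDs are array indices, no ID table is needed:
        # position in the serialized list *is* the ID.
        return json.dumps({"nodes": self.nodes, "edges": self.edges})

    @classmethod
    def thaw(cls, blob):
        g = cls()
        data = json.loads(blob)
        for payload in data["nodes"]:
            g.add_node(payload)
        for src, dst, etype in data["edges"]:
            g.add_edge(src, dst, etype)  # rebuilds adjacency lists too
        return g

g = Graph()
a, b, c = (g.add_node(n) for n in "abc")
g.add_edge(a, b, "owns")
g.add_edge(b, c, "owns")
g2 = Graph.thaw(g.freeze())
```

XML, as suggested in the reply, would work just as well as the container format; JSON merely keeps the sketch short.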