From: guillaume on 7 Jun 2010 05:10

My question is related to the discussion "about the hashtable and equals", where I learnt a few interesting things.

In my company we deal with large volumes of data indexed by unique integer ids (for background, we work with road traffic data, and the road network is modelled as a set of edges representing road sections). The edges' ids run from 0 to the total number of edges (around 20 million for the French network). However, we often deal with only a subset of the road network, of 100,000, 1 million or 10 million ids, uniformly spread between 0 and 20 million.

We usually use a C# Dictionary with integer keys to store any per-edge data (road section characteristics, archived data, ...), and it scales very well up to several million ids. However, using a generic hashing algorithm over 32-bit integers (potentially spanning -2.14 x 10^9 to +2.14 x 10^9) when the distribution of our keys is much narrower has always seemed like overkill to me.

One optimization I came up with is simply using a dense array of the maximal size, where the key of each object is its index in the array. It works pretty well: for example, when initializing a dictionary with 10 million keys randomly chosen between 0 and 20 million, not using a 10-million-entry key/value Dictionary saves more than 200 MB, even though the array actually holds 20 million slots. In addition, even though Dictionary access is theoretically O(1), random key access in my tests was on average twice as fast with the array. Memory-wise this approach is also very good when I want to access read-only data modelled as a C# struct, as I can pre-allocate one big array of structs and remove all the per-object overhead (object headers, GC references and so on). I even pushed the idea further by implementing a generic ArrayDictionary<int,T> class implementing the C# IDictionary<int,T> interface, so I can switch between the two implementations at runtime.

So, all in all, my questions are:
- What is the hashing algorithm used by the C# Dictionary? Is it generic for all types of data once a GetHashCode function is provided?
- Does my optimization sound good, or should I have studied the possibilities offered by the existing Dictionary more closely (using another hashing algorithm adapted to my needs?)

Sorry for the long message, and thank you if you read through it!

Guillaume
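Guillaume does not post his ArrayDictionary<int,T> code, but a minimal sketch of the dense-array approach he describes might look like the following. DenseArrayMap and its members are hypothetical names, not his code, and only the core operations are shown; a full IDictionary<int,T> implementation would also need Keys, Values, Remove, enumeration and so on.

```csharp
using System.Collections.Generic;

// Sketch of a dense-array map keyed by edge id in the range [0, capacity).
public class DenseArrayMap<T>
{
    private readonly T[] _values;      // one slot per possible key
    private readonly bool[] _present;  // marks which slots hold a real entry
    private int _count;

    public DenseArrayMap(int capacity)
    {
        _values = new T[capacity];
        _present = new bool[capacity];
    }

    public int Count
    {
        get { return _count; }
    }

    public bool ContainsKey(int key)
    {
        return key >= 0 && key < _present.Length && _present[key];
    }

    public T this[int key]
    {
        get
        {
            if (!ContainsKey(key))
                throw new KeyNotFoundException("No value stored for key " + key);
            return _values[key];
        }
        set
        {
            if (!_present[key]) { _present[key] = true; _count++; }
            _values[key] = value;
        }
    }

    public bool TryGetValue(int key, out T value)
    {
        if (ContainsKey(key))
        {
            value = _values[key];
            return true;
        }
        value = default(T);
        return false;
    }
}
```

The trade-off is the up-front allocation of one slot per possible key (20 million here), which only pays off when a large fraction of the key space is actually in use, as in Guillaume's case. When T is a struct, the values are stored inline in the array, which is where the per-object overhead saving he mentions comes from.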
From: Rick Lones on 7 Jun 2010 07:14

guillaume wrote:
<snip>

If your n keys are in fact each unique and span the range from 1 to n (or similar), then I don't see how you can do better than a dense array, even theoretically. A hash table helps you most when your keys are sparse and not inherently ordered.

As for the dictionary: I believe the underlying structure is a red-black tree, but that will not perform as well, either space- or time-wise, as a straight indexed lookup for a dense, well-ordered set of keys. The dictionary lookup would be O(log n), not O(1). Hash table access is O(1), but with overhead to maintain the bucket structure and deal with collisions. The array access should be O(1) with almost no overhead except for things outside your control (e.g., paging), which affect the other methods as well.

> - What is the hashing algorithm used by the C# Dictionary? Is it generic
> for all types of data once a GetHashCode function is provided?
Where are you coming from with this question? Dictionaries don't hash their keys, they use them as-is - they just have to be an IComparable type so they can be properly placed within the tree. GetHashCode would have nothing to do with dictionaries, AFAIK.

HTH,
-rick-
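The uncontested point here is that a dense array beats any keyed container for dense, well-ordered keys (the follow-ups below correct the red-black-tree claim: Dictionary<> is a hash table, so its lookups are amortized O(1) rather than O(log n), but the conclusion stands). A rough micro-benchmark along the lines Guillaume describes might look like this; it is a sketch, not code from the thread, and the sizes simply follow his description.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// 10 million keys drawn from a 20-million key space, looked up via
// Dictionary<int,int> and via direct indexing into a dense array.
// Absolute timings will vary by machine and runtime.
class LookupBenchmark
{
    static void Main()
    {
        const int keySpace = 20000000;
        const int keyCount = 10000000;
        Random rng = new Random(42);

        Dictionary<int, int> dict = new Dictionary<int, int>(keyCount);
        int[] dense = new int[keySpace];
        int[] keys = new int[keyCount];

        for (int i = 0; i < keyCount; i++)
        {
            int k = rng.Next(keySpace);   // duplicates possible; fine for a sketch
            keys[i] = k;
            dict[k] = i;
            dense[k] = i;
        }

        Stopwatch sw = Stopwatch.StartNew();
        long dictSum = 0;
        foreach (int k in keys) dictSum += dict[k];
        Console.WriteLine("Dictionary lookups: " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        long arraySum = 0;
        foreach (int k in keys) arraySum += dense[k];
        Console.WriteLine("Array lookups:      " + sw.ElapsedMilliseconds + " ms");
    }
}
```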
From: Jeroen Mostert on 7 Jun 2010 13:25

On 2010-06-07 13:14, Rick Lones wrote:
<snip>
> As for the dictionary: I believe the underlying structure is a red-black
> tree, but that will not perform as well, either space- or time-wise, as a
> straight indexed lookup for a dense, well-ordered set of keys.

Careful. SortedDictionary<> uses a red-black tree; Dictionary<> implements a hash table. The names of SortedDictionary and SortedList are poorly chosen, but there you go.

--
J.
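A quick way to see the distinction Jeroen draws (this snippet is illustrative, not from the thread): Dictionary<> gives amortized O(1) hash-based lookups with no guaranteed enumeration order, while SortedDictionary<> gives O(log n) tree-based lookups and always enumerates in key order.

```csharp
using System;
using System.Collections.Generic;

// Dictionary<> is a hash table; SortedDictionary<> is a red-black tree.
// The easiest observable difference is enumeration order.
class DictionaryKinds
{
    static void Main()
    {
        var hash = new Dictionary<int, string>();
        var tree = new SortedDictionary<int, string>();

        foreach (int k in new[] { 42, 7, 19 })
        {
            hash[k] = "edge " + k;
            tree[k] = "edge " + k;
        }

        // Enumeration order of Dictionary<> is an implementation detail...
        Console.WriteLine(string.Join(", ", hash.Keys));
        // ...while SortedDictionary<> always yields 7, 19, 42.
        Console.WriteLine(string.Join(", ", tree.Keys));
    }
}
```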
From: Rick Lones on 7 Jun 2010 20:42

Jeroen Mostert wrote:
> On 2010-06-07 13:14, Rick Lones wrote:
> <snip>
>> As for the dictionary: I believe the underlying structure is a red-black
>> tree, but that will not perform as well, either space- or time-wise, as a
>> straight indexed lookup for a dense, well-ordered set of keys.
>
> Careful. SortedDictionary<> uses a red-black tree; Dictionary<> implements
> a hash table. The names of SortedDictionary and SortedList are poorly
> chosen, but there you go.

Yes, thank you. SortedDictionary was the one I meant. And I was assuming the OP also meant that, maybe incorrectly, because it would be the best case for him.

-rick-
From: Arne Vajhøj on 7 Jun 2010 21:36
On 07-06-2010 05:10, guillaume wrote:
<snip>
> So, all in all, my questions are:
> - What is the hashing algorithm used by the C# Dictionary? Is it generic
> for all types of data once a GetHashCode function is provided?

I think so. But you can check the source code. Either get it from MS or simply use Reflector.

> - Does my optimization sound good, or should I have studied the
> possibilities offered by the existing Dictionary more closely (using
> another hashing algorithm adapted to my needs?)

Nothing will beat direct indexing into an array. Dictionaries are intended as a solution for when that is not possible.

Arne
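Arne's "I think so" is right as far as it goes: Dictionary<TKey,TValue> has no key-type-specific hashing of its own. It defers to EqualityComparer<TKey>.Default (or a comparer passed to its constructor), which in turn calls the key's GetHashCode and Equals; for int keys, Int32.GetHashCode() is simply the value itself, and the dictionary maps that hash onto its internal buckets. A small demonstration sketch, with EdgeId made up for illustration:

```csharp
using System;
using System.Collections.Generic;

// A custom key type with an instrumented GetHashCode shows that the
// Dictionary<> asks the key (via EqualityComparer<TKey>.Default) for its hash
// on both insert and lookup.
struct EdgeId : IEquatable<EdgeId>
{
    public readonly int Value;
    public EdgeId(int value) { Value = value; }

    public bool Equals(EdgeId other) { return Value == other.Value; }

    public override int GetHashCode()
    {
        Console.WriteLine("GetHashCode called for edge " + Value);
        return Value;
    }
}

class HashingDemo
{
    static void Main()
    {
        var byEdge = new Dictionary<EdgeId, string>();
        byEdge[new EdgeId(12)] = "section 12";   // prints "GetHashCode called..."
        string s = byEdge[new EdgeId(12)];       // prints again on lookup
        Console.WriteLine(s);
    }
}
```

This also points at the supported answer to Guillaume's second question: if he wanted a hash tuned to his id distribution rather than the default one, he could pass a custom IEqualityComparer<int> to the Dictionary constructor, although for dense ids the plain array will still be hard to beat.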