Unique List (List + Dictionary?) [CSharp]

Prev: Help Break-In Mockina
Next: application domain

From: Harlan Messinger on 17 May 2010 16:11

csharper wrote:

> When I read the OP's question, I was also thinking that HashSet is
> what s/he was looking for. But on 2nd thought, I am wondering how
> different is HashSet from a List or Dictionary in terms of
> performance. HashSet does save us from one or two lines comparing if
> the next item is already in the collection, but how is HashSet
> implemented? In its implementation, doesn't it still need to compare
> if the next item is already in the set? Other than saving us a couple
> of lines of code, does HashSet really improve the performance?

Yes, because the way a HashSet stores its items speeds up searches
enormously in comparison with the way a List operates. With a List, a
search has to examine the existing items one by one. With a HashSet, the
search uses arithmetic to compute where the item will be located if it's
already a member, and checks that location directly. See
http://www.vcskicks.com/hashset.php for a sample performance comparison.

From: Arne Vajhøj on 17 May 2010 19:33

On 17-05-2010 15:43, csharper wrote:
> When I read the OP's question, I was also thinking that HashSet is
> what s/he was looking for. But on 2nd thought, I am wondering how
> different is HashSet from a List or Dictionary in terms of
> performance. HashSet does save us from one or two lines comparing if
> the next item is already in the collection, but how is HashSet
> implemented? In its implementation, doesn't it still need to compare
> if the next item is already in the set? Other than saving us a couple
> of lines of code, does HashSet really improve the performance?

The name HashSet somewhat implies that it uses a hash table.

Meaning that lookup is a O(1) operation.

So performance of HashSet would be similar to Dictionary
and much better than List which is O(n) for lookup.

List is better for indexing.

Arne

From: Arne Vajhøj on 17 May 2010 22:00

On 17-05-2010 12:08, Peter Duniho wrote:
>> HashSet<string> words = new HashSet<string>();
>> foreach (string word in text) {
>> words.Add(word);
>> }
>
> The above can be replaced with:
>
> HashSet<string> words = new HashSet<string>(text);

And if .Split(' ') is added to both, then they will both
compile.

Arne

From: Peter Duniho on 18 May 2010 11:18

Arne Vajh�j wrote:
> On 17-05-2010 12:08, Peter Duniho wrote:
>>> HashSet<string> words = new HashSet<string>();
>>> foreach (string word in text) {
>>> words.Add(word);
>>> }
>>
>> The above can be replaced with:
>>
>> HashSet<string> words = new HashSet<string>(text);
>
> And if .Split(' ') is added to both, then they will both
> compile.

They compile fine as long as "text" already implements IEnumerable<string>.

There was nothing in the original post to suggest it didn't, as you seem
to be assuming (in fact, the original post strongly suggests it
_does_�the OP didn't show any declaration of "text", but it is already
being used as an enumerable type in the OP).

Pete

From: Göran Andersson on 19 May 2010 05:43

On 2010-05-17 18:08, Peter Duniho wrote:
> G�ran Andersson wrote:
>> If you are using framework 3.5 or later, you can use a
>> HashSet<string>. The hash set automatically ignores duplicates, so you
>> can just add all the strings to it and at the end it will only contain
>> one of each:
>>
>> HashSet<string> words = new HashSet<string>();
>> foreach (string word in text) {
>> words.Add(word);
>> }
>
> The above can be replaced with:
>
> HashSet<string> words = new HashSet<string>(text);

Good point.

> �where in both examples, of course, whatever "text" is, it's declared to
> be IEnumerable<string>.
>
>> You can also use the Distinct extension method to create a distinct
>> result. This gives simpler code, but has slightly more overhead:
>>
>> List<string> words = text.Distinct().ToList();
>
> The extension method only works in a situation when you could pass the
> object directly to the HashSet constructor, and just using the HashSet
> constructor is simpler than the above (and results in an object that
> continues to have the "set" semantics desired).

Yes, supposing that the set is the desired result. If you just want a
list as the result, there is no specific advantage of going the way over
a set.

>> For earlier versions of the framework the dictionary is the best
>> option. However, you can use a smaller data type than an Object
>> reference, and one that is not a reference type (as the garbage
>> collector has to scan all references each run), like a
>> Dictionary<string, byte>.
>
> Seems like a premature optimization to me. The generational GC in .NET
> will minimize whatever nominal overhead there is in managing the
> dictionary (since it still has to scan the key, which has data locality
> with the value, the bulk of the overhead involved is in the memory i/o
> cost which is incurred regardless). As far as the data type being
> smaller, I doubt there will be any difference between using "object" and
> "byte", because the value is stored inside a managed struct with other
> data elements, the size of which is likely aligned to at least 32-bit size.

The implementation seems to have changed in the latest framework... The
keys and values used to be stored in separate arrays, which would
definitely make the value array smaller with a smaller data type.

>
> Pete

--
G�ran Andersson
_____
http://www.guffa.com

First | Prev | Next | Last
Pages: 1 2 3
Prev: Help Break-In Mockina
Next: application domain