Prev: Help Break-In Mockina
Next: application domain
From: Sin Jeong-hun on 17 May 2010 00:48 For example, I want to extract unique words from some text. Using List<string> to store the extracted words in the following way will be inefficient List<String> words; foreach(String word in text) { if(words.Contains(word)==false) words.Add(word); } because, according to the documentation, the Contains is an O(n) operation. I was looking for something like a UniqueList but I couldn't. Up until now, I was using something like Dictionary<String, object> words; object dummy = null; foreach(String word in text) { words[word]=dummy; } This works but using unnecessary "dummy" object makes the code look dirty. What would be the best way to achieve what I'm trying to do? Thanks.
From: Peter Duniho on 17 May 2010 01:21 Sin Jeong-hun wrote: > For example, I want to extract unique words from some text. Using > List<string> to store the extracted words in the following way will be > inefficient > [...] I was using something like > Dictionary<String, object> words; > object dummy = null; > foreach(String word in text) > { > words[word]=dummy; > } > This works but using unnecessary "dummy" object makes the code look > dirty. What would be the best way to achieve what I'm trying to do? Sounds like you want the HashSet<T> class.
From: Göran Andersson on 17 May 2010 06:06 On 2010-05-17 06:48, Sin Jeong-hun wrote: > For example, I want to extract unique words from some text. Using > List<string> to store the extracted words in the following way will be > inefficient > List<String> words; > foreach(String word in text) > { > if(words.Contains(word)==false) words.Add(word); > } > because, according to the documentation, the Contains is an O(n) > operation. > I was looking for something like a UniqueList but I couldn't. Up until > now, I was using something like > Dictionary<String, object> words; > object dummy = null; > foreach(String word in text) > { > words[word]=dummy; > } > This works but using unnecessary "dummy" object makes the code look > dirty. What would be the best way to achieve what I'm trying to do? > Thanks. If you are using framework 3.5 or later, you can use a HashSet<string>. The hash set automatically ignores duplicates, so you can just add all the strings to it and at the end it will only contain one of each: HashSet<string> words = new HashSet<string>(); foreach (string word in text) { words.Add(word); } You can also use the Distinct extension method to create a distinct result. This gives simpler code, but has slightly more overhead: List<string> words = text.Distinct().ToList(); For earlier versions of the framework the dictionary is the best option. However, you can use a smaller data type than an Object reference, and one that is not a reference type (as the garbage collector has to scan all references each run), like a Dictionary<string, byte>. -- G�ran Andersson _____ http://www.guffa.com
From: Peter Duniho on 17 May 2010 12:08 G�ran Andersson wrote: > If you are using framework 3.5 or later, you can use a HashSet<string>. > The hash set automatically ignores duplicates, so you can just add all > the strings to it and at the end it will only contain one of each: > > HashSet<string> words = new HashSet<string>(); > foreach (string word in text) { > words.Add(word); > } The above can be replaced with: HashSet<string> words = new HashSet<string>(text); �where in both examples, of course, whatever "text" is, it's declared to be IEnumerable<string>. > You can also use the Distinct extension method to create a distinct > result. This gives simpler code, but has slightly more overhead: > > List<string> words = text.Distinct().ToList(); The extension method only works in a situation when you could pass the object directly to the HashSet constructor, and just using the HashSet constructor is simpler than the above (and results in an object that continues to have the "set" semantics desired). > For earlier versions of the framework the dictionary is the best option. > However, you can use a smaller data type than an Object reference, and > one that is not a reference type (as the garbage collector has to scan > all references each run), like a Dictionary<string, byte>. Seems like a premature optimization to me. The generational GC in .NET will minimize whatever nominal overhead there is in managing the dictionary (since it still has to scan the key, which has data locality with the value, the bulk of the overhead involved is in the memory i/o cost which is incurred regardless). As far as the data type being smaller, I doubt there will be any difference between using "object" and "byte", because the value is stored inside a managed struct with other data elements, the size of which is likely aligned to at least 32-bit size. Pete
From: csharper on 17 May 2010 15:43
On May 17, 6:06 am, Göran Andersson <gu...(a)guffa.com> wrote: > On 2010-05-17 06:48, Sin Jeong-hun wrote: > > > > > For example, I want to extract unique words from some text. Using > > List<string> to store the extracted words in the following way will be > > inefficient > > List<String> words; > > foreach(String word in text) > > { > > if(words.Contains(word)==false) words.Add(word); > > } > > because, according to the documentation, the Contains is an O(n) > > operation. > > I was looking for something like a UniqueList but I couldn't. Up until > > now, I was using something like > > Dictionary<String, object> words; > > object dummy = null; > > foreach(String word in text) > > { > > words[word]=dummy; > > } > > This works but using unnecessary "dummy" object makes the code look > > dirty. What would be the best way to achieve what I'm trying to do? > > Thanks. > > If you are using framework 3.5 or later, you can use a HashSet<string>. > The hash set automatically ignores duplicates, so you can just add all > the strings to it and at the end it will only contain one of each: > > HashSet<string> words = new HashSet<string>(); > foreach (string word in text) { > words.Add(word); > } > > You can also use the Distinct extension method to create a distinct > result. This gives simpler code, but has slightly more overhead: > > List<string> words = text.Distinct().ToList(); > > For earlier versions of the framework the dictionary is the best option. > However, you can use a smaller data type than an Object reference, and > one that is not a reference type (as the garbage collector has to scan > all references each run), like a Dictionary<string, byte>. > > -- > Göran Andersson > _____http://www.guffa.com When I read the OP's question, I was also thinking that HashSet is what s/he was looking for. But on 2nd thought, I am wondering how different is HashSet from a List or Dictionary in terms of performance. HashSet does save us from one or two lines comparing if the next item is already in the collection, but how is HashSet implemented? In its implementation, doesn't it still need to compare if the next item is already in the set? Other than saving us a couple of lines of code, does HashSet really improve the performance? |