From: Brad on 21 Jun 2010 08:34 I think there are better (more efficient) ways to do this, but I've only ever needed this simple approach: void compare(const std::vector<std::string>& strings) { // Normally, this is randomly generated. It is a C style string. char the_string[4] = "abc"; std::vector<std::string>::const_iterator send(strings.end()); for (std::vector<std::string>::const_iterator sit = strings.begin(); sit != send; ++sit) { if (the_string == *sit) std::clog << "MATCH:\t" << *sit << std::endl; } } Basically, I'm comparing a randomly generated string to a vector of strings by iterating through the vector (as seen above). Here is the problem: The vector may have one or millions of strings and execution time slows considerably when I go above a few thousand strings. Is there a more efficient way to approach this? My code works fine on smaller vectors, I just never expected to do millions and it may not be possible to do that many as quickly as a few, but I wanted to ask. Perhaps there is something in algorithm that I could use. Just point me to it. I prefer standard C++ wherever possible. Thanks, Brad -- [ See http://www.gotw.ca/resources/clcm.htm for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ]
From: Seungbeom Kim on 21 Jun 2010 21:27 On 2010-06-21 16:34, Brad wrote: > I think there are better (more efficient) ways to do this, but I've > only ever needed this simple approach: > > void compare(const std::vector<std::string>& strings) > { > // Normally, this is randomly generated. It is a C style string. > char the_string[4] = "abc"; > > std::vector<std::string>::const_iterator send(strings.end()); > > for (std::vector<std::string>::const_iterator sit = strings.begin(); sit != send; ++sit) > { > if (the_string == *sit) > std::clog<< "MATCH:\t"<< *sit<< std::endl; > } > } > > Basically, I'm comparing a randomly generated string to a vector of > strings by iterating through the vector (as seen above). Here is the > problem: The vector may have one or millions of strings and execution > time slows considerably when I go above a few thousand strings. > > Is there a more efficient way to approach this? My code works fine on > smaller vectors, I just never expected to do millions and it may not > be possible to do that many as quickly as a few, but I wanted to ask. > Perhaps there is something in algorithm that I could use. Just point > me to it. I prefer standard C++ wherever possible. As it is given, the function above prints the (same) target string as many times as it is found in the given vector. That task cannot be done faster than a running time of O(strings.size() * strlen(the_string)), and I don't see anything in the given function that makes the running time worse than that. Some implementations may be less than optimal; e.g. the library that comes with g++-4.4 computes strlen(the_string) each time operator== is called, so you could make the function faster by constructing an std::string object before the loop and using it for comparison, but I don't think the difference would be very noticeable unless the_string is quite long. And that just reduces the constant in O(strings.size()), still leaving the algorithm as a linear-time one. Depending on your needs, if you need to do the same thing many times, you could store the strings in an std::multiset<std::string> instead of an std::vector<std::string>, and a call to std::multiset<std::string> ..count(the_string) should give you the count in a logarithmic time. Anyway, the task the function above is doing is quite dumb, and I suspect you're actually doing something much more interesting than that, but too much detail has been left out. -- Seungbeom Kim [ See http://www.gotw.ca/resources/clcm.htm for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ]
From: George Neuner on 21 Jun 2010 21:28 On Mon, 21 Jun 2010 17:34:18 CST, Brad <byte8bits(a)gmail.com> wrote: >I think there are better (more efficient) ways to do this, but I've >only ever needed this simple approach: > >void compare(const std::vector<std::string>& strings) >{ > // Normally, this is randomly generated. It is a C style >string. > char the_string[4] = "abc"; > > std::vector<std::string>::const_iterator send(strings.end()); > > for (std::vector<std::string>::const_iterator sit = strings.begin(); >sit != send; ++sit) > { > if (the_string == *sit) > std::clog << "MATCH:\t" << *sit << std::endl; > } >} > >Basically, I'm comparing a randomly generated string ... What use is a random string? > ... to a vector of >strings by iterating through the vector (as seen above). Here is the >problem: The vector may have one or millions of strings and execution >time slows considerably when I go above a few thousand strings. > >Is there a more efficient way to approach this? RegEx? A dedicated string search algorithm such as Boyer-Moore or Knuth-Morris-Pratt or Rabin-Karp? >My code works fine on smaller vectors, I just never expected to do >millions and it may not be possible to do that many as quickly as >a few, but I wanted to ask. You can't possibly check a long stream as quickly as a short one. All the different search algorithms require preprocessing of the search string ... what is best to do depends on buffering, on how many distinct patterns you need to search for, how often the search set is expected to change and how many times you expect to reuse existing search sets. Boyer-Moore is the fastest method. BM searches forward but matches the pattern in reverse order, so it requires the ability to "back up" in the input stream. That's a simple buffering issue, but it slightly complicates use vs other methods that match forward. The running time of BM is, on average, sublinear because BM typically does not need to examine every input character (it does so only in the worst case). Knuth-Morris-Pratt matches forward, so it's useful in limited memory situations. KMP examines every input character once. RegEx matches forward and can search simultaneously for multiple strings. RegEx examines every input character once. Rabin-Karp matches forward and can search simultaneously for multiple strings. RK examines every potential match character (not necessarily every character). BM computes a hash over each potential match target, so it is best used when the input is clearly delimited. You can control the granularity of matches to tune performance (e.g., for a text document you could hash entire sentences or you could hash by word and string them together to match sentence fragments). There are other algorithms but these are the most well known. >Perhaps there is something in algorithm that I could use. Just point >me to it. I prefer standard C++ wherever possible. Current compilers should have RegEx either in std or in tr1. Alternatively, Boost has RegEx. The dedicated search algorithms you will have to code yourself (or find them already coded somewhere). George -- [ See http://www.gotw.ca/resources/clcm.htm for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ]
From: Daniel Krügler on 22 Jun 2010 08:29 On 22 Jun., 01:34, Brad <byte8b...(a)gmail.com> wrote: > I think there are better (more efficient) ways to do this, but I've > only ever needed this simple approach: > > void compare(const std::vector<std::string>& strings) > { > // Normally, this is randomly generated. It is a C style > string. > char the_string[4] = "abc"; > > std::vector<std::string>::const_iterator send(strings.end()); > > for (std::vector<std::string>::const_iterator sit = strings.begin(); > sit != send; ++sit) > { > if (the_string == *sit) > std::clog << "MATCH:\t" << *sit << std::endl; > } > } > > Basically, I'm comparing a randomly generated string to a vector of > strings by iterating through the vector (as seen above). Here is the > problem: The vector may have one or millions of strings and execution > time slows considerably when I go above a few thousand strings. > > Is there a more efficient way to approach this? My code works fine on > smaller vectors, I just never expected to do millions and it may not > be possible to do that many as quickly as a few, but I wanted to ask. > Perhaps there is something in algorithm that I could use. Just point > me to it. I prefer standard C++ wherever possible. It depends. If you can ensure that the input-vector is sorted it could improve the performance drastically by instead using a binary search algorithm of the random string within the sequence (e.g. lower_bound). Take care that the predicate used for the sorting is the same (at least equivalent) to that used in the binary search. The advantage of this approach is especially dominant, if you don't need to sort the container each time. HTH & Greetings from Bremen, Daniel Kr�gler -- [ See http://www.gotw.ca/resources/clcm.htm for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ]
From: Nick Hounsome on 22 Jun 2010 08:28 On 22 June, 00:34, Brad <byte8b...(a)gmail.com> wrote: > I think there are better (more efficient) ways to do this, but I've > only ever needed this simple approach: > > void compare(const std::vector<std::string>& strings) > { > // Normally, this is randomly generated. It is a C style > string. > char the_string[4] = "abc"; > > std::vector<std::string>::const_iterator send(strings.end()); > > for (std::vector<std::string>::const_iterator sit = strings.begin(); > sit != send; ++sit) > { > if (the_string == *sit) > std::clog << "MATCH:\t" << *sit << std::endl; > } > > } > > Basically, I'm comparing a randomly generated string to a vector of > strings by iterating through the vector (as seen above). Here is the > problem: The vector may have one or millions of strings and execution > time slows considerably when I go above a few thousand strings. > > Is there a more efficient way to approach this? My code works fine on > smaller vectors, I just never expected to do millions and it may not > be possible to do that many as quickly as a few, but I wanted to ask. > Perhaps there is something in algorithm that I could use. Just point > me to it. I prefer standard C++ wherever possible. If you are comparing random strings against the same vector multiple times then the way to go is hashing. If the vector is replaced by a hashed set (new std::unordered_set) then the lookup will be much faster in general. Even plain old std::set will be faster (find must be logarithmic or better). -- [ See http://www.gotw.ca/resources/clcm.htm for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ]
|
Next
|
Last
Pages: 1 2 3 4 Prev: New types defined in a return type Next: Generating a derived class from a base class |