Comparing Multiple Strings More Efficiently [C++]

Prev: New types defined in a return type
Next: Generating a derived class from a base class

From: Ian on 22 Jun 2010 09:09

George Neuner wrote:
>
> RegEx? A dedicated string search algorithm such as Boyer-Moore or
> Knuth-Morris-Pratt or Rabin-Karp?
>
.....
> Current compilers should have RegEx either in std or in tr1.
> Alternatively, Boost has RegEx.
>

I don't think RegEx is relevant here. If I understand Brad's problem it
is simply looking for exact string matches in a vector of candidate strings.

--
Ian

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Dave Harris on 22 Jun 2010 09:11

byte8bits(a)gmail.com (Brad) wrote (abridged):
> Basically, I'm comparing a randomly generated string to a vector of
> strings by iterating through the vector (as seen above). Here is the
> problem: The vector may have one or millions of strings and
> execution time slows considerably when I go above a few thousand
> strings.
>
> Is there a more efficient way to approach this?

Can you sort the vector and use binary search, or replace it with a
multiset or hash table, or a map that yields their original indexes?

-- Dave Harris, Nottingham, UK.

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Ian on 22 Jun 2010 09:10

Brad wrote:
>
> Is there a more efficient way to approach this? My code works fine on
> smaller vectors, I just never expected to do millions and it may not
> be possible to do that many as quickly as a few, but I wanted to ask.
> Perhaps there is something in algorithm that I could use. Just point
> me to it. I prefer standard C++ wherever possible.

It depends on the details of what you are trying to do. In your example
code, you appear to be looking for *all* matches within the vector. If
the problem is really to check whether there is *any* match then an
obvious improvement would be to terminate the search on the first match.
And if this is what you are doing, it would be better expressed using
the std::find algorithm, though it wouldn't be more efficient.

As Seungbeom suggested, you are not going to be able improve on this
significantly unless your application is such that your data (your
vector of strings) is set up once (e.g. at the beginning of the program)
and searched many times. In this case, as Seungbeom said, it would be
worth putting the strings into a set or multiset, or sorting the vector
and using the std::binary_search algorithm.

--
Ian

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Brad on 22 Jun 2010 23:33

On Jun 22, 7:29 pm, Daniel Kr�gler <daniel.krueg...(a)googlemail.com>
wrote:

> It depends. If you can ensure that the input-vector is sorted it
> could improve the performance drastically by instead using
> a binary search algorithm of the random string within the
> sequence (e.g. lower_bound). Take care that the predicate
> used for the sorting is the same (at least equivalent) to that
> used in the binary search.
>
> The advantage of this approach is especially dominant, if
> you don't need to sort the container each time.
>
> HTH & Greetings from Bremen,
>
> Daniel Kr�gler

Yes, it does help. I only sort once before passing the container and
this approach is indeed very fast in my testing.

Paavo (from comp.lang.c++) suggested the same thing (sort and then
binary_search). After making that change, the program can do one
string in about 20 seconds
(depending on CPU) and one million strings in 120 seconds. This
approach seems faster than set and find approach and much faster than
my simple approach.

The one to one million issue comes from the fact that users may
specify one string as an argument or a text file that is read line by
line and may be many lines long.

Also, I did not realize my post here had been accepted so I asked the
same question in comp.lang.c++ after posting here and waiting a few
hours. And other replies I attempted to make here were rejected.
Hopefully this one gets through.

Thanks to everyone,

Brad

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: M Stefan on 22 Jun 2010 23:34

On Jun 22, 2:34 am, Brad <byte8b...(a)gmail.com> wrote:
> I think there are better (more efficient) ways to do this, but I've
> only ever needed this simple approach:
>
> void compare(const std::vector<std::string>& strings)
> {
> // Normally, this is randomly generated. It is a C style
> string.
> char the_string[4] = "abc";
>
> std::vector<std::string>::const_iterator send(strings.end());
>
> for (std::vector<std::string>::const_iterator sit = strings.begin();
> sit != send; ++sit)
> {
> if (the_string == *sit)
> std::clog << "MATCH:\t" << *sit << std::endl;
> }
>
> }
>
> Basically, I'm comparing a randomly generated string to a vector of
> strings by iterating through the vector (as seen above). Here is the
> problem: The vector may have one or millions of strings and execution
> time slows considerably when I go above a few thousand strings.
>
> Is there a more efficient way to approach this? My code works fine on
> smaller vectors, I just never expected to do millions and it may not
> be possible to do that many as quickly as a few, but I wanted to ask.
> Perhaps there is something in algorithm that I could use. Just point
> me to it. I prefer standard C++ wherever possible.
>
> Thanks,
>
> Brad
>

Implementing a trie data structure will provide very efficient
substring lookup
after the strings have been inserted into the trie.

http://en.wikipedia.org/wiki/Trie
http://en.wikipedia.org/wiki/Patricia_trie

Yours,
Stefan

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

First | Prev | Next | Last
Pages: 1 2 3 4
Prev: New types defined in a return type
Next: Generating a derived class from a base class