From: skillzero on
I have two sets of numbers and I'd like to determine how similar they
are. The data comes from measuring the delay between pulses of light.
Reference data sets for different signal types will be measured, and
then as new data sets come in, each new data set needs to be matched
against one of the previously measured reference data sets to find the
closest match. The order of elements in the data set and the number of
elements are important for what I'm doing. For example:

Sample 1 (reference):  1000, 500, 1000, 500, 500, 500
Sample 2:               980, 480, 1100, 600, 550, 540
vs bad sample:          500, 500, 1000, 500, 500, 500 (first 500 is
way off reference of 1000)

I've read that a Pearson correlation may be a good way to measure how
close the two sets of data are. I've also read about a Spearman rank
correlation, but I'm not sure which is more appropriate or if there
are better ways to do this.
From: Henry on
On 28 July, 01:00, skillzero <skillz...(a)gmail.com> wrote:
> I have two sets of numbers and I'd like to determine how similar they
> are. The data comes from measuring the delay between pulses of light.
> Reference data sets for different signal types will be measured, and
> then as new data sets come in, each new data set needs to be matched
> against one of the previously measured reference data sets to find the
> closest match. The order of elements in the data set and the number of
> elements are important for what I'm doing. For example:
>
> Sample 1 (reference):  1000, 500, 1000, 500, 500, 500
> Sample 2:               980, 480, 1100, 600, 550, 540
> vs bad sample:          500, 500, 1000, 500, 500, 500 (first 500 is
> way off reference of 1000)
>
> I've read that a Pearson correlation may be a good way to measure how
> close the two sets of data are. I've also read about a Spearman rank
> correlation, but I'm not sure which is more appropriate or if there
> are better ways to do this.

Pearson should be used where you think there is a linear relationship
between the data (i.e. it ignores differences in location and scale
between the two sets of data, but nothing more complicated).

Spearman allows for any monotonic relationship (one data set tends to
rise when the other does), even if that relationship is unknown, but it
throws away the cardinal information in the data and uses only the
ordinal (rank) information.

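If you want to try Pearson, it only takes a few lines; a rough,
untested C sketch might look like the one below (the function and
variable names are just placeholders). Spearman is the same formula
applied to the ranks of the values rather than to the values
themselves.

#include <math.h>
#include <stddef.h>

/* Pearson correlation of two equal-length arrays x and y.
   Returns a value in [-1, 1]; +1 means a perfect increasing linear fit. */
double pearson(const double *x, const double *y, size_t n)
{
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (size_t i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    double cov = n * sxy - sx * sy;   /* n^2 times the covariance */
    double vx  = n * sxx - sx * sx;   /* n^2 times the variance of x */
    double vy  = n * syy - sy * sy;   /* n^2 times the variance of y */
    return cov / sqrt(vx * vy);       /* caller must ensure vx, vy > 0 */
}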

From: Ilmari Karonen on
On 2010-07-28, skillzero <skillzero(a)gmail.com> wrote:
> I have two sets of numbers and I'd like to determine how similar they
> are. The data comes from measuring the delay between pulses of light.
> Reference data sets for different signal types will be measured, and
> then as new data sets come in, each new data set needs to be matched
> against one of the previously measured reference data sets to find the
> closest match. The order of elements in the data set and the number of
> elements are important for what I'm doing. For example:
>
> Sample 1 (reference): 1000, 500, 1000, 500, 500, 500
> Sample 2: 980, 480, 1100, 600, 550, 540
> vs bad sample: 500, 500, 1000, 500, 500, 500 (first 500 is
> way off reference of 1000)
>
> I've read that a Pearson correlation may be a good way to measure how
> close the two sets of data are. I've also read about a Spearman rank
> correlation, but I'm not sure which is more appropriate or if there
> are better ways to do this.

Are you perhaps overcomplicating the problem? If the example you gave
is indeed representative of your problem (i.e. matching data sets
always have the same number of elements, order matters), wouldn't
something as simple as the (Euclidian) distance between the data sets
as vectors work as a measure of similarity?

(You also haven't said anything about how you expect the noise in your
data to be distributed. If you expect noise to be additive and of
similar magnitude for all elements, simple Euclidian distance ought to
work well. If you expect multiplicative noise, you may want to take
the logarithm of your data values first instead.)
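
For concreteness, a plain C sketch of that distance might look like
the following (untested, and the names are only placeholders); for the
multiplicative case you would log-transform both vectors before
calling it:

#include <math.h>
#include <stddef.h>

/* Euclidean (L2) distance between two equal-length vectors of pulse
   widths. A smaller result means a closer match. */
double euclidean_dist(const double *a, const double *b, size_t n)
{
    double sum = 0;
    for (size_t i = 0; i < n; i++) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return sqrt(sum);
}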

--
Ilmari Karonen
To reply by e-mail, please replace ".invalid" with ".net" in address.
From: skillzero on
On Jul 28, 8:06 am, Ilmari Karonen <usen...(a)vyznev.invalid> wrote:
> On 2010-07-28, skillzero <skillz...(a)gmail.com> wrote:
>
> > I have two sets of numbers and I'd like to determine how similar they
> > are. The data comes from measuring the delay between pulses of light.
> > Reference data sets for different signal types will be measured, and
> > then as new data sets come in, each new data set needs to be matched
> > against one of the previously measured reference data sets to find the
> > closest match. The order of elements in the data set and the number of
> > elements are important for what I'm doing. For example:
>
> > Sample 1 (reference):  1000, 500, 1000, 500, 500, 500
> > Sample 2:               980, 480, 1100, 600, 550, 540
> > vs bad sample:          500, 500, 1000, 500, 500, 500 (first 500 is
> > way off reference of 1000)
>
> > I've read that a Pearson correlation may be a good way to measure how
> > close the two sets of data are. I've also read about a Spearman rank
> > correlation, but I'm not sure which is more appropriate or if there
> > are better ways to do this.
>
> Are you perhaps overcomplicating the problem?  If the example you gave
> is indeed representative of your problem (i.e. matching data sets
> always have the same number of elements, order matters), wouldn't
> something as simple as the (Euclidian) distance between the data sets
> as vectors work as a measure of similarity?
>
> (You also haven't said anything about how you expect the noise in your
> data to be distributed.  If you expect noise to be additive and of
> similar magnitude for all elements, simple Euclidian distance ought to
> work well.  If you expect multiplicative noise, you may want to take
> the logarithm of your data values first instead.)

The data sets I'm comparing have the same number of elements and
should be in the same order. I initially tried calculating a sum of
the differences between corresponding elements, where the lowest total
won. That seems similar to a Euclidian distance, if I understand it
correctly. It didn't seem to work very well, but it may have just been
that I was doing it incorrectly (or maybe I'm misunderstanding
Euclidian distance).

What I'm doing is recording a reference stream of light pulses, saving
that, and then later when I see light pulses, I try to match the new
light pulses to one of the reference light pulses I recorded earlier.
Noise can come in a variety of ways (light emitter at different
distances, different angles, etc.) so I don't know if I can say it'll
be additive or multiplicative. Each data point is a light pulse width
(i.e. how long the light was on in microseconds). The width of the
light pulse indicates if it's a 1 or a 0, but in my case, I'm just
trying to match the closest signal and not necessarily determine the
exact pattern of 1's or 0's.

So I think I'm just looking for a good statistical way to match one
pattern with another.
From: Ilmari Karonen on
On 2010-07-28, skillzero <skillzero(a)gmail.com> wrote:
> On Jul 28, 8:06 am, Ilmari Karonen <usen...(a)vyznev.invalid> wrote:
>>
>> Are you perhaps overcomplicating the problem? If the example you gave
>> is indeed representative of your problem (i.e. matching data sets
>> always have the same number of elements, order matters), wouldn't
>> something as simple as the (Euclidean) distance between the data sets
>> as vectors work as a measure of similarity?
>
> The data sets I'm comparing have the same number of elements and
> should be in the same order. I initially tried calculating a sum of
> the differences between corresponding elements, where the lowest total
> won. That seems similar to a Euclidean distance, if I understand it
> correctly. It didn't seem to work very well, but it may have just been
> that I was doing it incorrectly (or maybe I'm misunderstanding
> Euclidean distance).

That's the L^1 distance. For the Euclidean (a.k.a. L^2) distance, you
need to sum the squares of the differences and take the square root of
the result. (Although, if you're just interested in comparing
distances to see which is smallest, you can leave out the square root
step.) There are other possible distance norms you could try, but the
Euclidean distance is optimal if the noise in your data values is
additive, independent and normally distributed (with zero mean and
constant variance). Since many types of real-world noise are at least
approximately like that, it's a good first choice to try.
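
Picking the closest reference could then be as simple as this rough,
untested C sketch (the function names and the array-of-pointers layout
for the reference sets are just illustrative):

#include <stddef.h>

/* Sum of squared differences (squared L2 distance). The square root is
   skipped because it doesn't change which reference comes out closest. */
static double sq_dist(const double *a, const double *b, size_t n)
{
    double sum = 0;
    for (size_t i = 0; i < n; i++) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

/* Return the index of the reference vector closest to 'sample'.
   'refs' holds 'nrefs' pointers, each to 'n' pulse widths (placeholder
   layout -- adapt to however the references are actually stored).
   Assumes nrefs >= 1. */
size_t closest_reference(const double *sample, const double *const *refs,
                         size_t nrefs, size_t n)
{
    size_t best = 0;
    double best_d = sq_dist(sample, refs[0], n);
    for (size_t i = 1; i < nrefs; i++) {
        double d = sq_dist(sample, refs[i], n);
        if (d < best_d) {
            best_d = d;
            best = i;
        }
    }
    return best;
}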

(ps. Yeah, I misspelled "Euclidean" in my previous post; I've fixed it
above.)

--
Ilmari Karonen
To reply by e-mail, please replace ".invalid" with ".net" in address.