From: skillzero on 27 Jul 2010 20:00

I have two sets of numbers and I'd like to determine how similar they
are. The data comes from measuring the delay between pulses of light.
Reference data sets for different signal types will be measured, and
then as new data sets come in, each new data set needs to be matched
against one of the previously measured reference data sets to find the
closest match. The order of elements in the data set and the number of
elements are important for what I'm doing. For example:

Sample 1 (reference): 1000, 500, 1000, 500, 500, 500
Sample 2:              980, 480, 1100, 600, 550, 540
vs bad sample:         500, 500, 1000, 500, 500, 500  (first 500 is way off reference of 1000)

I've read that a Pearson correlation may be a good way to measure how
close the two sets of data are. I've also read about a Spearman rank
correlation, but I'm not sure which is more appropriate or if there are
better ways to do this.
From: Henry on 28 Jul 2010 09:45

On 28 July, 01:00, skillzero <skillz...(a)gmail.com> wrote:
> I have two sets of numbers and I'd like to determine how similar they
> are. The data comes from measuring the delay between pulses of light.
> Reference data sets for different signal types will be measured, and
> then as new data sets come in, each new data set needs to be matched
> against one of the previously measured reference data sets to find the
> closest match. The order of elements in the data set and the number of
> elements are important for what I'm doing. For example:
>
> Sample 1 (reference): 1000, 500, 1000, 500, 500, 500
> Sample 2:              980, 480, 1100, 600, 550, 540
> vs bad sample:         500, 500, 1000, 500, 500, 500  (first 500 is way off reference of 1000)
>
> I've read that a Pearson correlation may be a good way to measure how
> close the two sets of data are. I've also read about a Spearman rank
> correlation, but I'm not sure which is more appropriate or if there
> are better ways to do this.

Pearson should be used where you think there is a linear relationship
between the data sets (i.e. it ignores different locations and scales in
the two sets of data, but nothing more complicated). Spearman allows any
monotonic relationship (one data set tends to rise when the other does)
even if that relationship is unknown, but it throws away the cardinal
information in the data while using the ordinal information.
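A minimal sketch of how the two correlations could be computed on the
sample data above, assuming Python with SciPy installed (the variable
names are illustrative, not from the thread):

from scipy.stats import pearsonr, spearmanr

reference = [1000, 500, 1000, 500, 500, 500]
sample2   = [ 980, 480, 1100, 600, 550, 540]
bad       = [ 500, 500, 1000, 500, 500, 500]

for name, sample in (("sample2", sample2), ("bad", bad)):
    r_p, _ = pearsonr(reference, sample)    # linear association
    r_s, _ = spearmanr(reference, sample)   # rank-based (monotonic) association
    print(f"{name}: Pearson={r_p:.3f}  Spearman={r_s:.3f}")

Pearson rewards samples whose values move linearly with the reference,
while Spearman looks only at the ordering of the values, so two samples
with the same ranks score identically even if their magnitudes differ.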
From: Ilmari Karonen on 28 Jul 2010 11:06

On 2010-07-28, skillzero <skillzero(a)gmail.com> wrote:
> I have two sets of numbers and I'd like to determine how similar they
> are. The data comes from measuring the delay between pulses of light.
> Reference data sets for different signal types will be measured, and
> then as new data sets come in, each new data set needs to be matched
> against one of the previously measured reference data sets to find the
> closest match. The order of elements in the data set and the number of
> elements are important for what I'm doing. For example:
>
> Sample 1 (reference): 1000, 500, 1000, 500, 500, 500
> Sample 2:              980, 480, 1100, 600, 550, 540
> vs bad sample:         500, 500, 1000, 500, 500, 500  (first 500 is way off reference of 1000)
>
> I've read that a Pearson correlation may be a good way to measure how
> close the two sets of data are. I've also read about a Spearman rank
> correlation, but I'm not sure which is more appropriate or if there
> are better ways to do this.

Are you perhaps overcomplicating the problem? If the example you gave
is indeed representative of your problem (i.e. matching data sets
always have the same number of elements, and order matters), wouldn't
something as simple as the (Euclidian) distance between the data sets
as vectors work as a measure of similarity?

(You also haven't said anything about how you expect the noise in your
data to be distributed. If you expect noise to be additive and of
similar magnitude for all elements, simple Euclidian distance ought to
work well. If you expect multiplicative noise, you may want to take
the logarithm of your data values first instead.)

--
Ilmari Karonen

To reply by e-mail, please replace ".invalid" with ".net" in address.
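A minimal sketch of that suggestion, assuming Python with NumPy (the
function name and the log-transform option are illustrative, not taken
from the post):

import numpy as np

def distance(a, b, log_transform=False):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if log_transform:
        # For multiplicative noise, compare logarithms instead.
        a, b = np.log(a), np.log(b)
    return np.sqrt(np.sum((a - b) ** 2))   # Euclidean distance

reference = [1000, 500, 1000, 500, 500, 500]
sample2   = [ 980, 480, 1100, 600, 550, 540]
bad       = [ 500, 500, 1000, 500, 500, 500]

print(distance(reference, sample2))   # ~158: close to the reference
print(distance(reference, bad))       # ~505: first element is way off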
From: skillzero on 28 Jul 2010 17:07

On Jul 28, 8:06 am, Ilmari Karonen <usen...(a)vyznev.invalid> wrote:
> On 2010-07-28, skillzero <skillz...(a)gmail.com> wrote:
>
> > I have two sets of numbers and I'd like to determine how similar they
> > are. The data comes from measuring the delay between pulses of light.
> > Reference data sets for different signal types will be measured, and
> > then as new data sets come in, each new data set needs to be matched
> > against one of the previously measured reference data sets to find the
> > closest match. The order of elements in the data set and the number of
> > elements are important for what I'm doing. For example:
> >
> > Sample 1 (reference): 1000, 500, 1000, 500, 500, 500
> > Sample 2:              980, 480, 1100, 600, 550, 540
> > vs bad sample:         500, 500, 1000, 500, 500, 500  (first 500 is way off reference of 1000)
> >
> > I've read that a Pearson correlation may be a good way to measure how
> > close the two sets of data are. I've also read about a Spearman rank
> > correlation, but I'm not sure which is more appropriate or if there
> > are better ways to do this.
>
> Are you perhaps overcomplicating the problem? If the example you gave
> is indeed representative of your problem (i.e. matching data sets
> always have the same number of elements, and order matters), wouldn't
> something as simple as the (Euclidian) distance between the data sets
> as vectors work as a measure of similarity?
>
> (You also haven't said anything about how you expect the noise in your
> data to be distributed. If you expect noise to be additive and of
> similar magnitude for all elements, simple Euclidian distance ought to
> work well. If you expect multiplicative noise, you may want to take
> the logarithm of your data values first instead.)

The data sets I'm comparing have the same number of elements and should
be in the same order. I initially tried calculating a sum of the
differences between each element, where the lowest result won. That
seems similar to a Euclidian distance, if I understand it correctly. It
didn't seem to work very well, but it may have just been that I was
doing it incorrectly (or maybe I'm misunderstanding Euclidian distance).

What I'm doing is recording a reference stream of light pulses, saving
that, and then later, when I see light pulses, I try to match the new
light pulses to one of the reference light pulses I recorded earlier.
Noise can come in a variety of ways (light emitter at different
distances, different angles, etc.), so I don't know if I can say it'll
be additive or multiplicative.

Each data point is a light pulse width (i.e. how long the light was on,
in microseconds). The width of the light pulse indicates whether it's a
1 or a 0, but in my case I'm just trying to match the closest signal and
not necessarily determine the exact pattern of 1's and 0's. So I think
I'm just looking for a good statistical way to match one pattern with
another.
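A rough sketch of that matching step, assuming Python with NumPy (the
reference names and the choice of Euclidean distance are illustrative
assumptions, not from the thread):

import numpy as np

# Previously recorded pulse-width patterns, one per known signal type.
references = {
    "signal_a": np.array([1000, 500, 1000, 500, 500, 500], dtype=float),
    "signal_b": np.array([ 500, 500, 1000, 500, 500, 500], dtype=float),
}

def closest_reference(sample, references):
    sample = np.asarray(sample, dtype=float)
    # Euclidean distance to each reference with the same element count.
    distances = {name: float(np.linalg.norm(sample - ref))
                 for name, ref in references.items()
                 if ref.shape == sample.shape}
    return min(distances, key=distances.get)

print(closest_reference([980, 480, 1100, 600, 550, 540], references))
# -> "signal_a"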
From: Ilmari Karonen on 28 Jul 2010 18:05
On 2010-07-28, skillzero <skillzero(a)gmail.com> wrote:
> On Jul 28, 8:06 am, Ilmari Karonen <usen...(a)vyznev.invalid> wrote:
>>
>> Are you perhaps overcomplicating the problem? If the example you gave
>> is indeed representative of your problem (i.e. matching data sets
>> always have the same number of elements, and order matters), wouldn't
>> something as simple as the (Euclidean) distance between the data sets
>> as vectors work as a measure of similarity?
>
> The data sets I'm comparing have the same number of elements and
> should be in the same order. I initially tried calculating a sum of
> the differences between each element, where the lowest result won. That
> seems similar to a Euclidean distance, if I understand it correctly.
> It didn't seem to work very well, but it may have just been that I was
> doing it incorrectly (or maybe I'm misunderstanding Euclidean
> distance).

That's the L^1 distance. For the Euclidean (a.k.a. L^2) distance, you
need to sum the squares of the differences and take the square root of
the result. (Although, if you're just interested in comparing distances
to see which is smallest, you can leave out the square root step.)

There are other possible distance norms you could try, but the Euclidean
distance is optimal if the noise in your data values is additive,
independent and normally distributed (with zero mean and constant
variance). Since many types of real-world noise are at least
approximately like that, it's a good first choice to try.

(ps. Yeah, I misspelled "Euclidean" in my previous post; I've fixed it
above.)

--
Ilmari Karonen

To reply by e-mail, please replace ".invalid" with ".net" in address.
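A small sketch of the two distances described above, assuming Python
with NumPy (function names are illustrative):

import numpy as np

def l1_distance(a, b):
    # Sum of absolute differences (the L^1 distance).
    return float(np.sum(np.abs(np.asarray(a, float) - np.asarray(b, float))))

def squared_l2_distance(a, b):
    # Sum of squared differences; the square root is omitted because it
    # does not change which reference ends up closest.
    return float(np.sum((np.asarray(a, float) - np.asarray(b, float)) ** 2))

reference = [1000, 500, 1000, 500, 500, 500]
bad       = [ 500, 500, 1000, 500, 500, 500]

print(l1_distance(reference, bad))          # 500.0
print(squared_l2_distance(reference, bad))  # 250000.0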