From: Sumeet T on

Hi,

I have a huge data set comprising three vectors, say X, Y, Z. The vector 'Z' depends nonlinearly on X and Y. Each vector contains about 10K elements.

I wish to obtain the best fit for this data without making the fit 'perfect'. By that I mean that data points which are too far off/scattered may be ignored. I would then like to measure the scatter. Ignoring some data points helps to establish a simple fit, as opposed to the complex fit obtained by including the widely scattered points. Such a complex fit would not be of much use to me, as it becomes case-specific and could not be used elsewhere.

I am struggling to get started on this as I have not used the Optimization Toolbox in the past. I would appreciate feedback and assistance from members of the MathWorks community.

Thanks so much.
From: TideMan on
On Aug 4, 9:18 am, "Sumeet T" <sumeettre...(a)gmail.com> wrote:

First of all, 3 vectors of 10,000 elements each is not a "huge
dataset". After all, there are 86,400 s in a day, so 10K elements is
much less than one day's data at 1 Hz.

Before you can decide which points are outliers that need to be
ignored, you need a model.
I'm not sure why you want to use the optimisation toolbox in
preference to mldivide, with which you could fit a linear model like this:
coef=[X Y ones(length(X),1)]\Z;
(Note: this assumes X, Y and Z are column vectors. It can be extended to
a model that is nonlinear in X and Y simply by adding columns like X.^2
and so on; the fit stays linear in the coefficients.)
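
To make that extension concrete, here is a rough sketch with a full quadratic surface. The data are synthetic stand-ins and the choice of terms is only my assumption; pick the columns that suit your own Z.

```matlab
% Synthetic stand-in data (assumption: the real X, Y, Z are column vectors)
n = 10000;
X = rand(n,1);
Y = rand(n,1);
Z = 1 + 2*X - 3*Y + 0.5*X.^2 + 4*X.*Y - Y.^2 + 0.01*randn(n,1);

% Design matrix: one column per term. The surface is curved in X and Y,
% but the problem is still linear in the unknown coefficients.
A = [ones(n,1) X Y X.^2 X.*Y Y.^2];
coef = A\Z;            % least-squares solution via mldivide

resid = Z - A*coef;    % misfit at every data point
```

The vector resid is what you would inspect next to decide which points are too far from the fitted surface.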

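Now, you can figure out which points are outliers (those with large
residuals from the fitted model), exclude them, and repeat the fit on
the good data. If you mark them with NaN, drop those rows before
calling mldivide again, because it does not skip NaNs.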
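
One way to sketch that fit / reject / refit cycle is below. The data are synthetic, and the 3-standard-deviation threshold, the iteration cap, and the variable names (good, k, s) are my assumptions, not anything prescribed. The standard deviation of the residuals of the retained points then serves as the measure of scatter.

```matlab
% Synthetic stand-in data with deliberately corrupted points (assumption)
n = 10000;
X = rand(n,1);
Y = rand(n,1);
Z = 1 + 2*X - 3*Y + 0.05*randn(n,1);
bad = 1:50:n;                    % every 50th point gets a large offset
Z(bad) = Z(bad) + 5;

A = [X Y ones(n,1)];             % same linear model as above
good = true(n,1);                % logical mask of the points in use
k = 3;                           % rejection threshold in std units (assumed)
for iter = 1:10
    coef  = A(good,:)\Z(good);   % fit using the good points only
    resid = Z - A*coef;          % residuals for every point
    s     = std(resid(good));    % scatter of the good points
    newgood = abs(resid) <= k*s; % keep points within k*s of the fit
    if isequal(newgood, good)    % mask unchanged, so stop iterating
        break
    end
    good = newgood;
end
fprintf('scatter = %.4f, rejected = %d of %d points\n', s, sum(~good), n);
```

Using a logical mask rather than overwriting Z keeps the original data intact, so you can plot the rejected points afterwards and check that the threshold is sensible.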