From: Kaba on
Pubkeybreaker wrote:
> On Apr 5, 11:00 am, Kaba <n...(a)here.com> wrote:
> > Hi,
> >
> > I am measuring the time spent by an algorithm. Let's assume it is a
> > gaussian-distributed random variable. How many repetitions do I have to
> > make to get a good estimate of the mean and standard deviation of this
> > distribution?
>
> You need to define what you mean by "good".

Well, I am happy when I am able to convince my readers that algorithm
A is clearly faster than algorithm B :)
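
For that, the usual cookbook route is a confidence interval for each
mean, or a two-sample test between the two algorithms' timing samples.
A minimal sketch, assuming roughly normal times; the times_a/times_b
arrays are made-up stand-ins for real measurements, and the equal_var
keyword needs a reasonably recent SciPy:

from numpy import array, mean, std, sqrt
from scipy import stats

# Hypothetical measured runtimes (seconds) for algorithms A and B
times_a = array([1.02, 0.98, 1.05, 1.01, 0.99])
times_b = array([1.20, 1.18, 1.25, 1.22, 1.19])

# 95% confidence interval for the mean runtime of A
n = len(times_a)
m = mean(times_a)
s = std(times_a, ddof=1)                        # sample standard deviation
half = stats.t.ppf(0.975, n - 1) * s / sqrt(n)
print 'mean(A) = %.3f +/- %.3f' % (m, half)

# Welch's t-test: do A and B differ in mean runtime?
t, p = stats.ttest_ind(times_a, times_b, equal_var=False)
print 't = %.2f, two-sided p = %.3g' % (t, p)

The half-width of the interval shrinks like s/sqrt(n), so halving it
costs four times the repetitions, which is about as close to a cookbook
answer as the question allows.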

--
http://kaba.hilvi.org
From: Man. on
Cookbook/FittingDataPage

1. Fit examples with sinusoidal functions
   1. Generating the data
   2. Fitting the data
   3. A clever use of the cost function
2. Simplifying the syntax
3. Fitting gaussian-shaped data
   1. Calculating the moments of the distribution
   2. Fitting a 2D gaussian
4. Fitting a power-law to data with errors
   1. Generating the data
   2. Fitting the data
from pylab import *
from scipy import *
from scipy import optimize  # used below for optimize.leastsq

# Generate data points with noise
num_points = 150
Tx = linspace(5., 8., num_points)
Ty = Tx

tX = 11.86*cos(2*pi/0.81*Tx-1.32) + 0.64*Tx + 4*((0.5-rand(num_points))*exp(2*rand(num_points)**2))
tY = -32.14*cos(2*pi/0.8*Ty-1.94) + 0.15*Ty + 7*((0.5-rand(num_points))*exp(2*rand(num_points)**2))
# Fit the first set
fitfunc = lambda p, x: p[0]*cos(2*pi/p[1]*x+p[2]) + p[3]*x  # Target function
errfunc = lambda p, x, y: fitfunc(p, x) - y  # Distance to the target function
p0 = [-15., 0.8, 0., -1.]  # Initial guess for the parameters
p1, success = optimize.leastsq(errfunc, p0[:], args=(Tx, tX))

time = linspace(Tx.min(), Tx.max(), 100)
plot(Tx, tX, "ro", time, fitfunc(p1, time), "r-")  # Plot of the data and the fit

# Fit the second set
p0 = [-15., 0.8, 0., -1.]
p2, success = optimize.leastsq(errfunc, p0[:], args=(Ty, tY))

time = linspace(Ty.min(), Ty.max(), 100)
plot(Ty, tY, "b^", time, fitfunc(p2, time), "b-")

# Legend the plot
title("Oscillations in the compressed trap")
xlabel("time [ms]")
ylabel("displacement [um]")
legend(('x position', 'x fit', 'y position', 'y fit'))

ax = axes()

text(0.8, 0.07,
     'x freq : %.3f kHz \n y freq : %.3f kHz' % (1/p1[1], 1/p2[1]),
     fontsize=16,
     horizontalalignment='center',
     verticalalignment='center',
     transform=ax.transAxes)

show()
# Target function
fitfunc = lambda T, p, x: p[0]*cos(2*pi/T*x+p[1]) + p[2]*x
# Initial guess for the first set's parameters
p1 = r_[-15., 0., -1.]
# Initial guess for the second set's parameters
p2 = r_[-15., 0., -1.]
# Initial guess for the common period
T = 0.8
# Vector of the parameters to fit, it contains all the parameters of the problem!
p = r_[T, p1, p2]
# Cost function of the fit, compare it to the previous example.
errfunc = lambda p, x1, y1, x2, y2: r_[
    fitfunc(p[0], p[1:4], x1) - y1,
    fitfunc(p[0], p[4:7], x2) - y2
]
# This time we need to pass the two sets of data, there are thus four "args".
p, success = optimize.leastsq(errfunc, p, args=(Tx, tX, Ty, tY))

# Plot of the first data and the fit
time = linspace(Tx.min(), Tx.max(), 100)
plot(Tx, tX, "ro", time, fitfunc(p[0], p[1:4], time), "r-")

# Plot of the second data and the fit
time = linspace(Ty.min(), Ty.max(), 100)
plot(Ty, tY, "b^", time, fitfunc(p[0], p[4:7], time), "b-")

# Legend the plot
title("Oscillations in the compressed trap")
xlabel("time [ms]")
ylabel("displacement [um]")
legend(('x position', 'x fit', 'y position', 'y fit'))

ax = axes()

text(0.8, 0.07,
     'x freq : %.3f kHz' % (1/p[0]),
     fontsize=16,
     horizontalalignment='center',
     verticalalignment='center',
     transform=ax.transAxes)

show()
from scipy import optimize
from numpy import *

class Parameter:
    def __init__(self, value):
        self.value = value

    def set(self, value):
        self.value = value

    def __call__(self):
        return self.value

def fit(function, parameters, y, x=None):
    def f(params):
        i = 0
        for p in parameters:
            p.set(params[i])
            i += 1
        return y - function(x)

    if x is None: x = arange(y.shape[0])
    p = [param() for param in parameters]
    p, success = optimize.leastsq(f, p)
    f(p)  # push the best-fit values back into the Parameter objects
# giving initial parameters
mu = Parameter(7)
sigma = Parameter(3)
height = Parameter(5)

# define your function:
def f(x): return height() * exp(-((x-mu())/sigma())**2)

# fit! (given that data is an array with the data to fit)
fit(f, [mu, sigma, height], data)
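
After the call, each Parameter object holds its fitted value, so the
results can be read back directly. A complete round trip on synthetic
data (this usage example is an addition, not part of the original page;
note the fitted sigma is only determined up to sign):

from numpy import arange, exp
from numpy.random import randn

x = arange(100.)
data = 4.0 * exp(-((x - 40.0) / 5.0)**2) + 0.1 * randn(100)

mu, sigma, height = Parameter(30.), Parameter(3.), Parameter(1.)
def f(x): return height() * exp(-((x - mu()) / sigma())**2)

fit(f, [mu, sigma, height], data, x)
print mu(), sigma(), height()   # roughly 40, 5, 4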
from pylab import *

gaussian = lambda x: 3*exp(-(30-x)**2/20.)

data = gaussian(arange(100))

plot(data)

X = arange(data.size)
x = sum(X*data)/sum(data)
width = sqrt(abs(sum((X-x)**2*data)/sum(data)))

height = data.max()   # renamed from "max" to avoid shadowing the builtin

fit = lambda t: height*exp(-(t-x)**2/(2*width**2))

plot(fit(X))

show()
from numpy import *
from scipy import optimize

def gaussian(height, center_x, center_y, width_x, width_y):
    """Returns a gaussian function with the given parameters"""
    width_x = float(width_x)
    width_y = float(width_y)
    return lambda x,y: height*exp(
                -(((center_x-x)/width_x)**2+((center_y-y)/width_y)**2)/2)

def moments(data):
    """Returns (height, x, y, width_x, width_y)
    the gaussian parameters of a 2D distribution by calculating its
    moments """
    total = data.sum()
    X, Y = indices(data.shape)
    x = (X*data).sum()/total
    y = (Y*data).sum()/total
    col = data[:, int(y)]
    # The column runs along the x axis, so measure its spread about x
    width_x = sqrt(abs((arange(col.size)-x)**2*col).sum()/col.sum())
    row = data[int(x), :]
    # The row runs along the y axis, so measure its spread about y
    width_y = sqrt(abs((arange(row.size)-y)**2*row).sum()/row.sum())
    height = data.max()
    return height, x, y, width_x, width_y

def fitgaussian(data):
    """Returns (height, x, y, width_x, width_y)
    the gaussian parameters of a 2D distribution found by a fit"""
    params = moments(data)
    errorfunction = lambda p: ravel(gaussian(*p)(*indices(data.shape)) - data)
    p, success = optimize.leastsq(errorfunction, params)
    return p
from pylab import *
# Create the gaussian data
Xin, Yin = mgrid[0:201, 0:201]
data = gaussian(3, 100, 100, 20, 40)(Xin, Yin) + random.random(Xin.shape)

matshow(data, cmap=cm.gist_earth_r)

params = fitgaussian(data)
fit = gaussian(*params)

contour(fit(*indices(data.shape)), cmap=cm.copper)
ax = gca()
(height, x, y, width_x, width_y) = params

text(0.95, 0.05, """
x : %.1f
y : %.1f
width_x : %.1f
width_y : %.1f""" % (x, y, width_x, width_y),
     fontsize=16, horizontalalignment='right',
     verticalalignment='bottom', transform=ax.transAxes)

show()
from pylab import *
from scipy import *

# Define function for calculating a power law
powerlaw = lambda x, amp, index: amp * (x**index)

##########
# Generate data points with noise
##########
num_points = 20

# Note: all positive, non-zero data
xdata = linspace(1.1, 10.1, num_points)
ydata = powerlaw(xdata, 10.0, -2.0)   # simulated perfect data
yerr = 0.2 * ydata                    # simulated errors (20%)

ydata += randn(num_points) * yerr     # simulated noisy data
##########
# Fitting the data -- Least Squares Method
##########

# Power-law fitting is best done by first converting
# to a linear equation and then fitting to a straight line.
#
#  y = a * x^b
#  log(y) = log(a) + b*log(x)
#

logx = log10(xdata)
logy = log10(ydata)
logyerr = yerr / ydata

# define our (line) fitting function
fitfunc = lambda p, x: p[0] + p[1] * x
errfunc = lambda p, x, y, err: (y - fitfunc(p, x)) / err

pinit = [1.0, -1.0]
out = optimize.leastsq(errfunc, pinit,
                       args=(logx, logy, logyerr), full_output=1)

pfinal = out[0]
covar = out[1]
print pfinal
print covar

index = pfinal[1]
amp = 10.0**pfinal[0]

# pfinal[0] is log10(amp) and pfinal[1] is the index, so the
# uncertainties come from the matching diagonal entries of covar;
# the factor log(10) propagates the log10-space error to amp.
indexErr = sqrt( covar[1][1] )
ampErr = sqrt( covar[0][0] ) * amp * log(10)

##########
# Plotting data
##########

clf()
subplot(2, 1, 1)
plot(xdata, powerlaw(xdata, amp, index))     # Fit
errorbar(xdata, ydata, yerr=yerr, fmt='k.')  # Data
text(5, 6.5, 'Ampli = %5.2f +/- %5.2f' % (amp, ampErr))
text(5, 5.5, 'Index = %5.2f +/- %5.2f' % (index, indexErr))
title('Best Fit Power Law')
xlabel('X')
ylabel('Y')
xlim(1, 11)

subplot(2, 1, 2)
loglog(xdata, powerlaw(xdata, amp, index))
errorbar(xdata, ydata, yerr=yerr, fmt='k.')  # Data
xlabel('X (log scale)')
ylabel('Y (log scale)')
xlim(1.0, 11)

savefig('power_law_fit.png')
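
As a cross-check, the same power law can also be fitted directly in
linear space. A minimal sketch using scipy.optimize.curve_fit (present
in newer SciPy releases; this alternative is an addition, not part of
the original recipe), reusing xdata, ydata and yerr from above:

from numpy import sqrt, diag
from scipy.optimize import curve_fit

popt, pcov = curve_fit(lambda x, amp, index: amp * x**index,
                       xdata, ydata, p0=[10.0, -2.0], sigma=yerr)
print popt                # best-fit [amp, index]
print sqrt(diag(pcov))    # rough 1-sigma uncertainties

Both routes should agree to within the quoted errors; the log-space fit
tends to be more robust when the data span many decades.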
On Apr 5, 11:03 am, Ludovicus <luir...(a)yahoo.com> wrote:
> On Apr 5, 11:00 am, Kaba <n...(a)here.com> wrote:
>
> > Hi,
>
> > I am measuring the time spent by an algorithm. Let's assume it is a
> > gaussian-distributed random variable. How many repetitions do I have to
> > make to get a good estimate of the mean and standard deviation of this
> > distribution?
>
> > I'd like a cook-book answer this time, because I am in a hurry with
> > these measurements. I know it's 101. Probability is one of my weaker
> > sides.
>
> > -- http://kaba.hilvi.org
>
> How can the time spent by an algorithm be a random variable?
> If it is run on the same computer, the time is a constant.
> If it is run on different computers, the values are not of the same
> genus.
BDROP, EDROP = the number of points to ignore at the
beginning and end of the scan,
respectively. (Initial values = 0.)
BBASE, EBASE = the number of points, excluding BDROP
and EDROP, over which to fit the
baseline at each end of the scan.
(Initial values are 50.)
>PROCEDURE BASESET
:BDROP = CCUR
:BBASE = CCUR - BDROP
:EDROP = H0(NOINT) - CCUR
:EBASE = H0(NOINT) - CCUR - EDROP
:RETURN
:FINISH
>NREGION = 10,30,40,50,78,82,105,128
>NREGION(1) = 10 ; NREGION(2) = 30
>NREGION(3) = 40 ; NREGION(4) = 50
>NREGION(5) = 78 ; NREGION(6) = 82
>NREGION(7) = 105 ; NREGION(8) = 128
>NREGION = 0
or
>NREGION = DEFAULT
>PROCEDURE NRSET(N_R)
:SCALAR N_I
:NREGION = DEFAULT
:IF N_R < 1 THEN; ? 'ILLEGAL ARGUMENT !'; RETURN; END
:N_R = MIN(16,N_R)
:FOR N_I = 2 TO N_R * 2 BY 2
: NREGION(N_I - 1) = CCUR
: NREGION(N_I) = CCUR
: END
:RETURN
:FINISH
>DCBASE
>NFIT = 0
>BASELINE
>DCPCT = 40
>PCBASE PAGE SHOW
>NFIT = 5
>BASELINE PAGE SHOW
>PAGE SHOW
>NFIT = 5 ; BSHAPE
>BSHOW
>PROCEDURE GROUPBASE(FIRST_SCAN, NO_OF_SCNS, FIT_ORDER)
# FIRST_SCAN = the scan number for the first scan in the set.
# NO_OF_SCNS = the number of consecutive scans in the set.
# FIT_ORDER = the order of the Chebyshev polynomial to be fitted.
:SCALAR SCAN_I
:IF NO_OF_SCNS < 1 THEN
: PRINT 'LESS THAN ONE SCAN NOT ALLOWED.'
: RETURN
: END
:SCLEAR
:NREGION = 0
:BDROP = 0 ; EDROP = 0
:FOR SCAN_I = FIRST_SCAN TO (FIRST_SCAN + NO_OF_SCNS - 1)
: GET SCAN_I ; ACCUM
: END
:AVE
:PAGE SHOW
:BASESET
#BASESET = the region-setting Procedure defined in Sec. 7.1
:NFIT = FIT_ORDER
:BASELINE COPY(0,2) BMODEL COPY(0,1)
:FOR SCAN_I = FIRST_SCAN TO (FIRST_SCAN + NO_OF_SCNS - 1)
: GET SCAN_I ; DIFF PAGE SHOW PAUSE(10)
: END
:COPY(2,0) PAGE SHOW
:RETURN
:FINISH
Chebyshev Polynomial    Sinusoid
--------------------    --------
BASELINE                RIPPLE
BSHAPE                  RSHAPE
BSHOW                   RSHOW
BMODEL                  RMODEL
> RPERIOD = 100
> RIPPLE PAGE SHOW
>DCBASE
>PAGE SHOW
>RPERIOD = 100 ; RSHAPE
>RSHOW
>MDBOX = 19; MDBASE
>PAGE SHOW
>RMS
>PRINT 'RMS = ' VRMS
---------------------------------------------------------------------
Adverb    Value  Usage
---------------------------------------------------------------------
FIXH      TRUE   If you know the heights of the Gaussians and want
                 GAUSS to hold their values constant. You must supply
                 values to HEIGHT. The input and output values of
                 HEIGHT will be identical.
          FALSE  [Default] If you want GAUSS to fit the values of the
                 heights; you need not supply values to HEIGHT in
                 this case. GAUSS will return to HEIGHT the best-fit
                 values for the heights of the Gaussians.
FIXC      TRUE   If you know the centers of the Gaussians and want
                 GAUSS to hold their values constant. You must supply
                 values to CENTER. The input and output values of
                 CENTER will be identical.
          FALSE  [Default] If you want GAUSS to fit the values of the
                 centers; you must supply initial guesses to CENTER.
                 GAUSS will return to CENTER the best-fit values for
                 the Gaussian centers.
FIXHW     TRUE   If you know the widths of the Gaussians and want
                 GAUSS to hold their values constant. You must supply
                 values to HWIDTH. The input and output values of
                 HWIDTH will be identical.
          FALSE  [Default] If you want GAUSS to fit the values of the
                 widths; you must supply initial guesses to HWIDTH.
                 GAUSS will return to HWIDTH the best-fit values for
                 the Gaussian widths.
FIXRELH   TRUE   If you know the relative heights of the Gaussians
                 but not the absolute heights. You must supply values
                 for HEIGHT that represent your best guesses to the
                 heights. GAUSS will use these values of HEIGHT(i) as
                 initial guesses, will fit for a uniform scale
                 factor, and will return to HEIGHT your initial
                 guesses multiplied by the fitted scale factor.
          FALSE  [Default] If you don't know the relative heights of
                 the Gaussians.
FIXRELC   TRUE   If you know the relative separations of the
                 Gaussians but not an overall offset for the complete
                 pattern of Gaussians. You must supply values for
                 CENTER that represent your best guesses to the
                 values of the Gaussian centers. GAUSS will use these
                 values of CENTER(i) as initial guesses, will fit for
                 an overall offset to the pattern of Gaussians, and
                 will return to CENTER your input values adjusted by
                 the fitted offset.
          FALSE  [Default] If you don't know the relative separations
                 of the Gaussians.
FIXRELHW  TRUE   If you know the relative widths of the Gaussians but
                 not an overall scale factor for the widths to apply
                 to each Gaussian. You must supply values for HWIDTH
                 that represent your best guesses to the values of
                 the widths. GAUSS will use these values of HWIDTH(i)
                 as initial guesses, will fit for an overall scale
                 factor for the widths, and will return to HWIDTH
                 your input values multiplied by the fitted factor.
          FALSE  [Default] If you don't know the relative widths of
                 the Gaussians.
---------------------------------------------------------------------
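
A rough Python analogue of these switches, in the cookbook style used
earlier (illustrative only, not part of the program documented here):
holding a quantity fixed simply means leaving it out of the parameter
vector handed to the optimizer. For example, with the heights known
(the FIXH case) and two Gaussians:

from numpy import exp, linspace
from scipy import optimize

x = linspace(0., 100., 256)
height = [3.0, 1.5]                      # known heights, held constant (FIXH)

def model(p, x):
    # p packs only the free parameters: two centers, then two widths
    c1, c2, w1, w2 = p
    return (height[0]*exp(-(x - c1)**2/(2*w1**2)) +
            height[1]*exp(-(x - c2)**2/(2*w2**2)))

data = model([30., 70., 4., 6.], x)      # synthetic two-component spectrum
errfunc = lambda p: model(p, x) - data
p_fit, ok = optimize.leastsq(errfunc, [28., 72., 5., 5.])
print p_fit                              # recovers the centers and widths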
>BDROP = 400 ; EDROP = 0
>PEAK
>PROCEDURE SETGAUSS(GAUSS_NUM)
:SCALAR GAUSS_I
:IF GAUSS_NUM < 1 THEN
: PRINT 'LESS THAN ONE GAUSSIAN NOT ALLOWED'
: RETURN
: END
:IF GAUSS_NUM > 24 THEN
: PRINT 'MORE THAN 24 GAUSSIANS NOT ALLOWED'
: RETURN
: END
:NGAUSS = GAUSS_NUM
:CENTER = 0 ; HWIDTH = 0 ; HEIGHT = 0
:PRINT 'CLICK ON ENDS OF REGION OVER WHICH TO FIT.'
:BGAUSS = CCUR
:EGAUSS = CCUR
:IF BGAUSS > EGAUSS THEN
: GAUSS_I = EGAUSS
: EGAUSS = BGAUSS
: BGAUSS = GAUSS_I
: END
:FOR GAUSS_I = 1 TO NGAUSS
: PRINT 'CLICK ON PEAK, POSITIONING VERTICAL CURSORS FIRST.'
: CENTER(GAUSS_I) = CCUR
: PRINT 'CLICK ON HALF POWER POINTS.'
: HWIDTH(GAUSS_I) = ABS(CCUR - CCUR)
: END
:RETURN
:FINISH
>GAUSS
>PAGE SHOW
>GPARTS
>PAGE SHOW
>GDISPLAY
>PAGE SHOW
>GMODEL RLINE RESHOW
>PAGE SHOW
>RESIDUAL RLINE RESHOW
 ____________________________________________________________________
| Chebyshev Baseline    | Sinusoidal Baseline    | Gaussian Fit     |
|-----------------------|------------------------|------------------|
| BSHAPE                | RSHAPE                 | GAUSS            |
|-----------------------|------------------------|------------------|
| BASELINE              | RIPPLE                 | GAUSS RESIDUAL   |
|-----------------------|------------------------|------------------|
| BMODEL                | RMODEL                 | GMODEL           |
|-----------------------|------------------------|------------------|
| BSHOW                 | RSHOW                  | GPARTS           |
|                       |                        | or GDISPLAY      |
|-----------------------|------------------------|------------------|
The equivalent adverbs for the three different classes of fitting
operations are:
 ____________________________________________________________________
| Chebyshev Baseline    | Sinusoidal Baseline    | Gaussian Fit     |
|-----------------------|------------------------|------------------|
| BDROP, EDROP          | BDROP, EDROP           | BGAUSS, EGAUSS   |
| BBASE, EBASE          | BBASE, EBASE           |                  |
|-----------------------|------------------------|------------------|
| NREGION               | NREGION                | GREGION          |
|-----------------------|------------------------|------------------|
| BPARM                 | RPERIOD, RAMPLTDE,     | HEIGHT, CENTER,  |
|                       | RPHASE                 | HWIDTH           |
|-----------------------|------------------------|------------------|
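
For readers who want to try the same baseline idea outside the
reduction program, here is a minimal Python sketch in the style of the
cookbook code above: fit a Chebyshev polynomial only over NREGION-like
line-free windows, then subtract it. All names and data are
illustrative, not part of the program documented above, and
Chebyshev.fit needs a reasonably recent NumPy:

from numpy import arange, concatenate, exp
from numpy.random import randn
from numpy.polynomial import Chebyshev

# A fake 256-channel spectrum: sloping baseline + one line + noise
channels = arange(256.)
spectrum = 0.01*channels + exp(-(channels - 64.)**2/50.) + 0.05*randn(256)

# NREGION-style channel pairs bracketing line-free baseline windows
nregion = [(10, 30), (40, 50), (105, 128), (200, 255)]
idx = concatenate([arange(lo, hi) for lo, hi in nregion])

nfit = 5                                 # NFIT-style polynomial order
cheb = Chebyshev.fit(idx, spectrum[idx], deg=nfit)
corrected = spectrum - cheb(channels)    # the analogue of BASELINE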
From: porky_pig_jr on
On Apr 5, 2:55 pm, Kaba <n...(a)here.com> wrote:
> porky_pig...(a)my-deja.com wrote:
> > Me thinks running the same algorithm as many times as possible and
> > considering the best time as the estimator of the "true running time"
> > is the best you can do.
>
> I would agree, if the OS was the only variation. However, I also vary
> the input pseudo-randomly (similar but not identical input). I am
> assuming that the variation caused by OS is negligible.
>
> -- http://kaba.hilvi.org

Oh, in this case it *might* be a normal distribution, but you should
still do many runs (at least 30 to a hundred) and plot the times, just
to see whether you get a bell curve. If you get something skewed
instead, it might be a different distribution. There are also
non-parametric methods (like the bootstrap) to help you support a claim
that your distribution is normal (or not). That's probably something to
bring to a probability/statistics forum once you have some
observations.

I can't think of any "cookbook" type of recipe.
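
The closest thing is probably: histogram the times, run a normality
test, and bootstrap the mean. A minimal sketch, pylab-style like the
cookbook code earlier in the thread; the times array here is a made-up
stand-in for real measurements:

from pylab import *
from numpy.random import randint
from scipy import stats

times = 1.0 + 0.05*randn(100)      # stand-in for 100 measured runtimes

hist(times, bins=20)               # eyeball the bell curve
show()

k2, p = stats.normaltest(times)    # D'Agostino-Pearson normality test
print 'normaltest p-value: %.3f' % p

# Crude bootstrap: resample with replacement, collect the means
boot = array([mean(times[randint(0, len(times), len(times))])
              for _ in range(1000)])
boot.sort()
print 'bootstrap 95%% CI for the mean: %.3f .. %.3f' % (boot[25], boot[975])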

PPJ.

From: Peter Webb on
You want an analytic solution?

You need two things:

1. The distribution of the input variable (the random number in this case)

2. The function which tells you how the time depends upon the input value.

Consider the Sieve of Eratosthenes. Say its running time varies as n^2
(maybe it doesn't; let's pretend).

Pick a random input n in 0..9 with all cases equally likely. The running
time is k * n^2 with each n equally likely, so you can calculate the mean
and sd directly.

Now draw n in 0..9 according to a Gaussian distribution G(n). Each time
k * n^2 now occurs with probability G(n), and again you can calculate the
mean and sd.

So for an analytic solution, you need to know:

1. The PDF of the input.
2. How the algorithm's running time depends on the input.

Once you have these, you have at least an equation for the mean and sd, and
you can probably solve it analytically.
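
A worked numeric version of that recipe for the uniform case, as a
minimal Python sketch (the quadratic cost model is the pretend one from
above):

from numpy import arange, sqrt

n = arange(10)               # inputs 0..9, equally likely
prob = 1.0 / len(n)          # uniform PDF
t = n**2.0                   # pretend cost model: time = k*n^2, k = 1

mean_t = (prob * t).sum()
sd_t = sqrt((prob * (t - mean_t)**2).sum())
print 'mean = %.1f, sd = %.1f' % (mean_t, sd_t)   # mean = 28.5, sd = 26.9

# For the Gaussian case, replace prob by normalized weights w / w.sum(),
# with w = exp(-(n - m)**2 / (2.0*s**2)) for your chosen m and s.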

Without these, it's going to be trial and error.
From: Man. on
On Apr 5, 7:59 pm, "Peter Webb" <webbfam...(a)DIESPAMDIEoptusnet.com.au>
wrote:
> You want an analytic solution?
>
> You need two things:
>
> 1. The distribution of the input variable (the random number in this case)

1. Results 1 - 10 for distribution of the input variable. (0.24 seconds)

A distribution-free approach to inducing rank correlation among ...
This method is simple to use, is distribution free, preserves the
exact form of the marginal distributions on the input variables, and
may be used with any ...

Input Variable Importance Definition based on: A New Input Variable
Importance Definition.

Quantization of Continuous Input Variables for Binary Classification
is based on the distribution of the input variables.

> 2. The function which tells you how the time depends upon the input value.

2. Can you see that the words "is a function of" can be substituted by
the words "depends upon"?

> Consider the seive of Erastothenes. The algorithm varies as n^2 (maybe it
> doesn't; lets pretend).
>
> Pick a random number 0..9 with all cases equally likely. The PDF of this is
> k * n^2, you can calculate this and hence the mean and sd.
>
> Now generate 0..9 according to a Gaussian distribution G(n). The PDF of this
> is k*G(n)*n^2, again you can calculate this and hence the mean and sd.
>
> So for an analytic solution, you need to know:
>
> 1. The PDF of the input.
> 2. How the algorithm times depends on the input.
>
> Once you have these, you have at least an equation for the mean and sd, and
> you can probably solve it analytically.
>
> Without these, its going to be trial and error.

All you do is multiply the input value with 2, and add 2 to get
the ...

is doing the opposite (reverse) of what the machine tells you to ...

Time taken for a particular journey is a function of average ...
MMM