#1 Posted to rec.audio.opinion
Schizoid Man
Anyone have a statistics background?

I have a quick question...

I need a numerical approximation to Student's t distribution that is
accurate to double precision.

Unless I am completely mistaken, the TDIST() function in Excel behaves
very strangely.

It cannot calculate a value for x < 0, and the result of this function is
the pdf of the distribution, not the cdf, unlike NORMSDIST().

I realize that I could construct a better function using the BETADIST()
function, but I'm not sure whether this will alleviate or compound
whatever inaccuracies are already present in Excel.

Currently, the implementation I am leaning towards is the one in
Numerical Recipes.

Thanks in advance,
Schiz
#2 Posted to rec.audio.opinion

Schizoid Man wrote:
I need a numerical approximation to Student's t distribution that is
accurate to double precision. [snip]


You can use cdf_tdist from
http://members.aol.com/iandjmsmith/Examples.xls

However, for EXCEL 2003 cdf_tdist(x,df) = TDIST(-x,df,1) for x < 0.

The problem is that EXCEL only provides one tail for the distributions
they offer, so they appear to guess which tail users would most like
evaluated and go with that. You are lucky that the t-distribution is
symmetric.
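The symmetry being relied on here (one tail determines the other, which is why a one-tailed function like TDIST suffices) is easy to check numerically. Taking df = 1, where Student's t reduces to the Cauchy distribution with the closed-form cdf 0.5 + arctan(x)/pi, a minimal sketch:

```python
import math

def cauchy_cdf(x):
    # Student's t with 1 degree of freedom is the Cauchy distribution;
    # its cdf has the closed form 0.5 + arctan(x)/pi.
    return 0.5 + math.atan(x) / math.pi

# Symmetry: the left tail P(T <= -x) equals the right tail P(T >= x).
x = 2.0
left_tail = cauchy_cdf(-x)
right_tail = 1.0 - cauchy_cdf(x)
print(abs(left_tail - right_tail))  # agrees to rounding error
```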

On the whole TDIST appears to be reasonably accurate.

It does not work if the degrees of freedom are greater than 1e10, and
when the degrees of freedom are greater than 1e6 the relative error in
the values returned by TDIST can grow to about 1e-5.

It incorrectly returns 0 for TDIST(7e153,1,1). TDIST(2e154,1,1) returns
#NUM! which is much better.

However, these are undoubtedly extreme cases for the average user,
whatever that might mean.

Ian Smith

#5 Posted to rec.audio.opinion
Schizoid Man

Ian Smith wrote:

You can use cdf_tdist from
http://members.aol.com/iandjmsmith/Examples.xls

Thanks for the link, Ian.

It does not work if the degrees of freedom are greater than 1e10, and
when the degrees of freedom are greater than 1e6 the relative error in
the values returned by TDIST can grow to about 1e-5.


I think if the degrees of freedom are so high, then Student's t is
probably the wrong distribution to use in any case.

It should begin to converge with the normal for df > 100, so that would
be a better choice.

NORMDIST(), though very good, is still not up to spec for the
application I have in mind (with enough digits, the value for z=0 is
returned as 0.499999999781721 instead of exactly 0.5).

It incorrectly returns 0 for TDIST(7e153,1,1). TDIST(2e154,1,1) returns
#NUM! which is much better.

However, these are undoubtedly extreme cases for the average user,
whatever that might mean.

Ian Smith



#6 Posted to rec.audio.opinion


Schizoid Man wrote:
Bill Riel wrote:



I don't know about Excel, though I do recall that in the past some of
the statistical functions were questionable.

However, I've used the code from Numerical Recipes and it's worked well
- it should meet your precision requirements.


Thanks, Bill. That's exactly what I did. It's a bit convoluted, given
that you need a log-gamma function and an incomplete beta integral in
order to get the t, but probably worth it given the precision I am
striving for.


Watch out with NR algorithms. They have a nasty habit of not mentioning
where they fail. I know you are only using it for the t-distribution,
but betacf is slow and inaccurate both for large a and b and for small
a and b.

The implementation of gammln only guarantees an absolute error of less
than 2e-10. That means that when you exponentiate it the relative error
in the answer will be 2e-10. This is the best you can claim for your
overall answer for the cdf of the t distribution. I believe the GAMMALN
function in EXCEL2003 uses the same implementation or something very
similar.

While we are on similarities of code, I think EXCEL2003 uses equation
6.4.9 in NR, or similar, to calculate the t-distribution figures in
terms of the beta distribution.

Note, however, that as df gets large, df/(df+t^2) tends to 1, and the
BETADIST calculation in EXCEL suffers from cancellation errors when
evaluating 1-(df/(df+t^2)). For large df and negative t, it is better
to calculate cdf_tdist(t,df) as
comp_cdf_beta(t^2/(t^2+df),1/2,df/2) instead of
cdf_beta(df/(t^2+df),df/2,1/2).

Of course EXCEL does not offer a right tail evaluation for the beta
distribution and so cannot take advantage of this obvious method of
making the calculations more accurate.
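The cancellation described above is easy to reproduce in any double-precision environment. A small Python sketch (illustrative only, independent of Excel or the spreadsheet code) comparing the two ways of forming the beta argument:

```python
# For large df and small-to-moderate |t|, the beta argument df/(df+t^2)
# is close to 1, so forming 1 - df/(df+t^2) cancels catastrophically,
# while the algebraically equal t^2/(t^2+df) is computed in full precision.
df = 1e16
t = -1.0

x_cancel = 1.0 - df / (df + t * t)   # catastrophic cancellation
x_direct = t * t / (t * t + df)      # same quantity, rewritten

print(x_cancel)  # 0.0: every significant digit is lost
print(x_direct)  # about 1e-16: correct to double precision
```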

Mind you, the NR code does not offer a method of evaluating the right
tail for the beta distribution. So the NR algorithm does not appear to
have any advantages over the EXCEL2003 version.

With the example you gave for NORMDIST, you must be using an earlier
version of EXCEL. In that case, I would strongly recommend you use the
VBA code in http://members.aol.com/iandjmsmith/Examples.txt. This does
have code which evaluates the beta distribution for large and small
values of the parameters. Additionally it has separate code for the t
distribution which is more accurate and does not suffer from problems
such as df/(t^2+df) overflowing in the calculation of t^2 or
underflowing if calculated as df/t/(t+df/t). In other words it works.
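The overflow half of that problem is also easy to demonstrate. The following Python sketch is illustrative only (it is not the technique the VBA code actually uses); it shows how a log-space rewrite keeps the magnitude representable when t*t overflows:

```python
import math

df = 10.0
t = 1e200  # extreme but finite, like the TDIST examples quoted earlier

# t*t overflows to +inf, so the ratio collapses to 0 and all
# information about its true magnitude is lost.
naive = df / (t * t + df)

# Working with logarithms keeps the magnitude representable:
# log(df/(t^2+df)) = log(df) - 2*log|t| - log1p(df/t^2).
log_ratio = math.log(df) - 2.0 * math.log(abs(t)) - math.log1p(df / (t * t))
```

Here naive is 0.0, while log_ratio is a finite number near -919 from which the true order of magnitude of the ratio can still be recovered.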

Ian Smith

#7 Posted to rec.audio.opinion
Schizoid Man

Ian Smith wrote:

Watch out with NR algorithms. They have a nasty habit of omitting to
mention where they don't work. I know you are only using it for the
t-distribution but betacf is slow and inaccurate for both large a&b and
for small a&b.


I agree. The algorithms are very badly explained and the code is
extremely non-intuitive. However, the t-distribution algorithm looks
like it is fairly straightforward.

While we are on similarities of code, I think EXCEL2003 uses equation
6.4.9 in NR, or similar, to calculate the t-distribution figures in
terms of the beta distribution.


I heard that a lot of the statistics functions were re-tooled in Excel
2003. I am still on Excel 2002, and the TDIST() function is a joke, just
like NORMINV. Conversely, NORMDIST and TINV are not bad.

I believe NORMDIST is based on Abramowitz and Stegun, which I was
originally planning to use. However, I came across an algorithm by J.F.
Hart that is accurate to at least double precision.

Mind you, the NR code does not offer a method of evaluating the right
tail for the beta distribution. So the NR algorithm does not appear to
have any advantages over the EXCEL2003 version.


That's not a problem: once I get one tail calculated correctly, the
other tail follows by simple symmetry.

With the example you gave for NORMDIST, you must be using an earlier
version of EXCEL. In that case, I would strongly recommend you use the
VBA code in http://members.aol.com/iandjmsmith/Examples.txt. This does
have code which evaluates the beta distribution for large and small
values of the parameters. Additionally it has separate code for the t
distribution which is more accurate and does not suffer from problems
such as df/(t^2+df) overflowing in the calculation of t^2 or
underflowing if calculated as df/t/(t+df/t). In other words it works.


Thanks, Ian. I'll have a look. In case I need to make up a bibitem
reference, what should I use?
#8 Posted to rec.audio.opinion

Schizoid Man wrote:
[snip]
Thanks, Ian. I'll have a look. In case I need to make up a bibitem
reference, what should I use?


The NR code for the t-distribution is straightforward. The relationship
it is based on is well known (see A&S 26.7.1).
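For reference, that approach (log-gamma plus the incomplete beta continued fraction, then the A&S 26.7.1 relation) can be sketched in Python. This is a paraphrase for illustration, not the NR or Excel source, and it deliberately keeps the naive df/(df+t^2) argument whose cancellation and overflow problems are discussed elsewhere in the thread:

```python
import math

def betacf(a, b, x, max_iter=300, eps=3e-16, tiny=1e-300):
    # Continued fraction for the regularized incomplete beta function,
    # evaluated with the modified Lentz method (the role of NR's betacf).
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    if abs(d) < tiny:
        d = tiny
    d = 1.0 / d
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))           # even step
        d = 1.0 + aa * d
        if abs(d) < tiny:
            d = tiny
        c = 1.0 + aa / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        h *= d * c
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))  # odd step
        d = 1.0 + aa * d
        if abs(d) < tiny:
            d = tiny
        c = 1.0 + aa / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def betai(a, b, x):
    # Regularized incomplete beta I_x(a, b), switching tails so the
    # continued fraction is used where it converges quickly.
    if x <= 0.0 or x >= 1.0:
        return max(0.0, min(1.0, x))
    ln_bt = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
             + a * math.log(x) + b * math.log(1.0 - x))
    bt = math.exp(ln_bt)
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * betacf(a, b, x) / a
    return 1.0 - bt * betacf(b, a, 1.0 - x) / b

def t_cdf(t, df):
    # A&S 26.7.1: the two-tailed area A(t|df) relates to I_x(df/2, 1/2)
    # with x = df/(df + t^2); the lower tail then follows by symmetry.
    p = 0.5 * betai(0.5 * df, 0.5, df / (df + t * t))
    return p if t <= 0 else 1.0 - p
```

As a quick check, df = 1 reduces to the Cauchy distribution, so t_cdf(-1, 1) should come out close to 0.25.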

However, the code gives the unwary a warm feeling which is not
justified.

With the algorithm as given in NR,
for t=-2.3e-8, df=10 I get errors of 2e-8
for t=-8.4e-8, df=100 I get errors of 7e-8
for t=-2.3e-7, df=1000 I get errors of 2e-7
for t=-2.3e-4, df=1e9 I get errors of 2e-4

It is never going to be high precision code without the changes I
suggested and the changes you suggested. Your suggestion of using a
Normal approximation instead of the t distribution for large degrees of
freedom is a good one. A&S 26.7.8 offers a very good Normal
approximation.

My point is that the code as given in NR is not really very good. It
supposedly caters for people without a strong background in numerical
analysis and should not need to be examined closely and corrected for
its shortcomings. There is enough bad code about already and in my
opinion books like NR lead to an increase in bad code. I am not against
books with algorithms for non-experts in a subject. I just feel that
the standards of such code should be high - much higher than that
exhibited by the NR code.

I do not know much about the algorithm by J. F. Hart. I believe it is
similar to Algorithm 715 by W. J. Cody, used in DCDFLIB, Gnumeric...,
which is based on rational Chebyshev approximations. These are quicker
and more accurate than my cnormal function but are fixed precision
algorithms. In other words if I decide I need to work with quadruple
precision, I would need to work out a completely different set of
rational Chebyshev approximations whereas with cnormal I only have to
change the constants which control accuracy. It is also easier to check
that I have implemented the code in cnormal correctly, as it is based on
a couple of well-known formulae (see A&S 25.2.14 for the series; the
continued fraction is adapted from the modified version of Legendre's
continued fraction).
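The kind of series evaluation being described, where only a tolerance constant controls the precision attained, can be sketched with a classical series for the normal cdf. This is a guess at the general technique, not the cnormal code itself:

```python
import math

def norm_cdf_series(x, eps=1e-17):
    # Lower-tail standard normal cdf from the well-known series
    #   Phi(x) = 1/2 + phi(x) * (x + x^3/3 + x^5/(3*5) + x^7/(3*5*7) + ...),
    # which converges for all x but is practical only for moderate |x|.
    # To target a different precision, only eps needs changing -- the
    # advantage claimed for such formulae over fixed-precision fits.
    if x == 0.0:
        return 0.5
    term = x
    total = x
    n = 0
    while abs(term) > eps * abs(total):
        n += 1
        term *= x * x / (2 * n + 1)
        total += term
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return 0.5 + phi * total
```

For |x| up to a few units this agrees with 0.5*math.erfc(-x/sqrt(2)) to close to machine precision; a production version would switch to a continued fraction or asymptotic form in the tails.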

So I prefer arbitrary-precision algorithms where speed is not
absolutely crucial. However, I have included AS241, also based on
rational Chebyshev approximations, for inverting the normal
distribution function. It is quite easy to write arbitrary-precision
code to invert it, but the cdf of the normal distribution is inverted
frequently and I felt the speed improvement more than justified its
inclusion.

As regards information about my code, here is a link
(http://members.aol.com/iandjmsmith/Accuracy.htm) to some notes on some
of the theory on which my code is based and information about the
accuracy you can expect. There isn't any information there specifically
about the t distribution. The t-distribution was only included
originally because it was needed for the asymptotic expansion used for
the beta distribution. The code is complicated because in cdf_tdist, if
you comment out the df = Fix(df) statement, it also works for
non-integral degrees of freedom.

However for integral degrees of freedom, it is not difficult to do an
error analysis. The accuracy is proportional to the log of the answer
(assuming the answer is not 0 or 1). This is effectively the absolute
error in calculating df * 0.5 * log0(-(x/(x+df/x))) where log0(x) =
ln(1+x). |9*ln(answer)*2^-53| is, I believe, a conservative error
bound. For answers which are not 0 or 1, a global bound of 1e-12 will
be attained.
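The log0 function mentioned above, log0(x) = ln(1+x), is what C and Python expose as log1p; evaluating ln(1+x) that way rather than literally is what keeps the absolute error small when x is tiny. A minimal illustration (not the code under discussion):

```python
import math

x = 1e-15
# Forming 1.0 + x first rounds away most of x's digits, so the literal
# evaluation of ln(1+x) carries a large relative error...
literal = math.log(1.0 + x)
# ...while log1p evaluates ln(1+x) without forming 1+x explicitly.
accurate = math.log1p(x)

print(abs(literal / x - 1.0) > 1e-6)    # True: several digits lost
print(abs(accurate / x - 1.0) < 1e-12)  # True: essentially exact
```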

Even with modifications to the NR algorithm 2e-10 is the best that can
be claimed due to use of the gammaln function, until the degrees of
freedom are large enough to justify swapping to a Normal approximation
at which point you might be able to claim a bound of somewhere around
|2*ln(answer)*2^-53|.

I'm not very good at writing up all this sort of stuff - hence no long
lists of published papers, although I hope my notes referenced above
are of some use to anyone interested. I have had considerable help from
Jerry W. Lewis with testing and improvements to the code. In
particular, Jerry pointed out the problem that the value 0 was returned
for cdf_tdist(x,1) for x < -1e154 or so. Fortunately, Jerry is one of
the few people fussier than I am about getting things right!

Ian Smith
