Arny Krueger wrote
JBorg" wrote
If our empirical senses are faulty,
That's a scientific fact, which is well known.
If our senses were not faulty there would be no need for microscopes or
telephones.
what would be an example of a
properly controlled listening "test" which would circumvent this
problem ?
Please see www.pcabx.com for examples.
Please see Dr. Corbett's commentary about pcabx dated 11/27/04, RAO
Thread Title: Let's do some "scieenccece" in the Hive
http://tinyurl.com/4kkxa
**************
I had someone do an experiment for me using PCABX.
As soon as I saw the data, I knew something was wrong, as
the numbers from PCABX could not possibly be right.
It took only a few minutes to find these errors in Arny's
code:
...
ptable(12, 1) = 1.642
ptable(12, 2) = 0.2
ptable(13, 1) = 2.072
ptable(13, 2) = 0.25   <-- should be 0.15
ptable(14, 1) = 2.706
ptable(14, 2) = 0.2    <-- should be 0.1
ptable(15, 1) = 3.17
ptable(15, 2) = 0.075
...
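For anyone who wants to check those corrections: assuming the first
column holds chi-squared critical values for 1 degree of freedom and
the second the matching tail probabilities (consistent with the
chi-squared recipe discussed in item (2) below), a few lines of Python
reproduce the corrected column:

    from math import erfc, sqrt

    # Tail probability for chi-squared with 1 df:
    # P(Chi2_1 >= x) = erfc(sqrt(x / 2)).
    for x in (1.642, 2.072, 2.706, 3.17):
        print(f"P(Chi2_1 >= {x}) = {erfc(sqrt(x / 2)):.4f}")

    # Prints about 0.2000, 0.1500, 0.1000, 0.0750. Note that the
    # original 0.25 and 0.2 entries would break the monotone pattern,
    # which is how the numbers give themselves away at a glance.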
I sent Arny an e-mail reporting this, but I never got a reply to that
e-mail. (He had replied to other e-mail I had sent to that address before
that.)
Those typos are only part of the problem with PCABX.
I've been teaching college and university math classes for over thirty
years, so my BS detector is well calibrated. But its meter pegs when I
read what Arny says about scientific and technical issues involving
mathematics, statistics, and design of experiments. You know the feeling
when you are in a store and you overhear the salesman unloading a pile of
BS on an unsuspecting customer? It's pretty much the same whether it is
Radio Shack, or Best Buy, or Lafayette Radio, or an audiophile salon, and
Arny brings it to the Internet.
When someone follows Arny's advice on statistical design or analysis, you
know it is a double blind experiment---it's a case of the blind leading
the blind.
(1)
What Arny calls the "probability you were guessing" is apparently what the
rest of the world calls a "p-value". I wrote "apparently" because PCABX
cannot even calculate those numbers correctly; even if he had the right
numbers, Arny obviously does not understand what they mean.
In an ABX experiment, a p-value is calculated under the assumption that
the subject is guessing. For instance, if a subject gets 14 correct in 16
trials, we say p = .002 because IF someone is guessing (with 50% chance of
a correct answer on each trial) THEN the probability that he will get 14,
or 15, or 16 correct in 16 trials is approximately .002.
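There is nothing exotic in that arithmetic; a few lines of Python
reproduce it straight from the definition:

    from math import comb

    n, k = 16, 14
    # Upper tail of a fair-coin binomial: the chance that a pure
    # guesser gets at least 14 of 16 trials correct.
    p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    print(f"p = {p_value:.5f}")  # about 0.00209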
Arny has this bass-ackwards.
He claims that IF someone gets 14 correct THEN the probability is .002
that the person was guessing. Of course there is absolutely NO logical or
scientific support for that---it is entirely a result of Arny's failure to
comprehend what the calculations are about. The fact that Arny refers to
a p-value as a probability that the test subject was guessing is a dead
giveaway that he has no clue about how statistical science works.
(2)
There are several reasons why PCABX reports bogus numbers for p-values:
One reason is the typos I already mentioned.
Another is the fact that Arny based his calculations on part of what David
Carlstrom presented as the statistical basis for the original ABX
comparator. Carlstrom mentioned two tests---one was based on a binomial
distribution and a second was based on a chi-squared distribution.
The binomial approach leads to an exact solution for testing
H_0: theta = .5
vs
H_1: theta > .5
where theta is the single-trial probability of a correct answer. Thus
theta = .5 means the subject is guessing with the same chance of success
as flipping a fair coin, and theta > .5 means he is doing better than
that. That is an appropriate test if you want to see if a subject is
doing *better* than chance would cause him to do.
But Carlstrom made an error when he proposed the other test.
He described a chi-squared procedure that tests
H_0: theta = .5
vs
H_1: theta not equal to .5.
Now this compares chance behavior to *different-from-chance* performance.
Since that includes theta > .5 as well as theta < .5, the numbers
generated this way are off by a factor of two from what would be
comparable to the binomial test. This is obvious to anyone with real
statistical training, but not to someone who naively copied a formula out
of a book and coded it into a computer program. Of course a competent
statistician would know how to adapt that chi-squared procedure to the
sort of test that Carlstrom described with his binomial plan. Arny's
PCABX uses the flawed chi-squared approach, so his calculations are
biased; PCABX reports larger p-values (hence less-significant results)
than it should. (That error is not quite as far off as a factor of two
because there are other errors from approximating a discrete distribution
by a continuous one; since they are in the opposite directions, the errors
partially cancel.)
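The size of that bias is easy to exhibit for the 14-of-16 example
above. The Pearson statistic and its 1-df tail below are textbook
material, nothing PCABX-specific:

    from math import comb, erfc, sqrt

    n, k = 16, 14

    # Exact one-sided binomial test: P(X >= k) under theta = 0.5.
    p_exact = sum(comb(n, i) for i in range(k, n + 1)) / 2**n

    # Two-sided Pearson chi-squared test of theta = 0.5 (1 df).
    expected = n / 2
    chi2 = (k - expected)**2 / expected + ((n - k) - expected)**2 / expected
    p_chi2 = erfc(sqrt(chi2 / 2))  # P(Chi2_1 >= chi2)

    print(f"exact one-sided binomial: {p_exact:.5f}")  # about 0.00209
    print(f"two-sided chi-squared:    {p_chi2:.5f}")   # about 0.00270

    # Larger, but not doubled: the continuity error pushes the
    # other way and partially cancels the factor of two.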
To see this effect search Google Groups for the Usenet article with
Subject: Statistics and PCABX (was weakest Link in the Chain)
Newsgroups: rec.audio.high-end
Date: 2004-01-13
(3)
Yet another issue is that some of the numbers PCABX returns are not
calculated by standard procedures at all. Although Arny claims that PCABX
follows recognized scientific practice, the fact is that some of the
numbers PCABX returns are pure fabrication. Maybe because Arny did not
understand what a p-value is, or maybe because he did not realize that he
based his calculations on an inappropriate method, PCABX reports p-values
of 1 when the observed data show correct answers on fewer than half the
trials. This is NOT a standard calculation based on techniques in any
textbook I'm aware of. It also does not agree with the methods described
in
http://www.pcavtech.com/abx/abx_p9.htm which Arny cited earlier in this
thread as an authoritative reference. If Arny has a specific citation of
a reference showing how someone with pencil and paper (and perhaps a
simple calculator) can duplicate the numbers PCABX comes up with, I'd like
to see it.
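For the record, the standard calculation is perfectly well defined for
below-chance data; it simply yields a large p-value, not 1. Take 6
correct in 16 trials (a score I made up for illustration):

    from math import comb

    n, k = 16, 6
    # Standard one-sided p-value for a below-chance score.
    p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    print(f"p = {p_value:.4f}")  # about 0.8949, nowhere near 1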
So it's clear that the analysis side of PCABX is broken in many ways.
It is also the case that the experimental design part has problems.
Although much effort went into refining experimental technique, there
appears to be very little awareness of the rest of experimental design.
Arny's Ten Commandments^H^H^H^H^H^H^H^H^H^H^H^HRequirements are NOT
sufficient to make a good listening experiment. No matter how well you
try, the reality is that if a test has only one trial, there is a 50% type
I error risk. The ONLY way to reduce that is statistical---you need more
trials. Once you do that, there is the issue of how many trials to do,
and how many of those are needed to pass the test. PCABX suggests 14
correct in 16 trials, even though that is a really bad choice.
If the effect being tested is small, say near threshold, then the 14/16
test will usually (80% of the time) _fail_ to detect a real effect.
If the effect is large, then 16 trials is wasteful. A test with far fewer
trials may be adequate then. There are plenty of designs that are better
than 14/16, but it would be hard to find one that is worse.
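That failure rate is easy to verify. Taking theta = 0.75 as a stand-in
for a small-but-real effect (my choice, purely for illustration):

    from math import comb

    n, k, theta = 16, 14, 0.75
    # Chance that a listener with a 75% per-trial success rate
    # manages at least 14 of 16 correct.
    power = sum(comb(n, i) * theta**i * (1 - theta)**(n - i)
                for i in range(k, n + 1))
    print(f"power = {power:.3f}")  # about 0.197

    # So the 14/16 rule misses a real effect of this size roughly
    # 80% of the time.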
Once again, Arny gets it backwards. He starts with 16 trials, then picks
14 (it used to be 12) as a passing score. Of course a rational design
might start with specified levels for type I and type II error risks, and
then determine a sample size to achieve that performance.
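Such a design takes only a short search. As a sketch, with alpha <= .05
and power >= .80 against theta = 0.75 as illustrative targets:

    from math import comb

    def tail(n, k, theta):
        # P(X >= k) for X ~ Binomial(n, theta).
        return sum(comb(n, i) * theta**i * (1 - theta)**(n - i)
                   for i in range(k, n + 1))

    alpha, beta, theta1 = 0.05, 0.20, 0.75
    for n in range(1, 101):
        # Smallest passing score keeping the type I error at or below alpha.
        k = next(k for k in range(n + 2) if tail(n, k, 0.5) <= alpha)
        if tail(n, k, theta1) >= 1 - beta:
            print(f"n = {n} trials, pass with {k} correct")
            break

    # Finds n = 23, k = 16: type I error about .047, power about .80.

Note how the sample size falls out of the error-rate targets, instead
of being fixed at 16 in advance.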
For a graduated collection of tests, such as would be the case if the
links in the table near the bottom of
http://www.pcabx.com/training/index.htm actually worked, we would need
only a few trials for the easy samples but many more for the harder ones
if we wanted comparable sensitivity of the tests. Using the same number
of trials for different levels means that the tests do not have the same
power (sensitivity); the result is that subjects will seem to have a
threshold-style response even if their true response were a linear function
of stimulus level. If the true response has a threshold then it is
confounded with the test's power function, making interpretation of the
results difficult. This is analogous to measuring a decreasing signal
with a meter. As the signal level drops, the meter needs to be adjusted
to read on a lower range (more sensitive) scale. If that is not done, a
naive user may "see" that below some point there is apparently no response
when actually there is some response below the current meter range. Using
a fixed size of 16 trials over a broad range of stimulus levels will cause
that sort of error, yet that is precisely what PCABX says to do.
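The numbers make the artifact plain. Sweeping the true per-trial
success rate theta through the fixed 14-of-16 rule (the grid of theta
values is mine, for illustration):

    from math import comb

    n, k = 16, 14
    # Probability of passing 14/16 as true ability varies.
    for theta in (0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95):
        p_pass = sum(comb(n, i) * theta**i * (1 - theta)**(n - i)
                     for i in range(k, n + 1))
        print(f"theta = {theta:.2f}: P(pass) = {p_pass:.3f}")

    # Crawls from under .01 at theta = .55 to about .10 at .70, then
    # jumps through .35 at .80 to .79 at .90: a threshold-shaped
    # readout even when true ability improves smoothly.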
The statistical science in PCABX is Completely Ridiculous & Absolutely
Preposterous, which we can abbreviate as CRAP.
Lest anyone get the wrong impression, I want to be clear that I am in
favor of properly-done scientific tests. ABX and similar tests can be
properly done, but merely using an ABX data collection plan is no
guarantee of a worthwhile experiment. A worthwhile experiment requires
competent statistical design and analysis along with good experimental
technique. No part is sufficient---all these are necessary. No matter
how good the other parts are, if the statistical aspects are bungled, the
experiment is ruined. Now I do not claim that good statistical practice
is enough to make a successful experiment, but I do argue that failing to
get the statistical stuff right is enough to botch the experiment. It is
much the same as noting that neither level matching nor time-synchronizing
nor blinding alone will make a good experiment, but missing any one can
easily ruin an otherwise-okay experiment.
*************************************
End report.