  #1   Report Post  
Phil
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)


"George M. Middius" wrote in message
...


Phil said:

So who is the real fraud? Arny, who has been claiming for years that his ABX
test proved that people can't hear things and then he pontificates that he
always knew that he couldn't prove a negative, or myself, who gives
detailed mathematical reasons for why Arny's test doesn't work.


You can't defeat religious faith with logic or science. Call a 'borg a 'borg.

I hate to disagree, George, but Arny's response is not religious. It is rather
common among engineers and even more so among technicians. It is not unusual
for technical people to have their egos tied up with technical knowledge.
Attack that knowledge and you attack their soul, what it is that makes them,
them. In many cases they are right, but they're wrong too. The problem comes in
that when there is a technical debate, when the argument gets heavy, they
take it too personally and much too seriously.
My view is simple: it is a hobby, enjoy yourself. I can't tell you what to
feel; I can only give you my technical opinion based on my background. It is
up to you to make any decision you want that makes you happy.
I will give you my opinion and you will give yours and we will agree to
disagree. If you treat people respectfully you rarely run into many
problems.
However, there are some whose egos are too wrapped up in the issue, and no
politeness or fair treatment will cause them to act reasonably. There are
individuals like this on both sides, but unfortunately they are more common on
the technical side.

Phil


  #2   Report Post  
Arny Krueger
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

Phil wrote:


I hate to disagree, George, but Arny's response is not religious. It is
rather common among engineers and even more so among technicians. It
is not unusual for technical people to have their egos tied up with
technical knowledge. Attack that knowledge and you attack their soul, what
it is that makes them, them. In many cases they are right, but they're
wrong too. The problem comes in that when there is a technical
debate, when the argument gets heavy, they take it too personally and
much too seriously.


So Phil, are you saying that you're not an engineer or technical person of
any kind?

Why do you think that you are immune to this purported problem of tying your
ego up with technical knowledge?


  #3   Report Post  
Phil
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)


"Arny Krueger" wrote in message
...
Phil wrote:


I hate to disagree, George, but Arny's response is not religious. It is
rather common among engineers and even more so among technicians. It
is not unusual for technical people to have their egos tied up with
technical knowledge. Attack that knowledge and you attack their soul, what
it is that makes them, them. In many cases they are right, but they're
wrong too. The problem comes in that when there is a technical
debate, when the argument gets heavy, they take it too personally and
much too seriously.


So Phil, are you saying that you're not an engineer or technical person of
any kind?


No Arny, I said it is common, that doesn't mean every engineer is so
immature. I normally try to have respect for people even if I disagree with
them. However, if they are as rude and childish as you are it gets rather
difficult.
The above is a rather cheap shot considering I was cutting you a break, but
you never miss a chance to show, once again, that you have no class.

Why do you think that you are immune to this purported problem of tying your
ego up with technical knowledge?



I'm not immune. When a third-rater like you starts condescending to me, it
is a bit difficult to hold back. But one tries to have a bit of class even
when facing someone as rude and nasty as you. It is difficult, but I try,
and you don't, as your above post demonstrates.

Phil



  #4   Report Post  
ff123
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

On Tue, 01 Jun 2004 16:28:56 -0500, (John
Corbett) wrote:

But that is _not_ an inherent flaw in ABX testing---it is a result of too
small a sample size. You can spot threshold-level performance with 1%
error rates if you are willing to use a large enough sample. A test with
n = 80 (and getting at least 51 correct) will do that.
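
As a quick sanity check, here is a short Python snippet (standard library
only) that computes both error rates for that design. The theta = 0.75 used
for the type II calculation is only an assumed stand-in for "threshold-level"
performance; the quoted text doesn't pin down a value.

    from math import comb

    def tail(n, k, theta):
        # P(X >= k) for X ~ Binomial(n, theta)
        return sum(comb(n, i) * theta**i * (1 - theta)**(n - i)
                   for i in range(k, n + 1))

    n, k = 80, 51
    alpha = tail(n, k, 0.5)        # type I error: a guesser scores 51+ of 80
    beta = 1 - tail(n, k, 0.75)    # type II error at an assumed theta of 0.75
    print(alpha, beta)             # both come out just under 0.01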


For an individual, repeating many trials to simulate a large N is
impractical, and may well reduce listening sensitivity due to fatigue.
PC-ABX, being targeted towards individual listeners, is reasonably
focused more on reducing type I errors.

If a large group of people were to participate in a test, though, it
could become practical to try to reduce type II errors.

I might add here that there is a potential problem with the way that
some individual, repeated-trials, ABX tests might be carried out, and
that is knowing whether or not each individual trial is
correct/incorrect and basing decisions to stop a test using that
information. The binomial statistics commonly quoted are applicable
only if the number of trials is fixed prior to the test.

There is a test using fewer than 16 trials that controls type I error
below .01 yet is more powerful (i.e., more sensitive) than Arny's
preferred 14-of-16 test against every alternative theta. Such a test not
only is more sensitive than Arny's, but it also saves time, and doesn't
fatigue the subjects as much.


ABX has an important advantage over other, more statistically
efficient types of tests, such as the triangle test, or "odd man out"
-- its relative simplicity. The less complex the task, the more
likely subtleties can be discerned.

On the other hand, there are some properties of music, such as
loudness, which might be more sensitively measured using a
2-Alternative Forced Choice rather than an ABX. This is because in
such a test, there is one distinct quality (volume) which is being
evaluated.

But if there are many different qualities which could contribute to
one source of music possibly sounding "different from" another source
(as is typical in most musical reproduction sources), then ABX is
appropriate as a method to determine if there is a "real" difference.

ff123
  #5   Report Post  
Arny Krueger
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

Phil wrote:
"Arny Krueger" wrote in message
...
Phil wrote:


I hate to disagree, George, but Arny's response is not religious. It
is rather common among engineers and even more so among
technicians. It is not unusual for technical people to have their
egos tied up with technical knowledge. Attack that knowledge and you attack
their soul, what it is that makes them, them. In many cases they are
right, but they're wrong too. The problem comes in that when there
is a technical debate, when the argument gets heavy, they take it too
personally and much too seriously.


So Phil, are you saying that you're not an engineer or technical
person of any kind?


No Arny, I said it is common, that doesn't mean every engineer is so
immature. I normally try to have respect for people even if I
disagree with them. However, if they are as rude and childish as you
are it gets rather difficult.


The above is a rather cheap shot considering


It is far less than repayment in kind.

I was cutting you a break,


As in a broken neck.

but you never miss a chance to show, once again, that you have no class.


Phil, do you seriously think that your many cheap shots make you look
classy?

Why do you think that you are immune to this purported problem of
tying your ego up with technical knowledge?


I'm not immune. When a third rater like you starts condescending to
me, it is a bit difficult to hold back.


Phil, as if you haven't been condescending towards me since day one.

But one tries to have a bit
of class even when facing someone as rude and nasty as you.


Phil, as if you haven't been condescending and nasty towards me since day
one.

It is difficult, but I try, and you don't, as your above post demonstrates.

Phil, you have absolutely no self-awareness.





  #6   Report Post  
Arny Krueger
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

George M. Middius wrote:
ff123 said:

But that is _not_ an inherent flaw in ABX testing---it is a result
of too small a sample size. You can spot threshold-level
performance with 1% error rates if you are willing to use a large
enough sample. A test with n = 80 (and getting at least 51
correct) will do that.


For an individual, repeating many trials to simulate a large N is
impractical, and may well reduce listening sensitivity due to
fatigue.



This is an audio newsgroup. Get that mental masturbation **** off this
forum NOW!


If irony killed!


  #7   Report Post  
Arny Krueger
 
Posts: n/a
Default Sluts on Alert!

George M. Middius wrote:
La Salope, keeper of the faith......

The question is


Where's the religion jokes, Slut? Admit you were lying. We all know
you can't possibly make a religion joke if your beloved Krooger might
see it.


This is the same George Middius who writes:

"This is an audio newsgroup. Get that mental masturbation **** off this
forum NOW!"

Hypocrisy, anybody?


  #8   Report Post  
John Corbett
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

In article , ff123
wrote:


For an individual, repeating many trials to simulate a large N is
impractical, and may well reduce listening sensitivity due to fatigue.
PC-ABX, being targeted towards individual listeners, is reasonably
focused more on reducing type I errors.


If there were more consideration of power and sample size, I might agree.
As it is, I'd have to say that PC-ABX appears fixated on reducing type I
error, and for the most part ignores type II error.


If a large group of people were to participate in a test, though, it
could become practical to try to reduce type II errors.


That is not as simple as it might appear at first. Under the null
hypothesis (just guessing) the panel's scores behave as if they were from
a single subject. That also holds in the case where the population hears
perfectly. But for type II error calculations, you want to consider
intermediate values, and surprises await the unwary.

Consider a population and a test signal such that half the people hear it
perfectly, and the other half are just guessing. Select subjects at
random from that population, and administer ABX tests. Here are the type
II errors you expect:

For a 12-of-16 ABX test:

 1 listener doing 16 trials        .481
 4 subjects each doing 4 trials    .398
16 subjects each doing 1 trial     .370

So the panel does a bit better than a single listener.

For a 14-of-16 ABX test:

 1 listener doing 16 trials        .499
 4 subjects each doing 4 trials    .706
16 subjects each doing 1 trial     .803

Now the panel does worse than a single listener!
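
Those beta values can be reproduced with a few lines of Python (standard
library only); this is just a sketch of the mixed-population calculation
described above, not anything taken from PC-ABX.

    from math import comb

    def tail(n, k, p):
        # P(X >= k) for X ~ Binomial(n, p)
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    def panel_beta(subjects, trials_each, cutoff):
        # Half the population hears perfectly, half guesses; subjects drawn at random.
        # j = number of perfect listeners on the panel; they contribute j * trials_each
        # correct answers, and the remaining subjects are coin flips.
        p_pass = 0.0
        for j in range(subjects + 1):
            p_j = comb(subjects, j) * 0.5**subjects      # P(exactly j perfect listeners)
            need = max(cutoff - j * trials_each, 0)      # correct answers still needed
            p_pass += p_j * tail((subjects - j) * trials_each, need, 0.5)
        return 1.0 - p_pass                              # type II error

    for cutoff in (12, 14):
        for subjects, trials in ((1, 16), (4, 4), (16, 1)):
            print(cutoff, subjects, trials, round(panel_beta(subjects, trials, cutoff), 3))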

I might add here that there is a potential problem with the way that
some individual, repeated-trials, ABX tests might be carried out, and
that is knowing whether or not each individual trial is
correct/incorrect and basing decisions to stop a test using that
information. The binomial statistics commonly quoted are applicable
only if the number of trials is fixed prior to the test.


That's absolutely correct.

People can fall into this inadvertently; sometimes a listener gets a score
that's almost significant, and then decides to do another 16 trials to
combine with the first batch in hopes of getting a significant result. Of
course that's bogus.

If you're going to do sequential tests, you need to do something like
Wald's Sequential Probability Ratio Test. Since we're administering tests
with a computer, we can easily do the updating and scoring, so a
Wald-style testing plan makes a lot of sense.
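
For the curious, here is a minimal sketch of what a Wald-style plan could
look like for a stream of ABX trials, in Python. The p1 = 0.9 "listener
really hears it" rate and the alpha/beta targets are assumptions chosen for
illustration, not values taken from any existing comparator.

    from math import log

    alpha, beta, p1 = 0.01, 0.05, 0.9          # designed error rates; assumed H1 rate
    upper = log((1 - beta) / alpha)            # cross this -> conclude a difference is heard
    lower = log(beta / (1 - alpha))            # cross this -> results look like guessing

    def sprt(outcomes):
        llr = 0.0                              # running log-likelihood ratio
        for n, correct in enumerate(outcomes, start=1):
            llr += log(p1 / 0.5) if correct else log((1 - p1) / 0.5)
            if llr >= upper:
                return n, "difference heard"
            if llr <= lower:
                return n, "consistent with guessing"
        return len(outcomes), "no decision yet"

    print(sprt([True] * 10))   # with these settings, 8 straight correct answers decide it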


There is a test using fewer than 16 trials that controls type I error
below .01 yet is more powerful (i.e., more sensitive) than Arny's
preferred 14-of-16 test against every alternative theta. Such a test not
only is more sensitive than Arny's, but it also saves time, and doesn't
fatigue the subjects as much.


ABX has an important advantage over other, more statistically
efficient types of tests, such as the triangle test, or "odd man out"
-- its relative simplicity. The less complex the task, the more
likely subtleties can be discerned.


You can easily devise a test that keeps all the advantages of ABX while
still being more sensitive and taking less time and causing less listener
fatigue than the usual ABX.

Just do an ABX test with 12 correct in 14 trials as the cutoff; this test
has type I error under .01 and has type II error less than that of the
14/16 test for every alternative choice of theta.
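
That claim is easy to check numerically; here is a rough comparison of the
two designs in Python (standard library only) at a few alternative values of
theta.

    from math import comb

    def tail(n, k, theta):
        # P(X >= k) for X ~ Binomial(n, theta)
        return sum(comb(n, i) * theta**i * (1 - theta)**(n - i)
                   for i in range(k, n + 1))

    # Type I error (theta = 0.5) for each design
    print(tail(14, 12, 0.5))   # 12-of-14: about 0.0065
    print(tail(16, 14, 0.5))   # 14-of-16: about 0.0021

    # Power (1 - type II error) at some alternative theta values
    for theta in (0.6, 0.7, 0.8, 0.9):
        print(theta, tail(14, 12, theta), tail(16, 14, theta))
        # the 12-of-14 column comes out higher each time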



It is often claimed that for a given individual, performance is either
"random" or "nearly perfect"; that is, there is a very rapid transition
between essentially two states as the stimulus passes a threshold value.
This is a plausible model based on our understanding of the physiology of
hearing. (People suggesting this approach seem to think "random" means random
with success probability theta = .5, forgetting that other values of
theta are also associated with random behavior.)

Now if you really believe that model is correct, then you can do much
smaller experiments. For example, with both alpha (type I) and beta (type
II) levels .05, you need only get 5 correct in 5 trials to separate theta
= .5 from theta = 1. In fact, this even works for theta = .99.

With alpha and beta .01, you can get by with 7 correct in 7 trials (theta
= 1) or 10 correct in 11 trials (theta = .99 or better). No need for 16
trials.

Of course these tests are not sensitive to smaller values of theta, so
their use depends heavily on the model being correct.

However, even if you are interested only in testing theta = .5 versus
theta nearly 1, you should do tests to show that values such as .75 don't
really occur. So even if you believe the threshold model, you have to do
some tests of theta = .5 versus theta = .75, and then you're right back to
where ABX is not especially sensitive without large samples.

You need big samples to test for small effects.


JC
  #9   Report Post  
ff123
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

On Wed, 02 Jun 2004 12:12:19 -0500, (John
Corbett) wrote:

In article , ff123
wrote:


For an individual, repeating many trials to simulate a large N is
impractical, and may well reduce listening sensitivity due to fatigue.
PC-ABX, being targeted towards individual listeners, is reasonably
focused more on reducing type I errors.


If there were more consideration of power and sample size, I might agree.
As it is, I'd have to say that PC-ABX appears fixated on reducing type I
error, and for the most part ignores type II error.


Come to think of it, I don't recall ever seeing a test (of speaker
cable or amplifier sound, or what have you) in which type II errors
were specifically discussed in the analysis, so you may be correct in
your assessment.

One thing's for sure: a lot of people have misinterpreted ABX results
and their meaning over the years, especially with regard to type II
errors. The sentiment "if you can't ABX it, it's not real" is common
in some circles. PC-ABX and its cousins (including my own
application) should have an entry box to allow theta to be varied, and
should show the corresponding type II errors instead of just the type I
errors. However, this requires even more statistical sophistication
than usual, and the usual is often already too high a requirement for the
typical interested listener. Keeping it simple has its advantages.

If you're going to do sequential tests, you need to do something like
Wald's Sequential Probability Ratio Test. Since we're administering tests
with a computer, we can easily do the updating and scoring, so a
Wald-style testing plan makes a lot of sense.


I have considered adding the ratio test to my application
(http://ff123.net/abchr/abchr.html), but was not thrilled with the
resulting increase in trials needed to drive down the type I errors.
The problem can also be approached from a frequentist viewpoint, where
the maximum number of total trials is fixed, and test endpoints are
strictly enforced by the application. This minimizes the trials
needed to achieve a desired alpha in a sequential test, but I never
implemented this solution either. Mea culpa.

You can easily devise a test that keeps all the advantages of ABX while
still being more sensitive and taking less time and causing less listener
fatigue than the usual ABX.

Just do an ABX test with 12 correct in 14 trials as the cutoff; this test
has type I error under .01 and has type II error less than that of the
14/16 test for every alternative choice of theta.


PC-ABX does not specifically limit the number of trials to 16. This
is a simple change to Arny's website recommendation, where he says:

"More specifically, you should try for 14 correct answers out of 16
trials."

ff123
  #10   Report Post  
John Corbett
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

In article , ff123
wrote:


One thing's for sure: a lot of people have misinterpreted ABX results
and their meaning over the years, especially with regard to type II
errors. The sentiment "if you can't ABX it, it's not real" is common
in some circles.




Another common misinterpretation in ABX tests involves p-values.

Often the p-value (the probability of obtaining results at least as
extreme as actually observed, under the assumption that the null
hypothesis is true) is presented as the probability that the null
hypothesis is true (given the observed data).
That interpretation is totally bogus.
When you see someone describe a p-value as the probability that a subject
was guessing you can be sure that someone does not know what he's talking
about.

In other words, a p-value is _not_ the "Probability You Were Guessing".
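
For anyone who wants the legitimate quantity in concrete terms, here is a
minimal Python illustration of the one-sided binomial tail (the 14-of-16
score below is just an example):

    from math import comb

    def abx_p_value(n, k):
        # If the listener were guessing (theta = 0.5), the probability of
        # getting at least k of n trials correct.
        return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

    print(abx_p_value(16, 14))   # about 0.002 -- not "the probability the listener was guessing"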




PC-ABX and its cousins (including my own
application) should have an entry box to allow theta to be varied, and
should show corresponding type II errors instead of just the type I
errors.


The time to think about Type II error is before running the experiment.
Once you have data it's too late to worry about power for that experiment.





You can easily devise a test that keeps all the advantages of ABX while
still being more sensitive and taking less time and causing less listener
fatigue than the usual ABX.

Just do an ABX test with 12 correct in 14 trials as the cutoff; this test
has type I error under .01 and has type II error less than that of the
14/16 test for every alternative choice of theta.


PC-ABX does not specifically limit the number of trials to 16. This
is a simple change to Arny's website recommendation, where he says:

"More specifically, you should try for 14 correct answers out of 16
trials."

ff123




IIRC, PC-ABX allows tests with 5--100 trials, but the typical user is
going to do a test with 16 trials because that is suggested as a standard,
even though better choices are available.

Remember that the suggestion to use 16 trials is rooted in a hardware
limitation of the original ABX box; it has no real statistical basis.

You need to choose the number of trials and the cutoff value in light of
Type I and Type II errors (perhaps with Bonferroni or similar adjustments
for multiple tests in some experiments).

The one-size-fits-all approach of fixing sample size first is a recipe for
tests that either have little chance of success (low power) or are
wasteful (taking more time and unnecessarily fatiguing listeners).

It makes about as much sense as advising people to always set record level
at "9" on a dial, because someone did that once and it worked okay, so
that's what everyone should do for all their recordings, no matter what
mics, other equipment, room, program, performers, ...

JC


  #11   Report Post  
Arny Krueger
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

John Corbett wrote:

IIRC, PC-ABX allows tests with 5--100 trials, but the typical user is
going to do a test with 16 trials because that is suggested as a
standard, even though better choices are available.


Depends on how you define "Better choice".

Remember that the suggestion to use 16 trials is rooted in a hardware
limitation of the original ABX box; it has no real statistical basis.


Nope. 16 trials was originally suggested because it represents a nice
compromise between sensitivity and significance.

You need to choose the number of trials and the cutoff value in light
of Type I and Type II errors (perhaps with Bonferroni or similar
adjustments for multiple tests in some experiments).


See former comments about the need for a balance between sensitivity and
significance.

The one-size-fits-all approach of fixing sample size first is a
recipe for tests that either have little chance of success (low
power) or are wasteful (taking more time and unnecessarily fatiguing
listeners).


Which ignores a third present danger, which is a test that appears to have
statistically significant results, but is itself a chance occurrence.

It makes about as much sense as advising people to always set record
level at "9" on a dial, because someone did that once and it worked
okay, so that's what everyone should do for all their recordings, no
matter what mics, other equipment, room, program, performers, ...


Posturing and trolling, anybody?


  #13   Report Post  
ff123
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

On Thu, 03 Jun 2004 10:15:51 -0700, ff123 wrote:

Another common misinterpretation in ABX tests involves p-values.

Often the p-value (the probability of obtaining results at least as
extreme as actually observed, under the assumption that the null
hypothesis is true) is presented as the probability that the null
hypothesis is true (given the observed data).
That interpretation is totally bogus.
When you see someone describe a p-value as the probability that a subject
was guessing you can be sure that someone does not know what he's talking
about.

In other words, a p-value is _not_ the "Probability You Were Guessing".


lol! You caught a lingering text error in my application which I've
been meaning to change, but haven't. I may have carried over the text
from PC-ABX. The interpretation of the p-value should read something
like: "probability of achieving result by chance"

ff123


Ack, this isn't quite correct, either. It's more like, "if the
listener is guessing, the probability of getting at least this many
correct." But that's kind of verbose for a description.

ff123
  #14   Report Post  
ff123
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

On Thu, 3 Jun 2004 12:41:50 -0400, "Arny Krueger"
wrote:

Posturing and trolling, anybody?


Arny, you need to step back and remove yourself from the noise that is
the vast majority of RAO -- it seems to have corrupted your judgment
about what is truly useful commentary about ABX. First of all, you
might consider John Corbett's email address before you accuse him of
posturing in an area (statistics) where he is obviously an expert.

Here are some of the valid criticisms raised in this thread:

1. One size fits all (14/16 correct) testing is wasteful in the case
of testing for obvious defects and insufficient for saying that subtle
defects are inaudible, depending on the model assumed for drawing the
line between audible/inaudible. If you must assume a "standard" size,
12/14 would be a better compromise between 0.01 significance for type
I errors and minimizing type II errors.

2. In general, type II errors are ignored on the websites which
popularize ABX testing, yours and mine included. There is not enough
emphasis on pre-planning tests, with specific objectives in mind that
drive alpha and beta values, number of samples/trials needed, etc.

3. Sequential effects are ignored in all current ABX implementations
which allow the listener to see in-progress results. Either trials
must be fixed prior to the test, or the listener must be denied the
ability to stop a test based on disclosed in-progress results, or
Wald's ratio test should be used.

4. The proper description of the p-value should be used.

ff123
  #15   Report Post  
Arny Krueger
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

ff123 wrote:
On Thu, 3 Jun 2004 12:41:50 -0400, "Arny Krueger"
wrote:

Posturing and trolling, anybody?


Arny, you need to step back and remove yourself from the noise that is
the vast majority of RAO -- it seems to have corrupted your judgment
about what is truly useful commentary about ABX. First of all, you
might consider John Corbett's email address before you accuse him of
posturing in an area (statistics) where he is obviously an expert.

Here are some of the valid criticisms raised in this thread:


1. One size fits all (14/16 correct) testing is wasteful in the case
of testing for obvious defects and insufficient for saying that subtle
defects are inaudible, depending on the model assumed for drawing the
line between audible/inaudible. If you must assume a "standard" size,
12/14 would be a better compromise between 0.01 significance for type
I errors and minimizing type II errors.


The posturing and trolling is that "one size fits all" is a relevent point
for discussion. The PCABX Compator hables up to 100 trials.

2. In general, type II errors are ignored on the websites which
popularize ABX testing, yours and mine included. There is not enough
emphasis on pre-planning tests, with specific objectives in mind that
drive alpha and beta values, number of samples/trials needed, etc.


I don't know about your web site, but PCABX does address type II errors a
number of different ways. One major deficiency of the current discussion is
that it is entirely within the context of statistics - number of trials, how
the results of those trials are analyzed, etc.

3. Sequential effects are ignored in all current ABX implementations
which allow the listener to see in-progress results. Either trials
must be fixed prior to the test, or the listener must be denied the
ability to stop a test based on disclosed in-progress results, or
Wald's ratio test should be used.


In-progress results display is an option (you can turn it on or off) of the
PCABX Comparator, and there is a discussion on the web site of some of the
advantages and disadvantages of using it.

4. The proper description of the p-value should be used.

ff123





  #16   Report Post  
MINe 109
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

In article ,
"Arny Krueger" wrote:

The PCABX Compator hables up to 100 trials.


Good to know...

Stephen
  #17   Report Post  
James Tannenbaum
 
Posts: n/a
Default A/B/X Testing

Arny Krueger wrote:

The PCABX Compator hables up to 100 trials.


Do you do drugs too? Tsk, what a mess you are.
  #18   Report Post  
Arny Krueger
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

ff123 wrote:

In other words, a p-value is _not_ the "Probability You Were
Guessing".


I may have carried over the text
from PC-ABX. The interpretation of the p-value should read something
like: "probability of achieving result by chance"


Hey, let's write 100 scholarly papers, comparing and contrasting:

"Probability You Were Guessing".

and

"probability of achieving result by chance"

That seems like the most important thing to do.





  #19   Report Post  
Arny Krueger
 
Posts: n/a
Default A/B/X Testing

James Tannenbaum wrote:
Arny Krueger wrote:

The PCABX Compator hables up to 100 trials.


Do you do drugs too? Tsk, what a mess you are.


Thanks for taking advantage of the opportunity to show what a hypocrite you
are, Tannenbaum or whoever you really are.


  #20   Report Post  
James Tannenbaum
 
Posts: n/a
Default A/B/X Testing

Arny Krueger wrote:
James Tannenbaum wrote:

Arny Krueger wrote:


The PCABX Compator hables up to 100 trials.


Do you do drugs too? Tsk, what a mess you are.



Thanks for taking advantage of the opportunity to show what a hypocrite you
are, Tannenbaum or whoever you really are.


More nonsense, Mr. Krueger. Are you trying to express a thought? It
isn't working.


  #21   Report Post  
John Corbett
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

In article , "Arny Krueger"
wrote:

John Corbett wrote:

IIRC, PC-ABX allows tests with 5--100 trials, but the typical user is
going to do a test with 16 trials because that is suggested as a
standard, even though better choices are available.


Depends on how you define "Better choice".


There are examples elsewhere in this thread of better choices in specific
situations.
Theory for optimal test design in this sort of problem is based on work by
Neyman and Pearson and has been understood for about 70 years now.
Have you got a better idea?




Remember that the suggestion to use 16 trials is rooted in a hardware
limitation of the original ABX box; it has no real statistical basis.


Nope. 16 trials was originally suggested because it represents a nice
compromise between sensitivity and significance.


Two comments here.

1. Merely setting the sample size does not necessarily establish a balance
between sensitivity (i.e., power, or 1 - beta, associated with Type II errors)
and significance (alpha, associated with Type I errors).

That balance also depends on the choice of critical value.
Furthermore, you have to define just what the test is supposed to be
"sensitive" to; typically this involves stating what minimum effect the
experimenter wants to detect with the specified Type II error rate.

How does suggesting a test of 14/16 do that?



2. Arny is revising history now.
According to Google, here is Arny's earlier version of the story.

------------------------------------------------------------------

Deciding that a standard test is composed of 16 trials was arbitrary
as well.

The first ABX box was limited to 16 trials by the stepping relay that
formed its core. We worked with this for a while and over the years
found that it was a convenient number.

16 trials gives a nice balance between a really grueling test with
too many trials, and one that was too intolerant of mistakes.
Eventually the ABX box was expanded to 100 trials, but people kept on
doing tests with 16 trials because it just seemed to fit.

------------------------------------------------------------------






You need to choose the number of trials and the cutoff value in light
of Type I and Type II errors (perhaps with Bonferroni or similar
adjustments for multiple tests in some experiments).


See former comments about the need for a balance between sensitivity and
significance.



Whose comments? Arny's are content-free.




The one-size-fits-all approach of fixing sample size first is a
recipe for tests that either have little chance of success (low
power) or are wasteful (taking more time and unnecessarily fatiguing
listeners).


Which ignores a third present danger, which is a test that appears to have
statistically significant results, but is itself a chance occurrence.


Arny once again demonstrates that he does not understand how this works.
Many users of statistics are confused about this, so let's review a bit.

When we set up an experiment, we decide which possible outcomes will be
declared "significant" and which will not. Most often this choice is
merely to define any result with p < .05 (or .01) as significant.

Once the experiment is run, we _know_ whether or not the observed result
is significant. It doesn't just "appear" so.

But we don't know if that result is just a fluke or an actual sign that
something is going on. We can never tell if an individual significant
result from a particular experiment is real or just chance.

Significance just flags those results which are unlikely to have occurred
in a pure-chance situation; when a significant result is found, all we
know is that either a real effect is present or an unlikely chance event
has occurred. Significance does not attempt to distinguish between those
cases.

What we can do:

1. Repeat the experiment. If we can reliably get significant results in
many repeated experiments, then we have evidence that they are not all
merely chance results.
Then we come to believe that the observed behavior reflects a real effect.
(Even so, we could still have some chance results in our collection of
significant results, but we don't know which ones those might be; by that
point we wouldn't care anyway, as we would have established a general
underlying result.)

2. If we cannot---or don't want to---repeat the experiment, we could act
as if the result were not due to chance. Then significance serves as a
limited protection, describing the risk of being wrong when we choose to
act that way.

So standard practice by a competent statistician hardly ignores the issue
despite Arny's objection.
Do we want to ask how Arny's methods are any better?

Remember, statistical science is not about dumping data into a black box,
pushing a button, and having truth present itself!
Sadly, for too many people, that is what statistical methods seem to be.

JC
  #22   Report Post  
John Corbett
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

In article , ff123
wrote:

Ack, this isn't quite correct, either. It's more like, "if the
listener is guessing, the probability of getting at least this many
correct." But that's kind of verbose for a description.

ff123


Yup.

That's why it is helpful to have clearly defined technical terms, provided
they're used by people who understand what they mean.

Other attempts to avoid verbosity are often easily misinterpreted.
And once a bogus understanding is implanted, it can be really hard for
someone to see the correct version.

JC
  #23   Report Post  
Arny Krueger
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

John Corbett wrote:
In article , "Arny Krueger"
wrote:

John Corbett wrote:

IIRC, PC-ABX allows tests with 5--100 trials, but the typical user
is going to do a test with 16 trials because that is suggested as a
standard, even though better choices are available.


Depends on how you define "Better choice".


There are examples elsewhere in this thread of better choices in
specific situations.
Theory for optimal test design in this sort of problem is based on
work by Neyman and Pearson and has been understood for about 70 years
now. Have you got a better idea?


Sure. Instead of flogging the statistical analysis which is pretty
cut-and-dried, inherently boring and oh by the way a dead end, I decided to
focus on enhancing the performance of the listener and making the actual
equipment test more representative.


  #24   Report Post  
John Corbett
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

In article , "Arny Krueger"
wrote:

The posturing and trolling is that "one size fits all" is a relevent point
for discussion. The PCABX Compator hables up to 100 trials.


No one said the PCABX Comparator program could only do 16 trials.

But, as far as I can tell, its instructions say only this about sample size:

Your goal should be to obtain 1% or less "Probability You
Were Guessing", as calculated by the PCABX Comparator. *More
specifically, you should try for 14 correct answers out of 16 trials.*
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(This part is in BOLD type on the PCABX page.)





2. In general, type II errors are ignored on the websites which
popularize ABX testing, yours and mine included. There is not enough
emphasis on pre-planning tests, with specific objectives in mind that
drive alpha and beta values, number of samples/trials needed, etc.


I don't know about your web site, but PCABX does address type II errors a
number of different ways.


I'm not convinced the PCABX web site addresses Type II errors at all.

The concepts of Type I error and Type II error were introduced in 1933 by
Neyman and Pearson. They involve errors made by the EXPERIMENTER, not
errors made by the test SUBJECTS.
These have been standard terms for over 70 years.

Recall that the experimenter makes a Type I error when he rejects a true
null hypothesis, and makes a Type II error when he fails to reject a false
null.
(So, a Type I error is when the experimenter treats a chance event as if
it were evidence of a real effect, and a Type II error is when the
experimenter dismisses a true effect as just a chance result.)

Type I and Type II errors do _not_ refer to choices made by the subject on
each trial; they are whole-experiment level issues.

What PCABX refers to involves experimental technique, which Arny
mistakenly refers to as Type II errors. Here is what Arny has said:

----------------------------------------------------------------------
I deal with type 2 errors a little more gently at
http://www.pcabx.com/training/index.htm . There is this
carefully-worded paragraph:

"If you have difficulty completing any samples rated "Difficult" or
easier, please consider upgrading your listening environment
including loudspeakers, sound card, and listening environment. You
will do better in a quiet, undisturbed environment. The more
difficult tests require levels of hearing acuity that not every
person has. Some people may have physical hearing problems that they
are not aware of. Others are simply inexperienced with reliably
hearing differences in a blind listening test, and will find the
additional effort required to complete the training to be very
worthwhile."
----------------------------------------------------------------------

Clearly Arny is confused, perhaps because in ABX the same person can be
both subject and experimenter. Take off the subject hat and put on the
investigator hat when you work with Type I and Type II errors.




One major deficiency of the current discussion is
that it is entirely within the context of statistics - number of trials, how
the results of those trials are analyzed, etc.


How is this a major deficiency, exactly?

The area in which PCABX is most deficient is its statistical component.

That's where the most improvement can be made---the rest of it is actually
pretty good, for the most part.

Here are a few suggestions:

(1) Establish two modes, training and testing.

The training mode is much like the current setup, with the subject free to
stop whenever he feels like it.
Indicate after each trial if the answer is right or wrong.
Don't bother to display a significance figure, or perhaps even a tally of
number right and total number of trials.

The testing mode would work this way.
The user specifies a number of trials, and then the program proceeds much
as it now does.
But, there is no running display, and only at the end of the preset number
of trials are results displayed, with a p-value.
If the user quits before the preset number of trials, then those results
are not displayed; they are discarded.

This avoids the situation where the subject cuts a run short for some
reason, or does an extra trial because they felt lucky.
Remember that the binomial(n,theta) distribution applies to a fixed number
of trials.


(2) Another possibility would be for the user to propose what effect size
(theta) he wants to detect, and what alpha and beta he wants for error
control. Then the program would propose a sample size.
The user could then consider if he wants to proceed with that sample size
in test mode as above; if the proposed sample size is larger than he would
find comfortable, he could change his theta, alpha, and beta choices.

This would help avoid running underpowered tests or tests which
unnecessarily waste resources.
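
A rough sketch of how such a planner might work, in Python (standard library
only); the search strategy here is just one reasonable choice, not a
description of any existing program.

    from math import comb

    def tail(n, k, theta):
        # P(X >= k) for X ~ Binomial(n, theta)
        return sum(comb(n, i) * theta**i * (1 - theta)**(n - i)
                   for i in range(k, n + 1))

    def plan(theta, alpha, beta, n_max=200):
        # Smallest fixed-trial ABX design meeting both error targets:
        # find the smallest n with a cutoff k such that
        #   P(X >= k | theta = 0.5) <= alpha  and  P(X < k | theta) <= beta.
        for n in range(1, n_max + 1):
            for k in range(n, 0, -1):
                if tail(n, k, 0.5) > alpha:
                    break                     # smaller cutoffs only make alpha worse
                if 1 - tail(n, k, theta) <= beta:
                    return n, k
        return None

    print(plan(theta=0.75, alpha=0.01, beta=0.05))   # -> proposed (trials, cutoff)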

JC
  #25   Report Post  
Arny Krueger
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

John Corbett wrote:
In article , "Arny Krueger"
wrote:


The posturing and trolling is that "one size fits all" is a relevent
point for discussion. The PCABX Compator hables up to 100 trials.


No one said the PCABX Comparator program could only do 16 trials.


But you did use the phrase "one size fits all" in the context of 16 trials
now, didn't you Corbett? Now it seems that you're so embarrassed by the fact
that you said that, that you deleted the relevant text.

But, as far as I can tell, its instructions say only this about
sample size:


Your goal should be to obtain 1% or less "Probability You
Were Guessing" , as calculated by the PCABX Comparator. More
specifically, you should try for 14 correct answers out of 16
trials.


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (This part is in BOLD
type on the PCABX page.)


Exactly, but Corbett, you seem to be completely uninterested in my
rationale for saying that.

2. In general, type II errors are ignored on the websites which
popularize ABX testing, yours and mine included. There is not
enough emphasis on pre-planning tests, with specific objectives in
mind that drive alpha and beta values, number of samples/trials
needed, etc.


I don't know about your web site, but PCABX does address type II
errors a number of different ways.


I'm not convinced the PCABX web site addresses Type II errors at all.


Not my problem, Corbett. If you ever deign to politely ask what I meant, I
might even enlighten you.

delete redundant and insulting trivial pontifications about what Type 1
and Type 2 errors are.




  #26   Report Post  
ff123
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

On Sat, 05 Jun 2004 14:08:45 -0500, (John
Corbett) wrote:

Here are a few suggestions:

(1) Establish two modes, training and testing.

The training mode is much like the current setup, with the subject free to
stop whenever he feels like it.
Indicate after each trial if the answer is right or wrong.
Don't bother to display a significance figure, or perhaps even a tally of
number right and total number of trials.

The testing mode would work this way.
The user specifies a number of trials, and then the program proceeds much
as it now does.
But, there is no running display, and only at the end of the preset number
of trials are results displayed, with a p-value.
If the user quits before the preset number of trials, then those results
are not displayed; they are discarded.

This avoids the situation where the subject cuts a run short for some
reason, or does an extra trial because they felt lucky.
Remember that the binomial(n,theta) distribution applies to a fixed number
of trials.


I like this suggestion for the purposes of ABC/hr. The ABX component
of this particular application (ABC/hr) is largely a tool to improve
listener sensitivity, so implementing a true training mode is an
excellent idea. And eliminating altogether the running display of
results is probably the simplest solution to the sequential test
problem.

(2) Another possibility would be for the user to propose what effect size
(theta) he wants to detect, and what alpha and beta he wants for error
control. Then the program would propose a sample size.
The user could then consider if he wants to proceed with that sample size
in test mode as above; if the proposed sample size is larger than he would
find comfortable, he could change his theta, alpha, and beta choices.

This would help avoid running underpowered tests or tests which
unnecessarily waste resources.


Another great idea.

ff123
  #27   Report Post  
ff123
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

On Sat, 5 Jun 2004 20:00:09 -0400, "Arny Krueger"
wrote:

John Corbett wrote:
In article , "Arny Krueger"
wrote:


The posturing and trolling is that "one size fits all" is a relevent
point for discussion. The PCABX Compator hables up to 100 trials.


No one said the PCABX Comparator program could only do 16 trials.


But you did use the phrase "one size fits all" in the context of 16 trials
now, didn't you Corbett? Now it seems that you're so embarrassed by the fact
that you said that, that you deleted the relevant text.


Arny, if you're through, maybe you get past this so-called argument
about who meant what, and focus on improving PC-ABX. John Corbett's
suggestion is an eminently useful one, and the fact is that type II
errors should be directly addressed in the application, just like type
I errors are, not squirreled away somewhere on your website (not that
I've ever seen type II errors addressed there properly, either).

I know you have "Sensory Evaluation Techniques." So consult pp. 286-7
where it shows how to construct a spreadsheet in which n and theta
(called pd, for proportion of distinguishers in the book) are inputs,
and alpha, beta are outputs. It should be relatively easy to
rearrange things such that alpha, beta, and theta are inputs, with n
as an output as Corbett suggested.

Yes, the statistics may be boring, but if you're going to promote your
application as being scientific, it should be correct as far as
reasonably possible (while you're at it, get rid of that chi-square
stuff and calculate the binomial probability to get your p-values).

ff123
  #28   Report Post  
Arny Krueger
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

ff123 wrote:

Arny, if you're through, maybe you get past this so-called argument
about who meant what, and focus on improving PC-ABX. John Corbett's
suggestion is an eminently useful one, and the fact is that type II
errors should be directly addressed in the application, just like type
I errors are, not squirreled away somewhere on your website (not that
I've ever seen type II errors addressed there properly, either).


This is a very sad commentary on your abilities to perceive a working
solution to the problem of type two errors, even when it is laid out before
you in detail with a working model. Ironically, you've implemented some of
the solutions in your own work, and now seem to have forgotten it all!

The best solution to type two errors is not to further hone the statistics,
but to train the listener to be more sensitive.

Is it coming back to you now?


  #29   Report Post  
ff123
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

On Mon, 7 Jun 2004 06:18:51 -0400, "Arny Krueger"
wrote:

ff123 wrote:

Arny, if you're through, maybe you get past this so-called argument
about who meant what, and focus on improving PC-ABX. John Corbett's
suggestion is an eminently useful one, and the fact is that type II
errors should be directly addressed in the application, just like type
I errors are, not squirreled away somewhere on your website (not that
I've ever seen type II errors addressed there properly, either).


This is a very sad commentary on your abilities to perceive a working
solution to the problem of type two errors, even when it is laid out before
you in detail with a working model. Ironically, you've implemented some of
the solutions in your own work, and now seem to have forgotten it all!

The best solution to type two errors is not to further hone the statistics,
but to train the listener to be more sensitive.

Is it coming back to you now?


Are you serious? You're not talking to someone who's never performed
a proper ABX test!

Training the listener is a necessary but not sufficient condition for
reducing type II errors as much as possible. BTW, training includes
learning techniques for reducing listener fatigue.

And how does the listener know how much power the test needs? The
application can suggest it, depending on how subtle the difference is
judged to be!

And how is the listener supposed to judge the subtlety of the
difference? The application can provide a test mode, where the
listener can make a judgment for himself, without the perils of cherry
picking!

How exactly does PC-ABX or your website address test power vs.
subtlety of a difference?

I don't know how long it's been since I first suggested some
improvements to PC-ABX, but one of the reasons I wrote ABC/hr is
because I realized that those improvements would not be forthcoming.
Things like time-offset correction and playback range selection. You
know, not the boring statistical stuff, but stuff which actually
involves the audio.

PC-ABX can be improved. People are not necessarily posturing and
trolling for criticizing the weaknesses of your program.

ff123
  #30   Report Post  
Arny Krueger
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

ff123 wrote:
On Mon, 7 Jun 2004 06:18:51 -0400, "Arny Krueger"
wrote:

ff123 wrote:

Arny, if you're through, maybe you get past this so-called argument
about who meant what, and focus on improving PC-ABX. John Corbett's
suggestion is an eminently useful one, and the fact is that type II
errors should be directly addressed in the application, just like
type I errors are, not squirreled away somewhere on your website
(not that I've ever seen type II errors addressed there properly,
either).


This is a very sad commentary on your abilities to perceive a working
solution to the problem of type two errors, even when it is laid out
before you in detail with a working model. Ironically, you've
implemented some of the solutions in your own work, and now seem to
have forgotten it all!

The best solution to type two errors is not to further hone the
statistics, but to train the listener to be more sensitive.

Is it coming back to you now?


Are you serious? You're not talking to someone who's never performed
a proper ABX test!


I'm trying to make a point, here.

Training the listener is a necessary but not sufficient condition for
reducing type II errors as much as possible.


Says you.


BTW, training includes
learning techniques for reducing listener fatigue.


Agreed.

And how does the listener know how much power the test needs?


In the end he is obliged to figure that out for himself. He's usually
looking for a personal perception, not just an isolated number.

The application can suggest it, depending on how subtle the difference is
judged to be!


And how is the listener supposed to judge the subtlety of the
difference?


In ABX tests, the subtlety is not one of the dimensions that is tested for.
However, most people seem to figure that out for themselves, in terms that
are most meaningful to them, if they achieve positive identification.

The application can provide a test mode, where the
listener can make a judgment for himself, without the perils of cherry
picking!


I favor cherry picking for the purpose of obtaining unrealistically
sensitive results.

How exactly does PC-ABX or your website address test power vs.
subtlety of a difference?


In the formal sense, it does that by providing links to ABC/hr testing
tools.

In the informal sense, once a person has actually heard a very small
difference, and then perhaps been unable to hear a realistic but even
smaller difference, he often seems to develop an intuitive sense of test
power versus the subtlety of the difference he is listening for.

Among the people I hang with who have been doing DBTs for decades, I would
say that this is our consensus:

The canonical 14/16 ABX test combined with good test condition enhancements,
combined with a well-trained listener, is far more sensitive than is
practically needed.

IOW, we developed the means to routinely and reliably hear enhanced
differences between so-called good converters, optical players, and power
amps. The methodology might be extended to cover speaker wires as well.

We then sat back and looked at each other and asked a question that went
something like this: "If this power amp or player were sitting on your living
room floor all bought and paid for, would you bother to hook it up to get
the sound quality improvement it provides?" The answer was "No, if I hooked
it up, my primary motivation would be curiosity".

In a way that says it all - we achieved a kind of end-point of listener
sensitivity. The listener clearly heard a difference, and the difference is,
practically speaking, so subtle that he doesn't care. Addressing that difference
isn't going to remove 7 figurative veils or even 0.1 veil from his system's
realism quotient. The worst problems are someplace else.

I don't know how long it's been since I first suggested some
improvements to PC-ABX, but one of the reasons I wrote ABC/hr is
because I realized that those improvements would not be forthcoming.


I think that was a worthwhile effort, and for the exact reason you've
stated. That's one reason why my web site links a number of other ABX and
ABC/hr comparators. I think they are a good thing.

Things like time-offset correction and playback range selection. You
know, not the boring statistical stuff, but stuff which actually
involves the audio.


And there is my point: further polishing the statistics does not provide the
greatest benefit.

PC-ABX can be improved.


Not in a practical sense, and for the reasons stated. Oh, maybe I could
polish it and get more people to the
bored-with-listening-to-differences-that-small state, more quickly.

People are not necessarily posturing and
trolling for criticizing the weaknesses of your program.


In this case Corbett is, which is obviously his goal, based on his various
misleading comments and distortions of my statements.




  #31   Report Post  
John Corbett
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

In article , ff123
wrote:

I know you have "Sensory Evaluation Techniques." So consult pp. 286-7
where it shows how to construct a spreadsheet in which n and theta
(called pd, for proportion of distinguishers in the book) are inputs,
and alpha, beta are outputs. It should be relatively easy to
rearrange things such that alpha, beta, and theta are inputs, with n
as an output as Corbett suggested.


Ah, but theta is not just the proportion of distinguishers.

Consider a population in which some people detect perfectly (theta_1 = 1
for them), and others are guessing (with theta_2 = .5 ).

Suppose that the proportion who detect perfectly is psi (what you've
called pd, but let's save "p" for, um, p-values).

Now randomly select one person from that population, and administer a
single trial. The probability that the listener will get the right answer
is

theta = (psi)(theta_1) + (1 - psi)(theta_2)        (*)
      = (psi)(1) + (1 - psi)(.5).

If, for example, psi = 1/2, then a listener has a .75 chance of getting
the right answer on a single trial.

However, repeated trials for that same individual have theta = theta_1 or
theta_2, not the theta we've just computed. The theta values depend on
which distribution we sample from; here is a case where many trials by one
subject are not equivalent to many subjects each doing one trial.

Theta and psi are equivalent under the two-kinds-of-listeners threshold model.
We can use either one to parametrize our model, and can convert back and
forth, using (*) and

psi = max( 2*theta - 1 , 0 ).

This is mentioned in one of Burstein's JAES articles.


Yes, the statistics may be boring, but if you're going to promote your
application as being scientific, it should be correct as far as
reasonably possible (while you're at it, get rid of that chi-square
stuff and calculate the binomial probability to get your p-values).

ff123



Arny's use of chi-squared seems to be based on David Carlstrom's ABX
Statistics page http://www.provide.net/~djcarlst/abx_p9.htm .

There Carlstrom mentions both binomial and chi-squared statistics and
works an example where there the two methods don't agree.

But Carlstrom also makes what statisticians sometimes jokingly refer to as
a Type III error---solving the wrong problem.

We want to separate at-or-below-chance performance from
better-than-chance performance.

----------
Problem One
The usual test in an ABX experiment involves
the null hypothesis H_0: theta = .5
versus
the alternative hypothesis H_1: theta > .5 .
----------

We select a cutoff value such that scores above that value are seen as
evidence against H_0, thus indirectly supporting H_1.


----------
Problem Two
H'_0: theta = .5
versus
H'_1: theta not equal to .5 .
----------

The usual chi-squared test is designed to solve Problem Two, which is not
equivalent to Problem One.

The chi-squared statistic detects values larger than those expected under
the null hypothesis but it also detects values smaller than those expected
under the null hypothesis. So using chi-squared this way produces
p-values twice as large as you want if you think you are doing Problem
One. This effect is partially offset by the fact that the chi-squared
distribution is only asymptotically correct for the chi-squared
statistic.
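
A small Python comparison makes the point concrete for the 14-of-16 case
(standard library only; the chi-squared here is the usual Pearson statistic
on the correct/incorrect counts, as just described).

    from math import comb, erfc, sqrt

    def binomial_one_sided(n, k):
        # Exact answer to Problem One: P(X >= k) when theta = .5
        return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

    def chi_squared_p(n, k):
        # Pearson chi-squared on the (correct, incorrect) counts, 1 degree of freedom.
        # This addresses Problem Two, and only approximately.
        x2 = (2 * k - n) ** 2 / n
        return erfc(sqrt(x2 / 2))          # survival function of chi-squared, 1 df

    n, k = 16, 14
    print(binomial_one_sided(n, k))   # about 0.0021
    print(chi_squared_p(n, k))        # about 0.0027: larger than the exact one-sided value,
                                      # but less than double it, for the reasons given above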

JC
  #32   Report Post  
John Corbett
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

In article , "Arny Krueger"
wrote:

ff123 wrote:

Arny, if you're through, maybe you get past this so-called argument
about who meant what, and focus on improving PC-ABX. John Corbett's
suggestion is an eminently useful one, and the fact is that type II
errors should be directly addressed in the application, just like type
I errors are, not squirreled away somewhere on your website (not that
I've ever seen type II errors addressed there properly, either).


This is a very sad commentary on your abilities to perceive a working
solution to the problem of type two errors, even when it is laid out before
you in detail with a working model. Ironically, you've implemented some of
the solutions in your own work, and now seem to have forgotten it all!

The best solution to type two errors is not to further hone the statistics,
but to train the listener to be more sensitive.

Is it coming back to you now?


As I mentioned in an earlier post:

It is helpful to have clearly defined technical terms, provided
they're used by people who understand what they mean.

Once a bogus understanding is implanted, it can be really hard for
someone to see the correct version.

JC
  #33   Report Post  
Arny Krueger
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

John Corbett wrote:
In article , "Arny Krueger"
wrote:

ff123 wrote:

Arny, if you're through, maybe you can get past this so-called argument
about who meant what, and focus on improving PC-ABX. John Corbett's
suggestion is an eminently useful one, and the fact is that type II
errors should be directly addressed in the application, just like
type I errors are, not squirreled away somewhere on your website
(not that I've ever seen type II errors addressed there properly,
either).


This is a very sad commentary on your abilities to perceive a working
solution to the problem of type two errors, even when it is laid out
before you in detail with a working model. Ironically, you've
implemented some of the solutions in your own work, and now seem to
have forgotten it all!

The best solution to type two errors is not to further hone the
statistics, but to train the listener to be more sensitive.

Is it coming back to you now?


As I mentioned in an earlier post:

It is helpful to have clearly defined technical terms, provided
they're used by people who understand what they mean.

Once a bogus understanding is implanted, it can be really hard for
someone to see the correct version.


Corbett, someplace around your umpty-dumpth rant about statistical minutiae
and your hair-splitting related to word meanings, I realized that you don't
understand the problem at all.


  #35   Report Post  
John Corbett
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

In article , ff123
wrote:

On Mon, 07 Jun 2004 10:48:11 -0500, (John
Corbett) wrote:

In article , ff123
wrote:

Theta and psi are equivalent under the two-kinds-of-listeners threshold
model.
We can use either one to parametrize our model, and can convert back and
forth, using (*) and

psi = max( 2*theta - 1 , 0 ).

This is mentioned in one of Burstein's JAES articles.


Do you have the complete reference, by chance?


"Transformed Binomial Confidence Limits for Listening Tests"
J. Audio Eng. Soc. Vol. 37, No. 5, May 1989, pp. 363--367

He spells psi as p_k and theta as p_c . ;-)



The equations within
Sensory Evaluation Techniques assume one trial by multiple people, not
multiple trials by one person. I had assumed theta could be used
interchangeably with psi, but apparently not quite.


Even for a single person there is a difference between detecting (psi) and
correctly responding (theta).
Remember when you took multiple-choice exams where each question had 5
choices for an answer and the exam score was computed as (#right) -
(1/4)(#wrong)?
The "correction for guessing" just derives an estimate for psi from an
estimate for theta.
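
As a sketch (the exam numbers below are made up, not from the book): with
five choices the guessing rate is 1/5, and the familiar corrected score
(#right) - (1/4)(#wrong) is just n times the estimate of psi.

    def psi_hat(right, wrong, guess_rate=0.2):
        """Correction for guessing: estimate psi from the observed theta."""
        n = right + wrong
        theta_hat = right / n
        return (theta_hat - guess_rate) / (1.0 - guess_rate)

    # 60 right, 40 wrong on a 100-question, 5-choice exam:
    # theta_hat = 0.60, psi_hat = 0.50, and 60 - 40/4 = 50 = n * psi_hat
    print(psi_hat(60, 40))   # 0.5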

In fact, Sensory Evaluation Techniques does not mention effect size
(theta) in the context of type II errors, only the proportion of
distinguishers (pd, or psi). Is there a basic tutorial on these
subjects?

ff123


Effect size can be expressed as a value of theta (implicitly compared with
theta_0 associated with the null hypothesis, typically .5), or
a value of psi (implicitly compared to its null value psi_0, typically 0),
or as an explicit difference (delta) between the two thetas (or the two
psis).
(Of course, delta has the same numeric value as psi if psi_0 = 0.)

I don't have Sensory Evaluation Techniques, and it was a while ago when I
saw a copy, but I expect the idea of effect size is somewhere in it.



See "By The Numbers" (Burstein) in the February 1990 issue of Audio
magazine (pp. 43--48) for a nice intro.

You may also want to look at another Burstein paper:

"Approximation Formulas for Error Risk and Sample Size in ABX Testing"
J. Audio Eng. Soc. Vol. 36, No. 11, November 1988, pp. 879--883

I don't have Sensory Evaluation Techniques; I did check it out from a
library a year or so back. At that time I also checked out what I thought
was a better book (and it's about half the price):

Heymann & Lawless, "Sensory Evaluation in Food" (Aspen Publishers, 2001)

I think it's time for lunch.

JC


  #36   Report Post  
ff123
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

On Sat, 05 Jun 2004 14:08:45 -0500, (John
Corbett) wrote:

Here are a few suggestions:

(1) Establish two modes, training and testing.

The training mode is much like the current setup, with the subject free to
stop whenever he feels like it.
Indicate after each trial if the answer is right or wrong.
Don't bother to display a significance figure, or perhaps even a tally of
number right and total number of trials.


I am working on implementing this suggestion. I will probably keep
the tally of correct and total trials, and also include the p-value,
with the caveat stating that it is not strictly correct. This is both
to allow the listener to estimate the theta he should be using in the
normal (testing) mode and to assess his chances of achieving a p-value
less than critical alpha.

The testing mode would work this way.
The user specifies a number of trials, and then the program proceeds much
as it now does.
But, there is no running display, and only at the end of the preset number
of trials are results displayed, with a p-value.
If the user quits before the preset number of trials, then those results
are not displayed; they are discarded.

This avoids the situation where the subject cuts a run short for some
reason, or does an extra trial because he feels lucky.
Remember that the binomial(n,theta) distribution applies to a fixed number
of trials.


Is it not fair to allow the listener to set and change the number of
trials, critical alpha, theta, etc. all the way up to the point where
he presses "finish"? There can be no way for him to alter the results
of a test in progress so long as the tally is kept hidden (i.e., he
doesn't get any looks). In any case, I intend to allow this
flexibility. I also intend to allow the listener to reset the test at
any time up until he presses the finish button.

ff123
  #38   Report Post  
Arny Krueger
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

John Corbett wrote:

Here are a few suggestions:


(1) Establish two modes, training and testing.


Shows how little time you've actually spent looking at the site, Corbett.

There have always been two modes of operation at the PCABX web site,
training and testing.

(2) Another possibility would be for the user to propose what effect size
(theta) he wants to detect...

Shows once again how little time you've actually spent looking at the site, Corbett.

At the PCABX web site, users have always been able to specify what effect
size (theta) they want to detect... Furthermore, the site has been
structured to encourage them to start with larger effects and work down to
smaller effects. The effects have been selected so that the larger effects
are reasonably obvious. The smallest effects are difficult or impossible to
detect. A number of intermediate-sized effects are also provided.

Corbett, I'm really wondering how you expect anybody to take you seriously,
given your slap-dash analysis of the PCABX web site. You obviously never
looked at any of it, even for a few seconds. All you've ever seen of it is
the URL, right?



  #39   Report Post  
ff123
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

On Tue, 8 Jun 2004 22:00:39 -0400, "Arny Krueger"
wrote:

John Corbett wrote:

Here are a few suggestions:


(1) Establish two modes, training and testing.


Shows how little time you've actually spent looking at the site, Corbett.

There have always been two modes of operation at the PCABX web site,
training and testing.


He's talking specifically about the PC-ABX application, not the
website. The idea is to remove the errors associated with sequential
testing (testing mode), while simultaneously allowing the listener to
just noodle around (training mode). Training and testing modes could
be used synergistically: use the training mode to estimate what the
value of theta should be for the testing mode so that the number of
total trials can be suggested to control both type I and type II
errors.
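
A rough Monte Carlo sketch of the sequential-testing problem (illustrative
only; this is not PC-ABX code, and it assumes scipy is available): a
listener who is purely guessing, but who is allowed to stop as soon as the
running p-value looks good, reaches a "significant" result far more often
than the nominal 5%.

    import random
    from scipy.stats import binom

    def peeking_false_positive(max_trials=50, alpha=0.05, reps=20000):
        """Fraction of purely guessing listeners who reach p < alpha at
        some point when they may stop whenever the running tally looks good."""
        # Smallest score out of n whose one-sided p-value is below alpha.
        crit = {n: min((k for k in range(n + 1)
                        if binom.sf(k - 1, n, 0.5) < alpha), default=n + 1)
                for n in range(1, max_trials + 1)}
        hits = 0
        for _ in range(reps):
            correct = 0
            for n in range(1, max_trials + 1):
                if random.random() < 0.5:   # pure guessing
                    correct += 1
                if correct >= crit[n]:
                    hits += 1
                    break
        return hits / reps

    print(peeking_false_positive())   # well above the nominal 0.05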

(2) Another possibility would be for the user to propose what effect size
(theta) he wants to detect...

Shows once again how little time you've actually spent looking at the site, Corbett.

At the PCABX web site, users have always been able to specify what effect
size (theta) they want to detect... Furthermore, the site has been
structured to encourage them to start with larger effects and work down to
smaller effects. The effects have been selected so that the larger effects
are reasonably obvious. The smallest effects are difficult or impossible to
detect. A number of intermediate-sized effects are also provided.


None of this is quantitative. Corbett is suggesting a quantitative
(and generally accepted) way of specifying alpha, beta, and theta to
come up with an appropriate number of total trials.

Corbett, I'm really wondering how you expect anybody to take you seriously,
given your slap-dash analysis of the PCABX web site. You obviously never
looked at any of it, even for a few seconds. All you've ever seen of it is
the URL, right?


It's hard to take you seriously when PC-ABX doesn't even calculate the
right p-values, and you've never bothered to make even this simple
correction!

I have looked at both PC-ABX and your website (obviously). The
statistical concerns with PC-ABX are valid:

1. PC-ABX calculates inaccurate p-values
2. PC-ABX allows a mode in which sequential testing errors are not
controlled
3. PC-ABX does not suggest the number of trials to perform based on
listener-specified type II error risk and effect size.

ff123
  #40   Report Post  
John Corbett
 
Posts: n/a
Default A/B/X Testing (was: dB vs. Apparent Loudness)

In article , ff123
wrote:

On Tue, 08 Jun 2004 11:22:58 -0500, (John
Corbett) wrote:

I don't have Sensory Evaluation Techniques, and it was a while ago when I
saw a copy, but I expect the idea of effect size is somewhere in it.


Ok, I think I've got it now. What you call theta is called p_max in
Sensory Evaluation Techniques: the probability of a correct response
when the proportion of distinguishers is p_d.

p_o = probability of a correct guess = 0.5 for ABX.

For an individual performing repeated trials, p_d is equivalent to the
probability that he can hear a difference, and p_max = p_d +
p_o*(1-p_d)

(ignoring the psi/theta transformation)

So to work an example:

Let's say that we want:

critical alpha = 0.05
critical beta = 0.05
p_max = theta = 0.9 (probability of getting the correct answer)

So he can hear a difference 80% of the time

Then he should be able to meet this with 10 correct of 13 trials.



Bingo!



Playing with the spreadsheet gives one an idea of the enormity of the
task of controlling both type I and type II errors while testing
near-threshold differences.




That's a bit of an understatement, as I think you'll agree.

You might find some interesting stuff here:

http://www.stat.umn.edu/~corbett/ABX-plots

Most of these involve the power functions for 12/16 or 14/16 tests.
The power function is the probability of rejecting the null hypothesis as
a function of theta. So, for typical designs, you want power(.50) <= alpha,
and power(theta) as large as possible for theta values that are part of
the alternative. An ideal power function for this kind of test would be a
step function: 0 for theta < .5 and 1 for theta > .5 .
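
A few lines of Python reproduce the numbers behind those curves (a sketch
only, not the R code mentioned below; scipy is assumed):

    from scipy.stats import binom

    def power(theta, n, k):
        """P(score >= k out of n) when each trial is correct with prob. theta."""
        return binom.sf(k - 1, n, theta)

    # power(0.5, ...) is the type I error rate; 1 - power(theta, ...) at the
    # effect size of interest is the type II error rate.
    for theta in (0.5, 0.6, 0.75, 0.9, 1.0):
        print(theta, round(power(theta, 16, 12), 3), round(power(theta, 16, 14), 3))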

One pair shows the type I and type II errors (at threshold, taken as theta
= .75).

Another pair compares 1 subject doing 12 (or 14) trials versus a panel
doing one trial per member. Note that this set is in terms of psi, not
theta, so the usual threshold is .50 on these (but .75 on the others).

Yet another pair shows the sample size and critical values needed for
specified alpha and beta, as you have now figured out.

The last pair shows the minimum effect detectable with specified power
(power = 1 - beta).

There is one additional plot, as well. ;-)

If you want, I can provide the code (in R) to generate these.

For information about R, go to http://cran.r-project.org .

You might be able to implement these in a spreadsheet.
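
In the same spirit, here is a Python sketch (not the R code offered above;
scipy is assumed) that automates the sample-size search: find the smallest
n whose critical score controls both error rates.

    from scipy.stats import binom

    def abx_sample_size(alpha=0.05, beta=0.05, theta=0.9, max_n=1000):
        """Smallest n (and critical score k) with
        P(X >= k | theta = .5) <= alpha  and  P(X < k | theta) <= beta."""
        for n in range(1, max_n + 1):
            # smallest k whose one-sided p-value under the null is <= alpha
            k = min((k for k in range(n + 1)
                     if binom.sf(k - 1, n, 0.5) <= alpha), default=None)
            if k is not None and binom.cdf(k - 1, n, theta) <= beta:
                return n, k
        return None

    print(abx_sample_size(theta=0.90))   # (13, 10): 10 correct of 13 trials
    print(abx_sample_size(theta=0.75))   # (42, 27): 27 correct of 42 trials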


Another example:

critical alpha = 0.05
critical beta = 0.05
p_max = 0.75 (listener can hear a difference 50% of the time)

Then one needs 27 correct out of 42 trials.

ff123


Now I don't think you'd be so quick to agree with a claim that the
statistics part of DBTs is boring. ;-)

JC