#41  Arny Krueger
A/B/X Testing (was: dB vs. Apparent Loudness)

ff123 wrote:

On Tue, 8 Jun 2004 22:00:39 -0400, "Arny Krueger" wrote:

John Corbett wrote:

Here are a few suggestions:


(1) Establish two modes, training and testing.


Shows how little time you've actually spent looking at the site,
Corbett.


There have always been two modes of operation at the PCABX web site,
training and testing.


He's talking specifically about the PC-ABX application, not the
website.


GMAB, that's not what Corbett said at all. He based his argument on an
out-of-context quote of something he found on the web site, something that
is not part of the PCABX application.

But let's say that he was talking about the PCABX application all by itself.
Arguing about the PCABX application all by itself is a straw man argument
because the PCABX application is presented in the context of the web site.
The core of the PCABX web site is not the PCABX application. In fact the web
site makes a strong point of presenting the PCABX application as a tool that
can be replaced by a number of other similar tools, some of which may be
superior to it in some senses.

I guess it would be interesting to hear you and Corbett pontificate about
what is the most important single thing on the PCABX web site. I'm sure
you'll both get it wrong. I'll give you a hint - count your fingers. What
you are looking for is like your hands, and is as important to subjective
testing as your fingers are to your hands.

The idea is to remove the errors associated with sequential
testing (testing mode), while simultaneously allowing the listener to
just noodle around (training mode).


Been there, done that, but in another critical part of the web site that
Corbett shows no knowledge or understanding of.

Training and testing modes could
be used synergistically: use the training mode to estimate what the
value of theta should be for the testing mode so that the number of
total trials can be suggested to control both type I and type II
errors.


It seems to be difficult or impossible to convince statistics junkies that
there is more to experimental design than statistics. We spent about 10
years looking at reducing type II errors by jacking up the number of trials
by various means. We decided that beyond a certain point this was a bogus
approach, practically speaking. The best way forward at that point turned out
to be making major gains in listener sensitivity: not jacking up the number
of trials, but training and enabling the listener to do a better job of
hearing differences.

This raises the question, "Why not do both?" The answer to that question
should be found by stepping back from a headlong rush towards sensitivity
for the sake of sensitivity. Instead, you have to look at the practical
relevance of the results that you are getting once you have a certain
combination of listener sensitivity, and statistical detection of
differences in the results.

(2) Another possibility would be for the user to propose what effect
size (theta) he wants to detect...


Shows once again how little time you've actually looked at the site,
Corbett


At the PCABX web site, users have always been able to specify what
effect size (theta) they want to detect... Furthermore, the site has
been structured to encourage them to start with larger effects and
work down to smaller effects. The effects have been selected so that
the larger effects are reasonably obvious. The smallest effects are
difficult or impossible to detect. A number of intermediate-sized
effects are also provided.


None of this is quantitative.


Say what?

It is very quantitative. The size of the effect is formally known for all
samples on the PCABX site for which there are known and generally
agreed-upon ways to quantify effect size. This includes the vast
majority of the samples. The only exceptions that come to mind are
the perceptual coder samples, because AFAIK there are no known and
generally agreed-upon ways to quantify effect size for them.

Corbett is suggesting a quantitative
(and generally accepted) way of specifying alpha, beta, and theta to
come up with an appropriate number of total trials.


Here we go again - trying to find out information that is practically
irrelevant, by jacking up the total number of trials.

Corbett, I'm really wondering how you expect anybody to take you
seriously, given your slap-dash analysis of the PCABX web site. You
obviously never looked at any of it, even for a few seconds. All
you've ever seen of it is the URL, right?


It's hard to take you seriously when PC-ABX doesn't even calculate the
right p-values, and you've never bothered to make even this simple
correction!


You've got me confused with someone who is interested in wasting anybody's
time, including my own, by splitting hairs.

I have looked at both PC-ABX and your website (obviously). The
statistical concerns with PC-ABX are valid:


1. PC-ABX calculates inaccurate p-values


So what? On the best day of their life, p-values are just a guide towards a
larger goal that transcends mere statistics.
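For reference, the exact one-sided p-value for an ABX score is just a
binomial tail probability under the guessing hypothesis (p = 0.5 per trial).
A minimal Python sketch (the function name is mine, not anything from PC-ABX):

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """Exact one-sided binomial p-value for an ABX test: the
    probability of getting at least `correct` answers right out of
    `trials` by pure guessing (p = 0.5 per trial)."""
    favorable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favorable / 2 ** trials

# Example: 13 correct out of 16 trials
print(abx_p_value(13, 16))  # 697/65536, about 0.0106
```

Approximations (normal with or without continuity correction) and off-by-one
errors (P(X > k) instead of P(X >= k)) are the usual ways this computation
goes wrong.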

2. PC-ABX allows a mode in which sequential testing errors are not
controlled


So what? Anybody who thinks that a purely statistical approach can actually
quantify all relevant sequential testing errors has missed many important
points of experimental design.
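Whatever one thinks of the larger point, the statistical inflation from
uncontrolled sequential testing is real and easy to demonstrate by
simulation. A sketch, assuming a stopping rule of peeking at the exact
p-value after every trial and stopping as soon as it reaches 0.05:

```python
import random
from math import comb

def p_value(correct, trials):
    """Exact one-sided binomial p-value under guessing (p = 0.5)."""
    return sum(comb(trials, k)
               for k in range(correct, trials + 1)) / 2 ** trials

def peeking_false_positive_rate(max_trials=40, sims=2000, seed=1):
    """Simulate a purely guessing listener (no audible difference) who
    checks the p-value after every trial and stops as soon as it drops
    to 0.05 or below.  Returns the fraction of simulated sessions that
    ever 'detect' a difference."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        correct = 0
        for n in range(1, max_trials + 1):
            correct += rng.random() < 0.5  # coin-flip answer
            if p_value(correct, n) <= 0.05:
                hits += 1
                break
    return hits / sims

print(peeking_false_positive_rate())  # well above the nominal 0.05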

3. PC-ABX does not suggest the number of trials to perform based on
listener specified type II error risk and effect size.


So what?

Anybody who thinks that a purely statistical approach can actually quantify
type II error risk has missed many important points of experimental design.




#42  ff123

On Tue, 08 Jun 2004 23:11:04 -0500, John Corbett wrote:

Playing with the spreadsheet gives one an idea of the enormous number of
trials required to control both type I and type II errors while testing
near-threshold differences.




That's a bit of an understatement, as I think you'll agree.


I think I've decided that, since ABC/hr is mostly used for perceptual
codec testing, I may make some presets, the one immediately below
being the most important:

defect description: "moderate"
critical alpha = 0.05
critical beta = 0.2
theta = 0.9 to 0.95 (listener can hear a difference 80 to 90% of the
time)
suggested correct/total trials: 7/8

defect description: "subtle"
critical alpha = 0.05
critical beta = 0.2
theta = 0.8 (listener can hear a difference 60% of the time)
suggested correct/total trials: 13/18

defect description: "obvious"
critical alpha = 0.05
critical beta = 0.2
theta = 0.995 (listener can hear a difference 99% of the time)
suggested correct/total trials: 5/5

Type II errors are not so much of a concern for codec comparisons. I
am trying to detect differences, not to verify similarity.


I will probably implement the "N suggester" via lookup table allowing
the following choices:

critical alpha: {0.05, 0.01}
critical beta: {0.05, 0.1, 0.2}
theta: {0.995, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65}


Note that for similarity testing, critical beta is typically set low
(sometimes lower than 0.05), and theta is typically even lower than
0.65. And sometimes critical alpha is allowed to rise to 0.1 or even
higher. But since I am not interested in this type of testing, such
values will not be allowed.
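A lookup table like the one described above can be generated directly from
the binomial distribution. A sketch of one way to do it (function names are
mine; this is not the actual ABC/hr code): search for the smallest n, with
its critical score k, satisfying both error constraints. Here theta is the
per-trial probability of a correct answer, so a listener who genuinely hears
the difference a fraction d of the time and guesses otherwise has
theta = 0.5 + d/2 (e.g. d = 0.8 gives theta = 0.9).

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))

def suggest_trials(alpha, beta, theta, max_n=200):
    """Smallest (n, k) such that scoring k or more out of n rejects
    pure guessing at level alpha, while a listener with per-trial hit
    rate theta misses that criterion with probability at most beta."""
    for n in range(2, max_n + 1):
        # smallest critical score controlling the type I error at this n
        k = next((k for k in range(1, n + 1)
                  if binom_tail(k, n, 0.5) <= alpha), None)
        if k is not None and 1 - binom_tail(k, n, theta) <= beta:
            return n, k
    return None

# Reproduces the presets quoted above:
print(suggest_trials(0.05, 0.2, 0.995))  # (5, 5)   "obvious"
print(suggest_trials(0.05, 0.2, 0.9))    # (8, 7)   "moderate"
print(suggest_trials(0.05, 0.2, 0.8))    # (18, 13) "subtle"
```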

ff123