Re: Why DBTs in audio do not deliver (was: Finally ... The Furutech


July 3rd 03, 04:06 PM
>Just so that we don't misunderstand exactly what is happening during
>one of these comparative tests:
>
>Note the implementation of ABC/Hr being used:
>
>http://ff123.net/abchr/abchr.html
>
>The reference is not being presented as another codec. In the process
>of rating each codec, the listener must identify the hidden
>reference. This is different from MUSHRA, where there is a hidden
>reference, but which is typically presented as another codec. I think
>we're both on the same page when talking about ABC/Hr, but I just want
>to make sure.
>
>When I speak of a bias, I am mostly concerned that one codec is not
>falsely preferred over another.
>
>In the ABC/Hr implementation we're discussing, suppose a listener
>downrates the hidden reference for one or more codecs. Such a
>listener is not an *outlier* -- he is clearly *incorrect*. Then, the
>procedure used would be to discard that particular listener's entire
>set of results for all codecs.
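
To make that discard procedure concrete, here is a minimal sketch of the
screening rule in Python. The data layout, field names and the 1.0-5.0
rating scale are assumptions for illustration only, not taken from the
ff123 implementation:

# Hypothetical sketch of the screening rule described above: if a listener
# rates the hidden reference as degraded for any codec, that listener's
# entire set of results is discarded.  Layout and names are illustrative.
def screen_listeners(results):
    """results: {listener: {codec: {"reference": score, "coded": score}}}"""
    kept = {}
    for listener, ratings in results.items():
        downrated_ref = any(r["reference"] < 5.0 for r in ratings.values())
        if not downrated_ref:
            kept[listener] = ratings
    return kept

example = {
    "listener1": {"codecA": {"reference": 5.0, "coded": 3.5},
                  "codecB": {"reference": 5.0, "coded": 4.0}},
    "listener2": {"codecA": {"reference": 4.0, "coded": 5.0},  # downrated the reference
                  "codecB": {"reference": 5.0, "coded": 4.5}},
}
print(list(screen_listeners(example)))  # ['listener1'] -- listener2 is dropped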

After checking the test description, I see that the test has a built-in
assumption that the reference is best, and asks the subject to
rate the codec in terms of the amount of "annoying degradation".
If a subject is consistently rating a hidden reference as degraded
under those conditions, I would think that what's happening is this:

The subject hears that particular codec as euphonic. That doesn't fit
the test assumption, and there isn't any way to rate the codec as better
than the reference. The easy way out of that dilemma, rather than
thinking about it enough to recognize what has happened,
is to assume the reference is the better one as the tester
says it should be, and if the subject thinks otherwise they must
have misidentified it. So they pick the better one as the reference
and report degradation on the other one.

If you discard the data from the people who did this, rather than
recognizing what it means, wouldn't that create a severe bias in this
testing procedure against any codec that sounds euphonic to some
people?

Bob

Arny Krueger
July 4th 03, 06:38 PM
> wrote in message


> After checking the test description, I see that the test has a built-in
> assumption that the reference is best, and asks the subject to
> rate the codec in terms of the amount of "annoying degradation".

I think it is safe to say that the ABC/hr test is based on the
widely-accepted concept of "sonic accuracy".

> If a subject is consistently rating a hidden reference as degraded
> under those conditions, I would think that what's happening is this:

> The subject hears that particular codec as euphonic. That doesn't fit
> the test assumption, and there isn't any way to rate the codec as
> better than the reference. The easy way out of that dilemma, rather
> than thinking about it enough to recognize what has happened,
> is to assume the reference is the better one as the tester
> says it should be, and if the subject thinks otherwise they must
> have misidentified it. So they pick the better one as the reference
> and report degradation on the other one.

Within the context of the concept of sonic accuracy there is no way that
anything can sound better than perfect reproduction of the reference.

> If you discard the data from the people who did this, rather than
> recognizing what it means, wouldn't that create a severe bias in this
> testing procedure against any codec that sounds euphonic to some
> people?

Within the context of the sonic accuracy model, there is no such thing as
euphonic coloration. All audible changes represent inaccurate reproduction,
which is undesirable. This is reasonable because there is no universal
agreement about what constitutes euphonic coloration.

S888Wheel
July 7th 03, 03:42 AM
>> I said
>>
>> >> Wrong. In the Dave Clark test listener #2 got 30/48 correct with a
>> >> statistical reliability of hearing a difference of 94%. Listener #6 got
>> >> 26/48 with a statistical probability of 84% chance of hearing differences.
>> >> Listener #15 got 15/21 correct with an 81% chance of hearing a difference.

Keith hughs said

>I'd like to know how these probabilities were calculated. For
>example, in the case where the listener heard a difference in 30
>of 48 trials, one can do a simple F-test between populations (i.e.
>30/48 versus 24/48 - the expectation if results are due to random
>guessing). For this example, F-critical is 1.623, with F-calculated
>of 1.067, i.e. the population variances are not different at the
>.05 level. One can then compare the means of the populations using
>a simple t-test for populations with equal variances. The
>t-critical is then 1.99 - two-tailed, or 1.66 - one-tailed, with
>t-calculated = 1.23. Thus the means (i.e. RESULTS of the listener
>vs. random guessing) are not statistically different, at the 0.05
>(95% probability) level. Nor are they significant at the 0.10 (90%
>confidence) level.
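
For anyone who wants to check those figures, here is a small sketch that
reproduces the calculation described above, assuming each trial is coded
as 1 (correct) or 0 (incorrect) and that SciPy is available; the coding
scheme is my reconstruction, not necessarily the worksheet actually used:

# Reconstruction of the F-test / t-test comparison quoted above:
# 30/48 correct for the listener versus 24/48 expected from guessing,
# with each trial coded 1 (correct) or 0 (incorrect).
import numpy as np
from scipy import stats

listener = np.array([1] * 30 + [0] * 18)   # 30 of 48 correct
guessing = np.array([1] * 24 + [0] * 24)   # chance expectation

var_l = listener.var(ddof=1)
var_g = guessing.var(ddof=1)
f_calc = max(var_l, var_g) / min(var_l, var_g)      # ~1.07
f_crit = stats.f.ppf(0.95, 47, 47)                  # ~1.62 -> variances "equal"

t_calc, p_two = stats.ttest_ind(listener, guessing, equal_var=True)
t_crit_two = stats.t.ppf(0.975, 94)                 # ~1.99
t_crit_one = stats.t.ppf(0.95, 94)                  # ~1.66
print(f_calc, f_crit, t_calc, t_crit_two, t_crit_one, p_two)
# t ~1.23 falls short of either critical value, i.e. not significant at 0.05.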

I got the numbers from the article.

>> Bob said
>>
>> >
>> >I don't recall this article, but this conclusion seems to be well
>> >supported by the data you cite. If the best performance of the group
>> >wasn't statistically significant at a 95% confidence level, then it's
>> >perfectly reasonable to say that no listener was able to identify the
>> >amps in the test. (Note: Saying they couldn't is not the same as
>> >saying they can't. As I noted above, we can never say definitively
>> >that they can't; we can only surmise from their--and everybody
>> >else's--inability to do so.)

I said

>
>> That is ridiculous. If all of them scored 94% it would be reasonable to
>> say this?
>

Kieth said

>
>I believe Bob is talking about 95% confidence interval, *not* 95%
>scores.

Yes I know.

Kieth said

> And yes, it is very common to require a 95% confidence
>level.

How often are results at a 94% confidence level regarded as a null when one is
seeking 95% confidence, which is in fact a subjective choice? The bottom line
is that the tests were inconclusive in this particular case.

I said

>
>> No. It all depends on how they fall into the bell curve. But even this is
>> problematic for two reasons. 1. The listeners were never tested for
>> sensitivity to subtle differences. The abilities of the participants will
>> profoundly affect any bell curve.
>

Kieth said

>No, the variance in those abilities is the *cause* of the bell
>curve.

I was using the predicted bell curve outcome one would get if there are no
audible differences. That bell curve is dictated by the number of samples.

Kieth said

> That's why sample size (panel size in this context) is so
>important, and why the weight given to one or two individuals'
>performances, in any test, must be limited.

I took that into consideration when I claimed the test results were
inconclusive. That is why I suggested the thing that should have been done was
further testing of those individuals and the equipment that scored near or
beyond that which the bell curve predicts. That is why I don't run around
claiming this test proves that some people hear differences. Let's also not
forget that we have no tests on listener or system sensitivity to subtle
audible differences. So we have unknown variables. Further, the use of many amps
in no obvious pattern for comparisons introduces another variable that could
profoundly affect the bell curve in unknown ways if some amps sound like each
other and some don't. All of this leaves the interpretation of those results
wide open.

Kieth said

>Also why such high
>criteria such as a 95% CI are applied. Having a limited
>population, one cannot always assume a bell curve, making outliers
>more difficult to identify.
>

Sorry, but this is in some ways an arbitrary number. If a dozen people took such
tests and half of them scored between 90% and 94% confidence, then this 95%
limit would be an unreasonable one. OTOH, if a hundred people took the test and
one scored at 95% confidence, I think you could argue that this does fall within
the predicted bell curve of a null result. As it is, as far as I can tell some
of the results of the Clark test are close enough to the edge of the bell curve
prediction, or beyond it, to warrant further investigation and to preclude
anyone from drawing definitive conclusions. Statistical analysis is not my
strength in math, but that is how it looks to me.
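
One way to put a rough number on that, assuming 48 trials per listener and
pure guessing (a back-of-the-envelope illustration of my own, not figures
from the article): the chance that at least one of N guessing listeners
clears the 95% hurdle grows quickly with N.

# Back-of-the-envelope illustration: with enough listeners who are purely
# guessing, some are likely to land at or past the 95% threshold by luck.
# Assumes 48 trials per listener and p = 0.5, which will not match every
# listener in the actual test.
from math import comb

N_TRIALS = 48

def tail(k, n=N_TRIALS, p=0.5):
    """One-tailed probability of k or more correct by guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

k_crit = min(k for k in range(N_TRIALS + 1) if tail(k) <= 0.05)  # 31 correct
alpha = tail(k_crit)                       # per-listener false-alarm rate, ~0.03

for n_listeners in (12, 25, 100):
    p_any = 1 - (1 - alpha) ** n_listeners
    print(n_listeners, round(p_any, 2))    # ~0.30, ~0.53, ~0.95
# With ~100 guessers, one listener at or near the 95% level is unremarkable;
# with a dozen, it is more interesting but still not proof.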

I said

>> 2. Many different amps were used. We have no way of knowing
>> that we didn't have a mix of some amps sounding different and some sounding
>> the same. The Counterpoint amp was identified with a probability of 94%.
>> Given there were 8 different combinations, it fell out of the predicted bell
>> curve, if my math is right.

Kieth said

>
>What math did you use? Not having the data you're using to hand, I'm
>curious how you calculated such a high probability.

I used the numbers given in the article. The article gave a 94% confidence
level on the Counterpoint amp compared to an NAD amp. There were a total of 8
comparisons between different amps. I was rounding by the way. The number is
actually 94.4%.
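
For what it's worth, a score of 30 correct out of 48 reproduces that figure
almost exactly if the calculation is a one-tailed exact binomial test; that
the article used this method is my assumption, since its analysis isn't
quoted here.

# One plausible derivation of the ~94.4% figure: the one-tailed exact
# binomial confidence that a score this good is not just guessing.
# 30/48 is the score quoted earlier in the thread; whether the article
# computed it this way is an assumption on my part.
from math import comb

def confidence(correct, trials, p=0.5):
    p_chance = sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
                   for k in range(correct, trials + 1))
    return 1 - p_chance

print(round(confidence(30, 48) * 100, 1))   # 94.4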

I said

>
>> Bottom line is you cannot draw definitive
>> conclusions either way. If the one listener had made one more correct ID he
>> would have been well above 94%. I doubt that one can simply make 95%
>> probability a barrier of truth.
>

Kieth said

>
>You should read more scientific literature then. It is the most
>commonly used confidence level IME. Especially with limited
>population size.
>

I will look into it, but I will be quite surprised if that number does not
heavily depend on the sample sizes. It has to vary with sample size.

I said

>
>> It is ridiculous. Bell curves don't work that
>> way.

Kieth said

>Nonsense, of course they do. The fact that there are two-tailed
>bell curves, in most population responses, is the genesis for use
>of high confidence intervals. Because there *are* tails - way out
>on the edges of the population, allowance for these tails must be
>made when comparing populations for significant heterogeneity.
>

Maybe you didn't get what I was trying to say. What does and does not fall
within the predictions of a bell curve depends heavily on the number of
samples.

I said

>
>> No foot stamping is needed. The test was quite inconclusive.

Kieth said

>From what's been presented here, I'd say it was quite conclusive.
>The data do not support rejecting the null hypothesis (i.e. that
>the amps sound the same), for the test population, under the test
>protocol and conditions used. Not knowing the population size, or
>protocol used, I wouldn't venture an opinion on whether it may or
>may not be applicable to the broader population.
>

So you are drawing definitive conclusions from one test without even knowing
the sample size? I think you are leaping without looking. You are entitled to
your opinions. I don't find your arguments convincing so far.

I said

>
>> Had the test been
>> put in front of a scientific peer review with the same data and the same
>> conclusions, that panel would have sent it back for corrections. The
>> analysis was wrong, scientifically speaking.

Kieth said

>
>Doesn't appear to be, based on what's been posted here. You appear
>to think that definitive results are required, in support of some
>postulate, for the conclusion to be valid (or acceptable for a
>review board). This is simply incorrect. Oftentimes, the data
>obtained from a test are not useful, or positive, relative to
>supporting a postulate, but that does not invalidate the data.
>Clearly, failure to reject the null hypothesis, in any experiment,
>does not invalidate the test, nor is it something requiring
>"correction". It is merely part of the database that the author,
>and others, build on in future research.

Where do I claim definitive results are required in support of some postulate? I
would say definitive results are required for one to make claims of definitive
results. Certainly such a test, as a part of a body of evidence, could be seen as
supportive, but even that I think is dodgy, given that some of the results as far
as I can see would call for further testing, and the absence of listener
sensitivity testing alone makes it impossible to make any specific claims about
the results. I never said the test was not valid due to its failure to reject the
null. I simply said I think the results are such that they call for further
investigation and are on the border between a null and a positive. Not the sort
of thing one can base definitive conclusions on.

Kieth said

>
>It's instructive to also note that peer reviewed journals publish,
>not infrequently, two or more studies that contradict one another.
>You seem to believe this can't be the case, because the "wrong"
>ones would be sent back for "correction".

Nonsense. What I think will be sent back is a poor analysis. If the analysis
fits the data it won't be sent back. If someone is making claims of definitive
conclusions based on this test one is jumping the gun. Scientific research
papers with huge sample sizes and apparently definitive results are usually
carefully worded to not make definitive claims. That is a part of proper
scientific prudence.

Kieht said

>ones would be sent back for "correction". A look through some
>select peer reviewed journals, such as the "Journal of
>Pharmaceutical Science and Technology", will show that to be
>mistaken.

Really? I'd like to see any peer reviewed published studies that make definitive
claims based on sample sizes of 25 participants, especially when one had a 94%
confidence score.

Kieth said

>Often in the case of biological studies, conflicting
>data are obtained, often due to unknown (at the time) and/or
>uncontrolled variables (often inherent to the specific population
>under study). These data, while contradictory, are still of great
>value for future studies.

I never suggested the data should be ignored or was valueless, only that it was
inconclusive. I think the many uncontrolled variables in this specific test are
problematic. Don't you?

Nousaine
July 7th 03, 06:13 AM
"Keith A. Hughes" wrote:

...snips........

S888Wheel wrote:

>> Had the test been
>> put in front of a scientific peer review with the same data and the same
>> conclusions, that panel would have sent it back for corrections. The
>> analysis was wrong, scientifically speaking.
>
>Doesn't appear to be, based on what's been posted here. You appear
>to think that definitive results are required, in support of some
>postulate, for the conclusion to be valid (or acceptable for a
>review board). This is simply incorrect. Oftentimes, the data
>obtained from a test are not useful, or positive, relative to
>supporting a postulate, but that does not invalidate the data.
>Clearly, failure to reject the null hypothesis, in any experiment,
>does not invalidate the test, nor is it something requiring
>"correction". It is merely part of the database that the author,
>and others, build on in future research.
>
>It's instructive to also note that peer reviewed journals publish,
>not infrequently, two or more studies that contradict one another.
>You seem to believe this can't be the case, because the "wrong"
>ones would be sent back for "correction". A look through some
>select peer reviewed journals, such as the "Journal of
>Pharmaceutical Science and Technology", will show that to be
>mistaken. Often in the case of biological studies, conflicting
>data are obtained, often due to unknown (at the time) and/or
>uncontrolled variables (often inherent to the specific population
>under study). These data, while contradictory, are still of great
>value for future studies.
>
>Keith Hughes

What's more interesting is that no contradictory experimental data have ever
been published anywhere. No manufacturer, distributor, retailer or enthusiast
has ever demonstrated an ability to hear differences among nominally competent
amps, wires or parts under bias-controlled listening conditions in normally
reverberant environments (including their personal reference systems).

Keith A. Hughes
July 7th 03, 07:50 PM
S888Wheel wrote:
>
<snip>

> I said
>
> >
> >> That is ridiculous. If all of them scored 94% it would be reasonable to
> >> say this?
> >
>
> Kieth said
>
> >
> >I believe Bob is talking about 95% confidence interval, *not* 95%
> >scores.
>
> Yes I know.

Then talking about "scoring 94%" is meaningless.

> Kieth said
>
> > And yes, it is very common to require a 95% confidence
> >level.
>
> How often are results at a 94% confidence level regarded as a null when one
> is seeking 95% confidence, which is in fact a subjective choice? The bottom
> line is that the tests were inconclusive in this particular case.

You clearly don't understand statistical analysis. There is a null
hypothesis that states, basically, that the populations (in this
case, say the sound of Amp A and the sound of Amp B) are not
different. You set your confidence level *prior* to your analysis,
and yes 95% is ubiquitous, and you either meet that level, or you
don't. There is no gray area.

Irrespective of how "close" you may get to reaching the confidence
level, you're either in, or out. When out, the results *ARE*
conclusive - always. That conclusion is "you cannot reject the
null hypothesis at the 0.05 level", *by definition*. Period. You
cannot say for example "well, it was close, so they're probably
different". You *can* calculate what confidence level within which
you *can* reject the null hypothesis, if you want (it's
approximately 78%, by my calculations, for a result of 30/48
versus baseline expectations of 24/48). A number that would be
considered insignificant by all statisticians I've worked with.

But again, by expanding the CI, you risk erroneously rejecting the
null hypothesis unless the population is very large (i.e.
sufficiently so to approximate a two-tailed bell curve).
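
As a cross-check on that ~78% figure, assuming it is simply one minus the
two-tailed p-value of the t result quoted earlier (t of about 1.23 with
roughly 94 degrees of freedom), it can be reproduced like this:

# Cross-check of the ~78% figure, assuming it is 1 minus the two-tailed
# p-value of the earlier two-sample t result (t ~= 1.23, df ~= 94).
from scipy import stats

t_calc, df = 1.23, 94
p_two = 2 * stats.t.sf(t_calc, df)          # ~0.22
print(round((1 - p_two) * 100))             # ~78
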
<snip>

> >No, the variance in those abilities is the *cause* of the bell
> >curve.
>
> I was using the predicted bell curve outcome one would get if there are no
> audible differences. That bell curve is dictated by the number of samples.

The "Bell" curve is dictated by varying response levels. The
degree to which the specific curve approximates a normal
distribution *is* a function of population size (or number of
samples if you prefer). Hence the need for a very high CI when
using small populations (15 panelists, for example), as you do
*not* have a high degree of confidence that the population has a
normal distribution.

> Kieth said
>
> > That's why sample size (panel size in this context) is so
> >important, and why the weight given to one or two individuals'
> >performances, in any test, must be limited.
>
> I took that into consideration when I claimed the test results were
> inconclusive.

Well, again, I think you're missing the common usage of "conclusive"
relative to statistical analysis. As stated previously, there are
only two outcomes of statistical analysis (i.e. typical ANOVA),
reject or accept the null hypothesis. Either is conclusive, as
this one apparently was. The data don't allow you to reject the
null hypothesis at even the 0.1 level.

So, you cannot say a difference was shown. You can't say there was
no difference either. This appears to be the genesis of your
"inconclusive" apellation. But this is an incorrect
interpretation, as detailed above.

> That is why I suggested the thing that should have been done was
> further testing of those individuals and the equipment that scored near or
> beyond that which the bell curve predicts. That is why I don't run around
> claiming this test proves that some people hear differences. Let's also not
> forget that we have no tests on listener or system sensitivity to subtle
> audible differences.

Actually, we *assume* variation in listener abilities and
responses. Without such, there would be no bell curve, and
statistical analysis would be impossible. The *only* requirement
is that the test population is sufficiently large such that they
approximate the population as a whole, relative to response to the
stimuli under test. Again, the smaller the test population, the
tighter the CI *must* be due to the lower confidence in the test
population having the same distribution as the total population.
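
A toy simulation of that last point, with an assumed population model
(normally distributed listener scores, chosen purely for illustration): the
average response of a small panel wanders much further from the true
population value than that of a large one.

# Toy illustration of the panel-size point: small panels give a much
# shakier estimate of the population's true response than large ones.
import random

random.seed(1)
TRUE_MEAN, SD = 0.0, 1.0

def panel_mean(size):
    return sum(random.gauss(TRUE_MEAN, SD) for _ in range(size)) / size

for size in (5, 15, 100):
    means = [panel_mean(size) for _ in range(2000)]
    rms_error = (sum(m * m for m in means) / len(means)) ** 0.5
    print(size, round(rms_error, 3))   # shrinks roughly as 1/sqrt(size)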

> So we have unknown variables.

Yes, always. Given appropriate controls and statistics, the
effects of most are accounted for in the analysis.

> Further, the use of many amps
> in no obvious pattern for comparisons introduces another variable that could
> profoundly affect the bell curve in unknown ways if some amps sound like each
> other and some don't.

Well, not having the article, I can't comment one way or another.

> All of this leaves the interpretation of those results
> wide open.

Well, no, as stated previously. It may, however, leave the
*question* of audibility open to further study.

<snip>

> >You should read more scientific literature then. It is the most
> >commonly used confidence level IME. Especially with limited
> >population size.
> >
>
> I will look into it, but I will be quite surprised if that number does not
> heavily depend on the sample sizes. It has to vary with sample size.

Not usually. It varies with the criticality of the results. The
confidence one has in extrapolating the results to the general
population, irrespective of CI, increases as a function of sample
size.

<snip>

> >Nonsense, of course they do. The fact that there are two-tailed
> >bell curves, in most population responses, is the genesis for use
> >of high confidence intervals. Because there *are* tails - way out
> >on the edges of the population, allowance for these tails must be
> >made when comparing populations for significant heterogeneity.
> >
>
> Maybe you didn't get what I was trying to say. What does and does not fall
> within the predictions of a bell curve depends heavily on the number of
> samples.

You don't seem to understand what a Bell curve is. ALL data is in
the curve, so I don't know what you're trying to say by "does and
does not fall within the predictions of a bell curve". Bell curves
don't predict anything, they merely incorporate all data points,
and the shape is the result of a two-tailed normal distribution.
You can, with a normal population (your Bell curve) predict the %
of the population that fits within any X standard deviations from
the mean. That's what we're doing with CI's, we're *assuming* a
normal distribution - a requirement for statistics to work
(unfortunately).
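
For reference, that kind of prediction (the share of a normal population
lying within k standard deviations of the mean) is a one-liner:

# Fraction of a normal ("bell curve") population within k standard
# deviations of the mean -- the prediction referred to above.
from math import erf, sqrt

for k in (1, 2, 3):
    print(k, round(erf(k / sqrt(2)), 3))   # 0.683, 0.954, 0.997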

<snip>

> So you are drawing definitive conclusions from one test without even knowing
> the sample size?

Only from the data you presented (as I've said, I don't have the
article). You picked individual results (such as the 30/48 vs
baseline of 24/48) as indicative of a significant outcome. I've
shown you, through the most common statistical tool used, that
that 30/48 and 24/48 are not significantly different.

Based on the data you provided, the conclusion of not rejecting
the null hypothesis seems perfectly sound.

> I think you are leaping without looking. You are entitled to
> your opinions. I don't find your arguments convincing so far.

Maybe you need a better understanding of statistical analysis, and
its limitations.

> Where do I claim definitive results are required in support of some postulate?

Everywhere, as far as I can tell.

> I would say definitive results are required for one to make claims of
> definitive results.

The definitive result of failing to reject the null hypothesis, to
a chosen confidence level, requires no more data than was
apparently presented in the article. You seem to be confusing "no
difference was shown at the x.xx level" with "there are no
differences". The latter can never be said based on failure to
reject the null.

> Certainly such a test, as a part of a body of evidence, could be seen as
> supportive, but even that I think is dodgy, given that some of the results as
> far as I can see would call for further testing, and the absence of listener
> sensitivity testing alone makes it impossible to make any specific claims
> about the results. I

As long as the panel sensitivity is representative of the general
population, sensitivity is irrelevant. No matter the test
protocol, one *must* assume the sensitivities are representative,
unless the sensitivity of the whole population is known, and a
panel is then chosen that represents a normal distribution
congruent with the overall population. And this never happens.

> never said the test was not valid due to its failure to reject the null. I
> simply said I think the results are such that they call for further
> investigation and are on the border between a null and a positive. Not the
> sort of thing one can base definitive conclusions on.

It is clearly definitive for the panel, under the conditions of
the test.

> >
> >It's instructive to also note that peer reviewed journals publish,
> >not infrequently, two or more studies that contradict one another.
> >You seem to believe this can't be the case, because the "wrong"
> >ones would be sent back for "correction".
>
> Nonsense. What I think will be sent back is a poor analysis.

Which you've yet to illustrate in this case.

> If the analysis
> fits the data it won't be sent back. If someone is making claims of definitive
> conclusions based on this test one is jumping the gun. Scientific research
> papers with huge sample sizes and apparently definitive results are usually
> carefully worded to not make definitive claims. That is a part of proper
> scientific prudence.

True. Does this article say "there are no sonic differences
between amps", or does it say (paraphrasing of course)
"differences between amps couldn't be demonstrated"? I understood
it to be the latter.

> Kieht said
<snip>
> Kieth said
<snip>

> I never suggested the data should be ignored or was valueless, only that it was
> inconclusive. I think the many uncontrolled variables in this specific test are
> problematic. Don't you?

I don't have the article, so I don't know how many variables were
uncontrolled. The sensitivity issue is, I believe, a red herring
however.

BTW, if you're going to repeatedly use my name in replying, common
courtesy would be to spell it right - at least once.

Keith Hughes