#1
Just so that we don't misunderstand exactly what is happening during one of these comparative tests: note the implementation of ABC/Hr being used: http://ff123.net/abchr/abchr.html

The reference is not being presented as another codec. In the process of rating each codec, the listener must choose from the hidden reference. This is different from MUSHRA, where there is a hidden reference, but which is typically presented as another codec. I think we're both on the same page when talking about ABC/Hr, but I just want to make sure.

When I speak of a bias, I am mostly concerned that one codec is not falsely preferred over another. In the ABC/Hr implementation we're discussing, suppose a listener downrates the hidden reference for one or more codecs. Such a listener is not an *outlier* -- he is clearly *incorrect*. Then, the procedure used would be to discard that particular listener's entire set of results for all codecs.

After checking the test description, I see that the test has a built-in assumption that the reference is best, and asks the subject to rate the codec in terms of the amount of "annoying degradation". If a subject is consistently rating a hidden reference as degraded under those conditions, I would think that what's happening is this: the subject hears that particular codec as euphonic. That doesn't fit the test assumption, and there isn't any way to rate the codec as better than the reference. The easy way out of that dilemma, rather than thinking about it enough to recognize what has happened, is to assume that the reference is the better one, as the tester says it should be, and that if the subject thinks otherwise they must have misidentified it. So they pick the better one as the reference and report degradation on the other one.

If you discard the data from the people who did this, rather than recognizing what it means, wouldn't that create a severe bias in this testing procedure against any codec that sounds euphonic to some people?

Bob
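For illustration, a rough Python sketch of the discard rule described above. The data layout and the 1.0-5.0 rating scale are assumptions made for the example, not details taken from the ABC/Hr tool linked above.

    # Each trial yields two ratings on an assumed 1.0-5.0 scale
    # (5.0 = no audible degradation): one for the hidden reference,
    # one for the codec under test.

    def listener_is_discarded(trials):
        """trials: list of (reference_rating, codec_rating) pairs for one listener.

        Under the procedure described above, a listener who rates the hidden
        reference as more degraded than the codec on any trial is treated as
        having misidentified the reference, and their *entire* result set,
        for all codecs, is discarded.
        """
        return any(ref < codec for ref, codec in trials)

    # Hypothetical listener who found one codec euphonic and downrated the
    # reference once; all of their results, even for the other codecs, go.
    listener = [(5.0, 3.2), (4.1, 4.8), (5.0, 2.9)]
    print(listener_is_discarded(listener))  # True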
#2
Bob wrote in message:

> After checking the test description, I see that the test has a built-in assumption that the reference is best, and asks the subject to rate the codec in terms of the amount of "annoying degradation".

I think it is safe to say that the ABC/hr test is based on the widely-accepted concept of "sonic accuracy".

> If a subject is consistently rating a hidden reference as degraded under those conditions, I would think that what's happening is this: the subject hears that particular codec as euphonic. That doesn't fit the test assumption, and there isn't any way to rate the codec as better than the reference. The easy way out of that dilemma, rather than thinking about it enough to recognize what has happened, is to assume that the reference is the better one, as the tester says it should be, and that if the subject thinks otherwise they must have misidentified it. So they pick the better one as the reference and report degradation on the other one.

Within the context of the concept of sonic accuracy, there is no way that anything can sound better than perfect reproduction of the reference.

> If you discard the data from the people who did this, rather than recognizing what it means, wouldn't that create a severe bias in this testing procedure against any codec that sounds euphonic to some people?

Within the context of the sonic accuracy model, there is no such thing as euphonic coloration. All audible changes represent inaccurate reproduction, which is undesirable. This is reasonable because there is no universal agreement about what constitutes euphonic coloration.
#3
I said:

Wrong. In the Dave Clark test, listener #2 got 30/48 correct, with a statistical reliability of hearing a difference of 94%. Listener #6 got 26/48, with a statistical probability of 84% of hearing differences. Listener #15 got 15/21 correct, with an 81% chance of hearing a difference.

Keith hughs said:

I'd like to know how these probabilities were calculated. For example, in the case where the listener heard a difference in 30 of 48 trials, one can do a simple F-test between populations (i.e. 30/48 versus 24/48 - the expectation if results are due to random guessing). For this example, F-critical is 1.623, with F-calculated of 1.067, i.e. the population variances are not different at the .05 level. One can then compare the means of the populations using a simple t-test for populations with equal variances. The t-critical is then 1.99 - two-tailed, or 1.66 - one-tailed, with t-calculated = 1.23. Thus the means (i.e. RESULTS of the listener vs. random guessing) are not statistically different at the 0.05 (95% probability) level. Nor are they significant at the 0.10 (90% confidence) level.

I got the numbers from the article.

Bob said:

I don't recall this article, but this conclusion seems to be well supported by the data you cite. If the best performance of the group wasn't statistically significant at a 95% confidence level, then it's perfectly reasonable to say that no listener was able to identify the amps in the test. (Note: saying they couldn't is not the same as saying they can't. As I noted above, we can never say definitively that they can't; we can only surmise from their--and everybody else's--inability to do so.)

I said:

That is ridiculous. If all of them scored 94%, would it be reasonable to say this?

Kieth said:

I believe Bob is talking about 95% confidence interval, *not* 95% scores.

Yes, I know.

Kieth said:

And yes, it is very common to require a 95% confidence level.

How often are 94% confidence results regarded as a null when one is seeking 95% confidence? That is, in fact, a subjective choice. Bottom line is that the tests were inconclusive in this particular case.

I said:

No. It all depends on how they fall into the bell curve. But even this is problematic for two reasons. 1. The listeners were never tested for sensitivity to subtle differences. The abilities of the participants will profoundly affect any bell curve.

Kieth said:

No, the variance in those abilities is the *cause* of the bell curve.

I was using the predicted bell curve outcome one would get if there are no audible differences. That bell curve is dictated by the number of samples.

Kieth said:

That's why sample size (panel size in this context) is so important, and why the weight given to one or two individuals' performances, in any test, must be limited.

I took that into consideration when I claimed the test results were inconclusive. That is why I suggested that the thing that should have been done was further testing of those individuals and the equipment that scored near or beyond what the bell curve predicts. That is why I don't run around claiming this test proves that some people hear differences. Let's also not forget that we have no tests on listener or system sensitivity to subtle audible differences. So we have unknown variables. Further, the use of many amps in no obvious pattern for comparisons introduces another variable that could profoundly affect the bell curve in unknown ways if some amps sound like each other and some don't. All of this leaves the interpretation of those results wide open.
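One plausible way to arrive at a figure like 94% for 30 correct out of 48 is a one-sided exact binomial test against chance guessing; a minimal Python sketch follows. This is only an assumption about how such a number could be produced, not a statement of the method the article actually used.

    from math import comb

    def binomial_tail(hits, trials, p=0.5):
        """Exact probability of getting at least `hits` correct out of
        `trials` purely by guessing (each trial an independent 50/50 call)."""
        return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
                   for k in range(hits, trials + 1))

    tail = binomial_tail(30, 48)
    print(f"P(>= 30 correct of 48 by chance) = {tail:.4f}")            # about 0.056
    print(f"implied confidence of a real difference = {1 - tail:.1%}")  # about 94%
    # Just short of the 95% level (p < 0.05) that the thread treats as the
    # conventional criterion.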
Kieth said:

Also why such high criteria, such as a 95% CI, are applied. Having a limited population, one cannot always assume a bell curve, making outliers more difficult to identify.

Sorry, but this is in some ways an arbitrary number. If a dozen people took such tests and half of them scored between 90% and 94% confidence, then this 95% limit would be an unreasonable one. OTOH, if a hundred people took the test and one scored at 95% confidence, I think you could argue that this does fall within the predicted bell curve of a null result. As it is, as far as I can tell, some of the results of the Clark test are close enough to the edge of the bell curve prediction, or beyond it, to warrant further investigation and to preclude anyone from drawing definitive conclusions. Statistical analysis is not my strength in math, but that is how it looks to me.

I said:

2. Many different amps were used. We have no way of knowing that we didn't have a mix of some amps sounding different and some sounding the same. The Counterpoint amp was not only identified with a probability of 94%; given there were 8 different combinations, it fell outside the predicted bell curve, if my math is right.

Kieth said:

What math did you use? Not having the data you're using to hand, I'm curious how you calculated such a high probability.

I used the numbers given in the article. The article gave a 94% confidence level on the Counterpoint amp compared to an NAD amp. There were a total of 8 comparisons between different amps. I was rounding, by the way. The number is actually 94.4%.

I said:

Bottom line is that you cannot draw definitive conclusions either way. If the one listener had made one more correct ID he would have been well above 94%. I doubt that one can simply make 95% probability a barrier of truth.

Kieth said:

You should read more scientific literature then. It is the most commonly used confidence level IME. Especially with limited population size.

I will look into it, but I will be quite surprised if that number does not heavily depend on the sample sizes. It has to vary with sample size.

I said:

It is ridiculous. Bell curves don't work that way.

Kieth said:

Nonsense, of course they do. The fact that there are two-tailed bell curves, in most population responses, is the genesis for use of high confidence intervals. Because there *are* tails - way out on the edges of the population - allowance for these tails must be made when comparing populations for significant heterogeneity.

Maybe you didn't get what I was trying to say. What does and does not fall within the predictions of a bell curve depends heavily on the number of samples.

I said:

No foot stamping is needed. The test was quite inconclusive.

Kieth said:

From what's been presented here, I'd say it was quite conclusive. The data do not support rejecting the null hypothesis (i.e. that the amps sound the same), for the test population, under the test protocol and conditions used. Not knowing the population size, or protocol used, I wouldn't venture an opinion on whether it may or may not be applicable to the broader population.

So you are drawing definitive conclusions from one test without even knowing the sample size? I think you are leaping without looking. You are entitled to your opinions. I don't find your arguments convincing so far.

I said:

Had the test been put in front of a scientific peer review with the same data and the same conclusions, that panel would have sent it back for corrections. The analysis was wrong, scientifically speaking.
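Two of the numeric claims in this exchange can be checked directly with an exact binomial calculation. A small Python sketch; the independence of the 8 amp comparisons is my assumption for the illustration, not something stated in the article.

    from math import comb

    def chance_of_at_least(hits, trials):
        """P(at least `hits` correct out of `trials` by 50/50 guessing)."""
        return sum(comb(trials, k) for k in range(hits, trials + 1)) / 2**trials

    # Claim checked: "If the one listener had made one more correct ID he
    # would have been well above 94%."
    for hits in (30, 31):
        conf = 1 - chance_of_at_least(hits, 48)
        print(f"{hits}/48 correct -> confidence {conf:.1%}")
    # 30/48 is roughly 94%; 31/48 is roughly 97%, so a single trial does move
    # the result across the conventional 95% line.

    # Separate check: with 8 different amp comparisons, how often would at
    # least one of them reach ~94% confidence by luck alone, assuming the
    # comparisons are independent?
    p_single = chance_of_at_least(30, 48)        # about 0.056 per comparison
    p_any = 1 - (1 - p_single) ** 8
    print(f"Chance of at least one lucky ~94% result in 8 comparisons: {p_any:.0%}")
    # Comes out above one in three, which bears on whether a single such
    # result "falls out of the predicted bell curve".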
Kieth said:

Doesn't appear to be, based on what's been posted here. You appear to think that definitive results are required, in support of some postulate, for the conclusion to be valid (or acceptable for a review board). This is simply incorrect. Oftentimes, the data obtained from a test are not useful, or positive, relative to supporting a postulate, but that does not invalidate the data. Clearly, failure to reject the null hypothesis, in any experiment, does not invalidate the test, nor is it something requiring "correction". It is merely part of the database that the author, and others, build on in future research.

Where do I claim definitive results are required in support of some postulate? I would say definitive results are required for one to make claims of definitive results. Certainly such a test, as a part of a body of evidence, could be seen as supportive, but even that I think is dodgy, given that some of the results, as far as I can see, would call for further testing, and the absence of listener sensitivity testing alone makes it impossible to make any specific claims about the results. I never said the test was not valid due to its failure to reject the null. I simply said I think the results are such that they call for further investigation and are on the border between a null and a positive. Not the sort of thing one can base definitive conclusions on.

Kieth said:

It's instructive to also note that peer reviewed journals publish, not infrequently, two or more studies that contradict one another. You seem to believe this can't be the case, because the "wrong" ones would be sent back for "correction".

Nonsense. What I think will be sent back is a poor analysis. If the analysis fits the data, it won't be sent back. If someone is making claims of definitive conclusions based on this test, one is jumping the gun. Scientific research papers with huge sample sizes and apparently definitive results are usually carefully worded to not make definitive claims. That is a part of proper scientific prudence.

Kieht said:

ones would be sent back for "correction". A look through some select peer reviewed journals, such as the "Journal of Pharmaceutical Science and Technology", will show that to be mistaken.

Really? I'd like to see any peer reviewed published studies that make definitive claims based on sample sizes of 25 participants, especially when one had a 94% confidence score.

Kieth said:

Often in the case of biological studies, conflicting data are obtained, often due to unknown (at the time) and/or uncontrolled variables (often inherent to the specific population under study). These data, while contradictory, are still of great value for future studies.

I never suggested the data should be ignored or were valueless, only that they were inconclusive. I think the many uncontrolled variables in this specific test are problematic. Don't you?
#5
S888Wheel wrote:
snip

> I said: That is ridiculous. If all of them scored 94%, would it be reasonable to say this?
>
> Kieth said: I believe Bob is talking about 95% confidence interval, *not* 95% scores.
>
> Yes, I know.

Then talking about "scoring 94%" is meaningless.

> Kieth said: And yes, it is very common to require a 95% confidence level.
>
> How often are 94% confidence results regarded as a null when one is seeking 95% confidence? That is, in fact, a subjective choice. Bottom line is that the tests were inconclusive in this particular case.

You clearly don't understand statistical analysis. There is a null hypothesis that states, basically, that the populations (in this case, say the sound of Amp A and the sound of Amp B) are not different. You set your confidence level *prior* to your analysis, and yes, 95% is ubiquitous, and you either meet that level or you don't. There is no gray area. Irrespective of how "close" you may get to reaching the confidence level, you're either in, or out. When out, the results *ARE* conclusive - always. That conclusion is "you cannot reject the null hypothesis at the 0.05 level", *by definition*. Period. You cannot say, for example, "well, it was close, so they're probably different". You *can* calculate the confidence level within which you *can* reject the null hypothesis, if you want (it's approximately 78%, by my calculations, for a result of 30/48 versus baseline expectations of 24/48). A number that would be considered insignificant by all statisticians I've worked with. But again, by expanding the CI, you risk erroneously rejecting the null hypothesis unless the population is very large (i.e. sufficiently so to approximate a two-tailed bell curve).
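The "approximately 78%" figure above can be reproduced with the two-sample procedure described earlier in the thread (a variance-ratio check, then a pooled equal-variance t-test of 30/48 correct against the 24/48 chance expectation). A short Python sketch using SciPy, treating each trial as a 0/1 outcome; this mirrors that description and is not a calculation taken from the article itself.

    import numpy as np
    from scipy import stats

    # The listener's 48 trials as 0/1 outcomes (30 correct), versus the 24/48
    # expectation under pure guessing, treated as a second "population".
    listener = np.array([1] * 30 + [0] * 18)
    guessing = np.array([1] * 24 + [0] * 24)

    # Variance ratio: the sample variances are close, so a pooled
    # (equal-variance) t-test is the appropriate comparison of the means.
    f_calc = guessing.var(ddof=1) / listener.var(ddof=1)
    print(f"F-calculated ~= {f_calc:.3f}")                        # about 1.067

    t_calc, p_two_tailed = stats.ttest_ind(listener, guessing, equal_var=True)
    print(f"t-calculated ~= {t_calc:.2f}")                        # about 1.23
    print(f"two-tailed confidence ~= {1 - p_two_tailed:.0%}")     # about 78%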
snip

> No, the variance in those abilities is the *cause* of the bell curve.
>
> I was using the predicted bell curve outcome one would get if there are no audible differences. That bell curve is dictated by the number of samples.

The "Bell" curve is dictated by varying response levels. The degree to which the specific curve approximates a normal distribution *is* a function of population size (or number of samples, if you prefer). Hence the need for a very high CI when using small populations (like 15 panelists, for example), as you do *not* have a high degree of confidence that the population has a normal distribution.

> Kieth said: That's why sample size (panel size in this context) is so important, and why the weight given to one or two individuals' performances, in any test, must be limited.
>
> I took that into consideration when I claimed the test results were inconclusive.

Well, again, I think you're missing the common usage of "conclusive" relative to statistical analysis. As stated previously, there are only two outcomes of statistical analysis (i.e. typical ANOVA): reject or accept the null hypothesis. Either is conclusive, as this one apparently was. The data don't allow you to reject the null hypothesis at even the 0.1 level. So, you cannot say a difference was shown. You can't say there was no difference either. This appears to be the genesis of your "inconclusive" appellation. But this is an incorrect interpretation, as detailed above.

> That is why I suggested that the thing that should have been done was further testing of those individuals and the equipment that scored near or beyond what the bell curve predicts. That is why I don't run around claiming this test proves that some people hear differences. Let's also not forget that we have no tests on listener or system sensitivity to subtle audible differences.

Actually, we *assume* variation in listener abilities and responses. Without such, there would be no bell curve, and statistical analysis would be impossible. The *only* requirement is that the test population is sufficiently large such that they approximate the population as a whole, relative to response to the stimuli under test. Again, the smaller the test population, the tighter the CI *must* be, due to the lower confidence in the test population having the same distribution as the total population.

> So we have unknown variables.

Yes, always. Given appropriate controls and statistics, the effects of most are accounted for in the analysis.

> Further, the use of many amps in no obvious pattern for comparisons introduces another variable that could profoundly affect the bell curve in unknown ways if some amps sound like each other and some don't.

Well, not having the article, I can't comment one way or another.

> All of this leaves the interpretation of those results wide open.

Well, no, as stated previously. It may, however, leave the *question* of audibility open to further study.

snip

> You should read more scientific literature then. It is the most commonly used confidence level IME. Especially with limited population size.
>
> I will look into it, but I will be quite surprised if that number does not heavily depend on the sample sizes. It has to vary with sample size.

Not usually. It varies with the criticality of the results. The confidence one has in extrapolating the results to the general population, irrespective of CI, increases as a function of sample size.
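The per-listener sample-size point can be made concrete with a small Python sketch. The 60% "true" hit rate below is an assumed number chosen purely for illustration; nothing in the article or this thread states any listener's real sensitivity.

    from math import comb

    def chance_of_at_least(hits, trials):
        """P(at least `hits` correct out of `trials` by 50/50 guessing)."""
        return sum(comb(trials, k) for k in range(hits, trials + 1)) / 2**trials

    # Illustration: a listener who truly hears a difference on 60% of trials.
    # How many trials before a typical 60% score rejects guessing at p < 0.05?
    true_rate = 0.60
    for n in (16, 48, 100, 200):
        typical_hits = round(true_rate * n)
        p = chance_of_at_least(typical_hits, n)
        verdict = "significant" if p < 0.05 else "not significant"
        print(f"{typical_hits:3d}/{n:3d} correct: p = {p:.3f} ({verdict})")
    # With few trials, even a genuinely sensitive listener is indistinguishable
    # from guessing; the same listener shows up clearly once enough trials are run.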
snip

> Nonsense, of course they do. The fact that there are two-tailed bell curves, in most population responses, is the genesis for use of high confidence intervals. Because there *are* tails - way out on the edges of the population - allowance for these tails must be made when comparing populations for significant heterogeneity.
>
> Maybe you didn't get what I was trying to say. What does and does not fall within the predictions of a bell curve depends heavily on the number of samples.

You don't seem to understand what a Bell curve is. ALL data is in the curve, so I don't know what you're trying to say by "does and does not fall within the predictions of a bell curve". Bell curves don't predict anything, they merely incorporate all data points, and the shape is the result of a two-tailed normal distribution. You can, with a normal population (your Bell curve), predict the % of the population that fits within any X standard deviations from the mean. That's what we're doing with CIs: we're *assuming* a normal distribution - a requirement for statistics to work (unfortunately).

snip

> So you are drawing definitive conclusions from one test without even knowing the sample size?

Only from the data you presented (as I've said, I don't have the article). You picked individual results (such as the 30/48 vs baseline of 24/48) as indicative of a significant outcome. I've shown you, through the most common statistical tool used, that the 30/48 and 24/48 results are not significantly different. Based on the data you provided, the conclusion of not rejecting the null hypothesis seems perfectly sound.

> I think you are leaping without looking. You are entitled to your opinions. I don't find your arguments convincing so far.

Maybe you need a better understanding of statistical analysis, and its limitations.

> Where do I claim definitive results are required in support of some postulate?

Everywhere, as far as I can tell.

> I would say definitive results are required for one to make claims of definitive results.

The definitive result of failing to reject the null hypothesis, to a chosen confidence level, requires no more data than was apparently presented in the article. You seem to be confusing "no difference was shown at the x.xx level" with "there are no differences". The latter can never be said based on failure to reject the null.

> Certainly such a test, as a part of a body of evidence, could be seen as supportive, but even that I think is dodgy, given that some of the results, as far as I can see, would call for further testing, and the absence of listener sensitivity testing alone makes it impossible to make any specific claims about the results.

As long as the panel sensitivity is representative of the general population, sensitivity is irrelevant. No matter the test protocol, one *must* assume the sensitivities are representative, unless the sensitivity of the whole population is known, and a panel is then chosen that represents a normal distribution congruent with the overall population. And this never happens.

> I never said the test was not valid due to its failure to reject the null. I simply said I think the results are such that they call for further investigation and are on the border between a null and a positive. Not the sort of thing one can base definitive conclusions on.

It is clearly definitive for the panel, under the conditions of the test.

> It's instructive to also note that peer reviewed journals publish, not infrequently, two or more studies that contradict one another. You seem to believe this can't be the case, because the "wrong" ones would be sent back for "correction".
>
> Nonsense. What I think will be sent back is a poor analysis.

Which you've yet to illustrate in this case.

> If the analysis fits the data, it won't be sent back. If someone is making claims of definitive conclusions based on this test, one is jumping the gun. Scientific research papers with huge sample sizes and apparently definitive results are usually carefully worded to not make definitive claims. That is a part of proper scientific prudence.

True. Does this article say "there are no sonic differences between amps", or does it say (paraphrasing, of course) "differences between amps couldn't be demonstrated"? I understood it to be the latter.

> Kieht said: snip
>
> Kieth said: snip
>
> I never suggested the data should be ignored or were valueless, only that they were inconclusive. I think the many uncontrolled variables in this specific test are problematic. Don't you?

I don't have the article, so I don't know how many variables were uncontrolled. The sensitivity issue is, I believe, a red herring however. BTW, if you're going to repeatedly use my name in replying, common courtesy would be to spell it right - at least once.

Keith Hughes