#1
Blindtest question
Is there any published DBT of amps, CD players or cables where the number of trials is greater than 500? If the difference is miniscule, it is likely that many "guesses" are wrong, and it would require many trials to reveal any subtle difference.

Thomas
#2
Blindtest question
Thomas A wrote:
> Is there any published DBT of amps, CD players or cables where the
> number of trials is greater than 500? If the difference is miniscule,
> it is likely that many "guesses" are wrong, and it would require many
> trials to reveal any subtle difference.

There are published tests where people claimed they could hear the difference sighted, but when they were 'blinded' they could not. In this case the argument that 500 trials are needed would seem to be weak. However, a real and miniscule difference would certainly be discerned more reliably if there were specific training to hear it beforehand.

--
-S.
#3
Blindtest question
Steven Sullivan wrote in message news:d_XUa.142496$GL4.36308@rwcrnsc53...
> Thomas A wrote:
> > Is there any published DBT of amps, CD players or cables where the
> > number of trials is greater than 500? If the difference is miniscule,
> > it is likely that many "guesses" are wrong, and it would require many
> > trials to reveal any subtle difference.
>
> There are published tests where people claimed they could hear the
> difference sighted, but when they were 'blinded' they could not. In this
> case the argument that 500 trials are needed would seem to be weak.

Yes, that's for sure. But how are scientific tests of just noticeable difference set up? A difference, when very small, could introduce more incorrect answers from the test subjects. Thus I think the question is interesting.

> However, a real and miniscule difference would certainly be discerned
> more reliably if there were specific training to hear it beforehand.

Yes, but still, if the difference is real and miniscule it could introduce incorrect answers even if there is specific training beforehand. If it were an all-or-nothing thing, then the result would always be 100% correct (difference) or 50% (no difference). What if the answers are 60% correct?
#4
Blindtest question
#6
Blindtest question
Thomas A wrote:
> Steven Sullivan wrote in message news:d_XUa.142496$GL4.36308@rwcrnsc53...
> > Thomas A wrote:
> > > [snip]
> >
> > There are published tests where people claimed they could hear the
> > difference sighted, but when they were 'blinded' they could not. In
> > this case the argument that 500 trials are needed would seem to be weak.
>
> Yes, that's for sure. But how are scientific tests of just noticeable
> difference set up? A difference, when very small, could introduce more
> incorrect answers from the test subjects. Thus I think the question is
> interesting.
>
> > However, a real and miniscule difference would certainly be discerned
> > more reliably if there were specific training to hear it beforehand.
>
> Yes, but still, if the difference is real and miniscule it could
> introduce incorrect answers even if there is specific training
> beforehand. If it were an all-or-nothing thing, then the result would
> always be 100% correct (difference) or 50% (no difference). What if the
> answers are 60% correct?

What level of certitude are you looking for? Scientists use statistical tools to calculate probabilities of different kinds of error in such cases.

--
-S.
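The "different kinds of error" can be shown concretely: fixing a pass/fail threshold pins down the Type I (false positive) rate, and the Type II rate then depends on how good the listener really is. A Python sketch, with an assumed 50-trial test and a hypothetical listener who is right 60% of the time:

```python
from math import comb

def tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

n = 50
# Smallest passing score whose false-positive rate is at most 5%:
threshold = min(k for k in range(n + 1) if tail(n, k, 0.5) <= 0.05)

alpha = tail(n, threshold, 0.5)   # Type I error: a pure guesser "passes"
power = tail(n, threshold, 0.6)   # chance a true 60% listener passes
beta = 1 - power                  # Type II error: a real difference is missed
```

Even with 50 trials, a listener who really does hear the difference 60% of the time fails this test roughly two times in three, which is why a modest test can report "no difference" without settling much.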
#7
Blindtest question
Steven Sullivan wrote in message news:UqnVa.4003$Oz4.1480@rwcrnsc54...
> Thomas A wrote:
> > [snip]
> > If it were an all-or-nothing thing, then the result would always be
> > 100% correct (difference) or 50% (no difference). What if the answers
> > are 60% correct?
>
> What level of certitude are you looking for? Scientists use statistical
> tools to calculate probabilities of different kinds of error in such
> cases.

Well, confidence limits of 95% or 99% are usually applied. The power of the test is, however, important when you approach the audible limit. Also, with sample sizes over 200 you need not use the correction for continuity in the statistical calculation. I am not sure, but I think this correction applies when sample sizes are 25-200. Below 25, this correction is not sufficient.
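The continuity correction being discussed can be compared directly against the exact binomial tail. A Python sketch (the 32-of-50 score is just an illustrative mid-sized case, not a figure from the thread):

```python
from math import comb, erf, sqrt

def exact_tail(n: int, k: int) -> float:
    """Exact P(X >= k) under guessing (p = 0.5)."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

def normal_tail(n: int, k: int, continuity: bool = True) -> float:
    """Normal approximation to the same tail, with or without the
    correction for continuity (k - 0.5 in place of k)."""
    mu, sigma = n / 2, sqrt(n) / 2
    x = k - 0.5 if continuity else k
    z = (x - mu) / sigma
    return 0.5 * (1 - erf(z / sqrt(2)))

p_exact = exact_tail(50, 32)
p_corrected = normal_tail(50, 32, continuity=True)
p_uncorrected = normal_tail(50, 32, continuity=False)
```

In this middle range the corrected approximation tracks the exact tail closely, while dropping the correction overstates significance; for very small samples neither approximation is trustworthy and the exact tail should be used directly, consistent with the rule of thumb described in the post.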
#8
Blindtest question
"Thomas A" wrote in message
newsBUUa.141509$OZ2.27088@rwcrnsc54... Is there any published DBT of amps, CD players or cables where the number of trials are greater than 500? I've never seen one. It would be difficult to get a single subject to do that many trials. So, it would have to be many subjects and they would have to be isolated to prevent subtle influence from one to the other. If there difference is miniscule there is likely that many "guesses" are wrong and would require many trials to reveal any subtle difference? (Note that the word is spelled "minuscule.") Norm Strong |
#9
Blindtest question
#11
Blindtest question
Thomas -

Thanks for the post of the Tag McLaren test link (and to Tom for the other references). I've looked at the Tag link and suspect it's going to add to the controversy here. My comments on the test follow.

From the tone of the web info on this test, one can presume that Tag set out to show its relatively inexpensive gear was just as good as some acknowledged industry standards. But... wonder why Tag chose the 99% confidence level, being careful *not* to say that it was prechosen in advance? It is because had they used the more common and almost universally-used 95% level it would have shown that:

* When cable A was the "X" it was recognized at a significant level by the panel (and guess whose cable probably would "lose" in a preference test versus a universally recognized standard of excellence chosen as "tops" by both Stereophile and TAS, as well as by other industry publishers)
* One individual differentiated both cable A and the combined cables at the significant level

Results summarized as follows (the 99% and 95% columns are the scores needed to pass at that confidence level):

Tag McLaren Published ABX Results

Test            Total    99%    95%    Actual    Confidence
Cables   A        96      60    53 e     52       94.8% e
         B        84      54    48 e     38       coin toss
         Both    180     107    97 e     90       coin toss
Amps     A        96      60    53 e     47       coin toss
         B        84      54    48 e     38       coin toss
         Both    180     107    97 e     85       coin toss

Top Individuals
Cables   A         8       8     7        6       94.5%
         B         7       7     7        5       83.6%
         Both     15      13    11       11       95.8%
Amps     A         8       8     7        5       83.6%
         B         7       7     7        5       83.6%
         Both     15      13    11       10       90.8%

e = extrapolated based on scores for 100 and 50 sample size

In general, the test, while seemingly objective, has more negatives than positives when measured against the consensus of the objectivists (and some subjectivists) in this group as to what constitutes a good ABX test:

TEST POSITIVES
* double blind
* level matched

TEST NEGATIVES
* short snippets
* no user control over switching and (apparently) no repeats
* no user control over content
* group test, no safeguards against visual interaction
* no group selection criteria apparent and no pre-training or testing

The results and the summary of positives/negatives above raise some interesting questions:

* Why, for example, should one cable be significantly identified when "X" and the other fail miserably to be identified? This has to be due to an interaction between the characteristics of the music samples chosen and the characteristics of the cables under test, perhaps aggravated by the use of short snippets with an inadequate time frame to establish the proper evaluation context. Did the test itself create the overall null, where people could not differentiate, based solely on the test not favoring B as much as A?
* Do the differences in people scoring high on the two tests support the idea that different people react to different attributes of the DUTs? Or does it again suggest some interaction between the music chosen, the characteristics of the individual pieces, and perhaps the evaluation time frame?
* Or is it possible that the ABX test itself, when used with short snippets, makes some kinds of differences more apparent and others less apparent, and thus, by working against exposing *all* kinds of differences, helps create more *no differences* than should be the result?
* Since the panel is not identified and there was no training, do the results suggest a "dumbing down" of differentiation from the scores of the more able listeners?

I am sure it will be suggested that the two different high scorers were simply random outliers... I'm not so sure, especially since the individual scoring high on the cable test hears the cable differences exactly like the general sample but at a higher level (required because of the smaller sample size), and the high scorer on the amp test is in much the same position.

If some of these arguments sound familiar, they certainly raise echoes of the issues raised here by subjectivists over the years... and yet these specifics are rooted in the results of this one test. I'd like to hear other views on this test.

"Thomas A" wrote in message news:ahwVa.6957$cF.2308@rwcrnsc53...
> (Nousaine) wrote in message ...
> > (Thomas A) wrote:
> > > [snip]
> >
> > With regard to amplifiers, as of May 1990 there had been such tests.
> > In 1978 QUAD published an experiment with 576 trials. In 1980 Smith,
> > Peterson and Jackson published an experiment with 1104 trials; in 1989
> > Stereophile published a 3530-trial comparison. In 1986 Clark & Masters
> > published an experiment with 772 trials. All were null.
> >
> > There's a misconception that blind tests tend to have very small
> > sample sizes. As of 1990 the 23 published amplifier experiments had a
> > mean average of 426 and a median of 90 trials. If we exclude the
> > 3530-trial experiment the mean becomes 285 trials. The median remains
> > unchanged.
>
> Ok thanks. Is it possible to get the numbers for each test? I would
> like to see if it is possible to do a meta-analysis in the amplifier
> case. The test by tagmclaren is an additional one:
>
> http://www.tagmclaren.com/members/news/news77.asp
>
> Thomas
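The "Confidence" column in the table above can be recomputed from the exact binomial tail rather than by extrapolation. A Python check (not part of the original post):

```python
from math import comb

def confidence(correct: int, trials: int) -> float:
    """1 - P(X >= correct) under guessing: the exact-binomial analogue
    of the table's 'Confidence' column."""
    tail = sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials
    return 1 - tail

best_individual = confidence(11, 15)   # about 0.941 for the 11-of-15 score
panel_cable_a = confidence(52, 96)     # well below the extrapolated 94.8%
```

The exact figures come out lower than the extrapolated ones in the table (11 of 15 is about 94.1%, not 95.8%), a discrepancy taken up later in the thread.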
#12
Blindtest question
"Harry Lavo" wrote:
Thomas - Thanks for the post of the Tag Mclaren test link (and to Tom for the other references). I've looked at the Tag link and suspect it's going to add to the controversy here. Actually there's no 'controversey' here. No proponent of amp/wire-sound has ever shown that nominally competent amps or wires have any sound of their own when played back over loudspeakers. The only 'controversey' is over whether Arny Kreuger's pcabx tests cab with headphones and special programs can be extrapolated to commerically available programs and speakers in a normally reverberant environment. The Tag-M results are fully within those expected given the more than 2 dozen published experiments of amps and wires. y comments on the test follow. From the tone of the web info on this test, one can presume that Tag set out to show its relatively inexpensive gear was just as good as some acknowledged industry standards. But....wonder why Tag choose the 99% confidence level? Why not? But you can analyze it any way your want. That's the wonderful thing about published results. Being careful *not* to say that it was prechosen in advance? 
It is because had they used the more common and almost universally-used 95% level it would have shown that: * When cable A was the "X" it was recognized at a significant level by the panel (and guess whose cable probably would "lose" in a preference test versus a universally recognized standard of excellence chosen as "tops" by both Stereophile and TAS, as well as by other industry publishers) * One individual differentiated both cable A and combined cables at the significant level Results summarized as follows: Tag Mclaren Published ABX Results Sample 99% 95% Actual Confidence Total Test Cables A 96 60 53 e 52 94.8% e B 84 54 48 e 38 coin toss Both 180 107 97 e 90 coin toss Amps A 96 60 53 e 47 coin toss B 84 54 48 e 38 coin toss Both 180 107 97 e 85 coin toss Top Individuals Cables A 8 8 7 6 94.5% B 7 7 7 5 83.6% Both 15 13 11 11 95.8% Amps A 8 8 7 5 83.6% B 7 7 7 5 83.6% Both 15 13 11 10 90.8% e = extrapolated based on scores for 100 and 50 sample size In general, the test while seemingly objective has more negatives than positives when measured against the consensus of the objectivists (and some subjectivists) in this group as to what constitutes a good abx test: This is what always happens with 'bad news.' Instead of giving us contradictory evidence we get endless wishful 'data-dredging' to find any possible reason to ignore the evidence. In any other circle when one thinks the results of a given experiment are wrong they just duplicate it showing the error OR produce a valid one with contrary evidence. TEST POSITIVES *double blind *level matched TEST NEGATIVES *short snippets *no user control over switching and (apparently) no repeats *no user control over content *group test, no safeguards against visual interaction *no group selection criteria apparent and no pre-training or testing OK how many of your sighted 'tests' have ignored one or all of these positives or negatives? 
The results and the summary of positives/negatives above raise some interesting questions: No, not really. All of the true questions about bias controlled listening tests have been addressed prior. *why, for example, should one cable be significantly identified when "x" and the other fail miserably to be identified. This has to be due and interaction between the characteristics of the music samples chosen, the characteristics of the cables under test, and perhaps aggravated by the use of short snippets with an inadequate time frame to establish the proper evaluation context. Did the test itself create the overall null where people could not differentiate based soley on the test not favoring B as much as A? * do the differences in people scoring high on the two tests support the idea that different people react to different attributes of the DUT's. Or does it again suggest some interaction between the music chosen, the characteristics of the individual pieces, and perhaps the evaluation time frame. * or is it possible that the abx test itself, when used with short snippets, makes some kinds of differences more apparent and others less apparent and thus by working against exposing *all* kinds of differences help create more *no differences* than should be the result. * since the panel is not identified and there was no training, do the results suggest a "dumbing down" of differentiation from the scores of the more able listeners? I am sure it will be suggested that the two different high scorers were simply random outliers...I'm not so sure especially since the individual scoring high on the cable test hears the cable differences exactly like the general sample but at a higher level (required because of smaller sample size) and the high scorer on the amp test is in much the same position. 
if some of these arguments sound familiar, they certainly raises echoes of the issues raised here by subjectivists over the years...and yet these specifics are rooted in the results of this one test. I'd like to hear other views on this test. These results are consistent with the 2 dozen and more other bias controlled listening tests of power amplifiers and wires. "Thomas A" wrote in message news:ahwVa.6957$cF.2308@rwcrnsc53... (Nousaine) wrote in message ... (Thomas A) wrote: Is there any published DBT of amps, CD players or cables where the number of trials are greater than 500? If there difference is miniscule there is likely that many "guesses" are wrong and would require many trials to reveal any subtle difference? Thomas With regard to amplifiers as of May 1990 there had been such tests. In 1978 QUAD published an erxperiment with 576 trials. In 1980 Smith peterson and Jackson published an experiment with 1104 trials; in 1989 Stereophile published a 3530 trial comparison. In 1986 Clark & Masters published an experiment with 772 trials. All were null. There's a misconception that blind tests tend to have very small sample sizes. As of 1990 the 23 published amplifier experiments had a mean average of 426 and a median of 90 trials. If we exclude the 3530 trial experiment the mean becomes 285 trials. The median remains unchanged. Ok thanks. Is it possible to get the numbers for each test? I would like to see if it possible to do a meta-analysis in the amplifier case. The test by tagmclaren is an additional one: Thanks for the reference. |
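A meta-analysis of the kind being asked about would need the correct-response counts, which the posts do not give, but the pooling step itself is simple. A Python sketch of crude fixed-effect pooling, using the four trial totals quoted above with *hypothetical* correct counts plugged in purely for illustration:

```python
from math import erf, sqrt

def pooled_test(results):
    """Pool (correct, trials) pairs across experiments and test the
    combined score against chance (p = 0.5) with a one-sided z-test."""
    correct = sum(c for c, _ in results)
    trials = sum(n for _, n in results)
    z = (correct - trials / 2) / (sqrt(trials) / 2)
    p = 0.5 * (1 - erf(z / sqrt(2)))   # one-sided p-value
    return z, p

# Trial totals from the post (QUAD, Smith/Peterson/Jackson, Stereophile,
# Clark & Masters); the correct counts here are made up for illustration.
studies = [(290, 576), (555, 1104), (1770, 3530), (390, 772)]
z, p = pooled_test(studies)
```

With near-chance counts like these the pooled score stays far from significance, which is what the quoted null results imply; with the real counts the same two lines would give the meta-analytic answer.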
#13
Blindtest question
Nousaine wrote:
> This is what always happens with 'bad news.' Instead of giving us
> contradictory evidence we get endless wishful 'data-dredging' to find
> any possible reason to ignore the evidence. In any other circle, when
> one thinks the results of a given experiment are wrong, they just
> duplicate it showing the error OR produce a valid one with contrary
> evidence.

Not necessarily. It's quite common for questions to be raised during peer review of a scientific paper; it is then incumbent upon the *experimenter*, not the critic, to justify his or her choice of protocol, or his/her explanation of the results. Often this involves doing more experiments to address the reviewer's concerns. Sometimes it merely involves explaining the results more clearly, or in more qualified terms. If the experimenter feels the reviewer has ignored some important point, that comes out too in the reply to the reviews.

I say all this having not yet visited the link, so I'm totally unbiased ;
#14
Blindtest question
On Wed, 30 Jul 2003 03:26:24 GMT, "Harry Lavo" wrote:

> From the tone of the web info on this test, one can presume that Tag
> set out to show its relatively inexpensive gear was just as good as
> some acknowledged industry standards. But... wonder why Tag chose the
> 99% confidence level? Being careful *not* to say that it was prechosen
> in advance? It is because had they used the more common and almost
> universally-used 95% level it would have shown that:

Can anyone smell fish? Specifically, red herring?

> * When cable A was the "X" it was recognized at a significant level by
> the panel (and guess whose cable probably would "lose" in a preference
> test versus a universally recognized standard of excellence chosen as
> "tops" by both Stereophile and TAS, as well as by other industry
> publishers)

No Harry, *all* tests fell below the 95% level, except for one single participant in the cable test, which just scraped in. Given that there were 12 volunteers, there's less than 2:1 odds against this happening when tossing coins. Interesting that you also failed to note that the 'best performers' in the cable test did *not* perform well in the amplifier test, and vice versa. You do love to cherry-pick in search of your *required* result, don't you?

> * One individual differentiated both cable A and the combined cables at
> the significant level
> [table and test positives/negatives snipped]
>
> *why, for example, should one cable be significantly identified when
> "x" and the other fail miserably to be identified. This has to be due
> to an interaction between the characteristics of the music samples
> chosen and the characteristics of the cables under test, perhaps
> aggravated by the use of short snippets with an inadequate time frame
> to establish the proper evaluation context.

No it doesn't, Harry; it doesn't *have* to be due to anything but random chance.

> Did the test itself create the overall null where people could not
> differentiate based solely on the test not favoring B as much as A?
> * do the differences in people scoring high on the two tests support
> the idea that different people react to different attributes of the
> DUTs? Or does it again suggest some interaction between the music
> chosen, the characteristics of the individual pieces, and perhaps the
> evaluation time frame?

No, since the high scorers on one test were not the high scorers in the other test. It's called a distribution, Harry, and it is simply more evidence that there were in fact no audible differences - as any reasonable person would expect.

http://www.tagmclaren.com/members/news/news77.asp

--
Stewart Pinkerton | Music is Art - Audio is Engineering
#15
Blindtest question
"Stewart Pinkerton" wrote in message
news:7jKVa.10234$Oz4.4174@rwcrnsc54... On Wed, 30 Jul 2003 03:26:24 GMT, "Harry Lavo" wrote: From the tone of the web info on this test, one can presume that Tag set out to show its relatively inexpensive gear was just as good as some acknowledged industry standards. But....wonder why Tag choose the 99% confidence level? Being careful *not* to say that it was prechosen in advance? It is because had they used the more common and almost universally-used 95% level it would have shown that: Can anyone smell fish? Specifically, red herring? Are you an outlyer? Or are you simply sensitive to fish? Or did you not conceive that thought double-blind and it is just your imagination? :=) * When cable A was the "X" it was recognized at a significant level by the panel (and guess whose cable probably would "lose" in a preference test versus a universally recognized standard of excellence chosen as "tops" by both Stereophile and TAS, as well as by other industry publishers) No Harry, *all* tests fell below the 95% level, except for one single participant in the cable test, which just scraped in. Given that there were 12 volunteers, there's less than 2:1 odds against this happening when tossing coins. Interesting that you also failed to note that the 'best performers' in the cable test did *not* perform well in the amplifier test, and vice versa. I'm sorry, but when rounded to whole numbers 94.8% is a lot closer than one number higher which would be about 96% in the larger panels and 97% in the smaller panels. The standard is 95%. To say that 94.8% doesn't qualify is splitting hairs. I inclduded the actual numbers needed to pass the barrier just to satisfy the purists, but you *ARE* splitting hairs here, Stewart. You do love to cherry-pick in search of your *required* result, don't you? You mean not accepting the "received truth" without doing my own analysis is cherry picking, is that it Stewart? We are not allowed to point out anonomlies and ask "why"? "how come"? 
"what could be causing this?" And would you explain why a significant level was reached on the "A" cable test with 96 trials? Was that "cherry picking". C'mon, Stewart, you know better. In fact the real issue here is: if one cable can be so readily picked out, why can't the other be? What is it in the test, procedure, quality of the cables, order bias, or what. Something is rotten in the beloved state of ABX here! * One individual differentiated both cable A and combined cables at the significant level Results summarized as follows: Tag Mclaren Published ABX Results Sample 99% 95% Actual Confidence Total Test Cables A 96 60 53 e 52 94.8% e B 84 54 48 e 38 coin toss Both 180 107 97 e 90 coin toss Amps A 96 60 53 e 47 coin toss B 84 54 48 e 38 coin toss Both 180 107 97 e 85 coin toss Top Individuals Cables A 8 8 7 6 94.5% B 7 7 7 5 83.6% Both 15 13 11 11 95.8% Amps A 8 8 7 5 83.6% B 7 7 7 5 83.6% Both 15 13 11 10 90.8% e = extrapolated based on scores for 100 and 50 sample size In general, the test while seemingly objective has more negatives than positives when measured against the consensus of the objectivists (and some subjectivists) in this group as to what constitutes a good abx test: TEST POSITIVES *double blind *level matched TEST NEGATIVES *short snippets *no user control over switching and (apparently) no repeats *no user control over content *group test, no safeguards against visual interaction *no group selection criteria apparent and no pre-training or testing The results and the summary of positives/negatives above raise some interesting questions: *why, for example, should one cable be significantly identified when "x" and the other fail miserably to be identified. This has to be due and interaction between the characteristics of the music samples chosen, the characteristics of the cables under test, and perhaps aggravated by the use of short snippets with an inadequate time frame to establish the proper evaluation context. 
No it doen't Harry, I doesn't *have* to be due to anything but random chance. Did the test itself create the overall null where people could not differentiate based soley on the test not favoring B as much as A? * do the differences in people scoring high on the two tests support the idea that different people react to different attributes of the DUT's. Or does it again suggest some interaction between the music chosen, the characteristics of the individual pieces, and perhaps the evaluation time frame. No, since the high scorers on one test were not the high scorers in the other test. It's called a distrinution, harry, and it is simply more evidence that there were in fact no audible differences - as any reasonable person would expect. http://www.tagmclaren.com/members/news/news77.asp -- Stewart Pinkerton | Music is Art - Audio is Engineering I notice no comment on this latter part, Stewart. That is the *SUBSTANCE* of the interesting results of the test/techniques used and the questions raised. |
#16
Blindtest question
In article zVGVa.15179$Ho3.2323@sccrnsc03, "Harry Lavo" wrote:

> Thomas -
>
> Thanks for the post of the Tag McLaren test link (and to Tom for the
> other references). I've looked at the Tag link and suspect it's going
> to add to the controversy here. My comments on the test follow.
> [comments and table snipped]
> [snip]
> I'd like to hear other views on this test.

Mr. Lavo, here are some comments on your numbers. The short story: your numbers are bogus. The long story follows.

I don't know how you came up with critical values for what you think is a reasonable level of significance. The actual critical values are:

  n      .01 level   .05 level   .20 level
  96        60          57          53
  84        54          51          47
 180       107         102          97
   8         8           7           6
   7         7           7           6
  15        13          12          10

The values you provide for what you call 95% confidence (i.e., .05 level of significance) are almost the correct values for 20% significance.

You make much of an apparently borderline significant result, where the best individual cable-test score was 11 of 15 correct. If that had been the entire experiment, we would have a p-value of .059, reflecting the probability that one would do at least that well in a single run of 15 trials. That is, the probability that someone would score 11, 12, 13, 14, or 15 correct just by guessing is .059; also, the probability that the score is less than 11 would be 1 - .059 = .941 for a single batch of 15 trials.

But what was reported was the best such performance in a dozen sets of trials. That's not the same as a single run of 15 trials. The probability of at least one of 12 subjects doing at least as well as 11 of 15 is 1 - [probability that all 12 do worse than 11 of 15]. Thus we get 1 - (.941)^12, which is about .52. So, even your star performer is not doing better than chance suggests he should.

Mr. Lavo, your conjecture (that the test organizers have tried to distort the results to fit an agenda) appears to be without support.

Now for some comments on the TAG McLaren report itself. There are problems with some numbers provided by TAG McLaren, but they are confined to background material. There do not appear to be problems with the actual report of experimental results. TAG McLaren claims that you need more than 10 trials to obtain results significant at the .01 level, but they are wrong. In fact, 7 trials suffice. With 10 trials you can reach .001. There is a table just before the results section of their report with some discussion about small sample sizes. The first several rows of that table have bogus numbers in the third column, and their sample-size claims are based on those wrong numbers. However, the values for 11 or more trials are correct. As I have already noted, the numbers used in the report itself appear to be correct.

Some would argue that their conclusion should be that they found no evidence to support a claim of audible difference, rather than concluding that there was no audible difference, but that's another issue. You have correctly noted concerns about the form of ABX presentation (not the same as the usual ABX scheme discussed on rahe), but that does not invalidate the experiment. There are questions about how the sample of listeners was obtained. For the most part, TAG McLaren seems to have designed a test, carried it out, run the numbers properly, and then accurately reported what they did. That's more than can be said for many tests.

JC
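The best-of-a-panel correction described in the post above is a two-line calculation. A Python check of the 1 - (.941)^12 arithmetic:

```python
from math import comb

def single_run_tail(correct: int, trials: int) -> float:
    """P(X >= correct) for one listener guessing through `trials` trials."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

p_single = single_run_tail(11, 15)        # about .059 for one listener
# Chance that the *best* of 12 independent guessers scores 11/15 or better:
p_best_of_12 = 1 - (1 - p_single) ** 12   # about .52
```

A result that looks borderline for one listener is better than a coin flip to show up somewhere in a panel of twelve, which is why a top individual score carries so little evidential weight on its own.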
#17
Blindtest question
"John Corbett" wrote in message
news:hQlWa.37169$YN5.32913@sccrnsc01...

In article zVGVa.15179$Ho3.2323@sccrnsc03, "Harry Lavo" wrote:

Thomas - Thanks for the post of the Tag McLaren test link (and to Tom for the other references). I've looked at the Tag link and suspect it's going to add to the controversy here. My comments on the test follow.

From the tone of the web info on this test, one can presume that Tag set out to show its relatively inexpensive gear was just as good as some acknowledged industry standards. But... wonder why Tag chose the 99% confidence level, being careful *not* to say that it was prechosen in advance? It is because, had they used the more common and almost universally-used 95% level, it would have shown that:

* When cable A was the "X", it was recognized at a significant level by the panel (and guess whose cable probably would "lose" in a preference test versus a universally recognized standard of excellence chosen as "tops" by both Stereophile and TAS, as well as by other industry publishers)

* One individual differentiated both cable A and the combined cables at the significant level

Results summarized as follows:

Tag McLaren Published ABX Results

                  Sample   99%    95%    Actual   Confidence
Total Test
  Cables   A        96      60    53e      52     94.8% e
           B        84      54    48e      38     coin toss
           Both    180     107    97e      90     coin toss
  Amps     A        96      60    53e      47     coin toss
           B        84      54    48e      38     coin toss
           Both    180     107    97e      85     coin toss
Top Individuals
  Cables   A         8       8     7        6     94.5%
           B         7       7     7        5     83.6%
           Both     15      13    11       11     95.8%
  Amps     A         8       8     7        5     83.6%
           B         7       7     7        5     83.6%
           Both     15      13    11       10     90.8%

e = extrapolated based on scores for 100 and 50 sample sizes

[snip]

I'd like to hear other views on this test.

Mr. Lavo, here are some comments on your numbers. The short story: your numbers are bogus. The long story follows.

I don't know how you came up with critical values for what you think is a reasonable level of significance.
For the trial counts in question, the critical values (minimum number correct, one-tailed) are:

      n    .01 level   .05 level   .20 level
     96       60          57          53
     84       54          51          47
    180      107         102          97
      8        8           7           6
      7        7           7           6
     15       13          12          10

The values you provide for what you call 95% confidence (i.e., the .05 level of significance) are almost the correct values for the .20 level of significance.

You make much of an apparently borderline-significant result, where the best individual cable-test score was 11 of 15 correct. If that had been the entire experiment, we would have a p-value of .059, reflecting the probability that one would do at least that well in a single run of 15 trials. That is, the probability that someone would score 11, 12, 13, 14, or 15 correct just by guessing is .059; equivalently, the probability that the score is less than 11 is 1 - .059 = .941 for a single batch of 15 trials.

But what was reported was the best such performance in a dozen sets of trials. That's not the same as a single run of 15 trials. The probability of at least one of 12 subjects doing at least as well as 11 of 15 is 1 - [probability that all 12 do worse than 11 of 15]. Thus we get 1 - (.941)^12, which is about .52. So even your star performer is not doing better than chance suggests he should.

Mr. Lavo, your conjecture (that the test organizers have tried to distort the results to fit an agenda) appears to be without support.

Now for some comments on the TAG McLaren report itself. There are problems with some numbers provided by TAG McLaren, but they are confined to background material; there do not appear to be problems with the actual report of experimental results. TAG McLaren claims that you need more than 10 trials to obtain results significant at the .01 level, but they are wrong. In fact, 7 trials suffice, and with 10 trials you can reach .001. There is a table just before the results section of their report, with some discussion about small sample sizes. The first several rows of that table have bogus numbers in the third column, and their sample-size claims are based on those wrong numbers. However, the values for 11 or more trials are correct. As I have already noted, the numbers used in the report itself appear to be correct.

Some would argue that their conclusion should be that they found no evidence to support a claim of audible difference, rather than concluding that there was no audible difference, but that's another issue. You have correctly noted concerns about the form of ABX presentation (not the same as the usual ABX scheme discussed on rahe), but that does not invalidate the experiment. There are also questions about how the sample of listeners was obtained. For the most part, TAG McLaren seems to have designed a test, carried it out, run the numbers properly, and then accurately reported what they did. That's more than can be said for many tests.

John - I have explained that I made an error, and I thank you for pointing it out. I also explained how and why, but that is an explanation, not an excuse. Perhaps you could bring your statistical skills to bear on the Greenhill test as reported in Stereo Review in 1983. The raw results were posted here by me and by Ludovic about a year ago.
As I recall, one of the participants in that test did very well across several tests... I'd be interested in your calculation of the probability of his achieving those results by chance. This is not a troll... the mathematics of it are simply beyond me, and I tried at the time to calculate the odds and apparently failed. The argument was: outlier, or golden ear.
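Corbett's critical-value table can be checked exactly rather than by normal approximation. A short sketch (Python standard library only): the critical value is the smallest score whose exact one-tailed binomial tail probability does not exceed the chosen significance level.

```python
from math import comb

def tail(k, n):
    """P(X >= k) for X ~ Binomial(n, 1/2), computed exactly."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

def critical_value(n, alpha):
    """Smallest score k such that guessing reaches k or more
    with probability at most alpha (one-tailed)."""
    return next(k for k in range(n + 1) if tail(k, n) <= alpha)

# Spot-check a few rows of the table quoted above.
print(critical_value(96, 0.01))   # 60
print(critical_value(15, 0.05))   # 12
print(critical_value(8, 0.20))    # 6
```

The same function also confirms Corbett's side remark: with n = 7 trials, a perfect 7/7 already has tail probability 1/128, below the .01 level.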
#18
Blindtest question
(Thomas A) wrote:
(Nousaine) wrote in message ... (Thomas A) wrote:

Is there any published DBT of amps, CD players or cables where the number of trials is greater than 500? If the difference is minuscule, it is likely that many "guesses" are wrong, and it would require many trials to reveal any subtle difference?

Thomas

With regard to amplifiers, as of May 1990 there had been such tests. In 1978 QUAD published an experiment with 576 trials. In 1980 Smith, Peterson and Jackson published an experiment with 1104 trials; in 1989 Stereophile published a 3530-trial comparison. In 1986 Clark & Masters published an experiment with 772 trials. All were null.

There's a misconception that blind tests tend to have very small sample sizes. As of 1990 the 23 published amplifier experiments had a mean of 426 and a median of 90 trials. If we exclude the 3530-trial experiment, the mean becomes 285 trials; the median remains unchanged.

Ok thanks. Is it possible to get the numbers for each test? I would like to see if it is possible to do a meta-analysis in the amplifier case. The test by tagmclaren is an additional one: http://www.tagmclaren.com/members/news/news77.asp

Thomas

I did just that in 1990, to answer the nagging question "has sample size and barely audible difference hidden anything?" A summary of these data can be found in The Proceedings of the 1990 AES Conference "The Sound of Audio", May 1990, in the paper "The Great Debate: Is Anyone Winning?" (www.aes.org)

In general, larger sample sizes did not produce more significant results, and there wasn't a relationship of criterion score to sample size.

IME, if there is a true just-audible difference, scores tend to run high. For example, in tests I ran last summer, scores were, as I recall, 21/23 and 17/21 in two successive runs in a challenge where the session leader claimed a transparent transfer. IOW, results go from chance to strongly positive once threshold has been reached.
You can test this for yourself at www.pcabx.com where Arny Krueger has training sessions with increasing levels of difficulty. Also the codec testing sites are a good place to investigate this issue. |
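The two scores Nousaine recalls (21/23 and 17/21) can be checked against chance with the same exact binomial arithmetic (a sketch, standard library only):

```python
from math import comb

def p_at_least(k, n):
    """Probability of at least k correct in n trials under pure guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# The two runs recalled above.
p_21_23 = p_at_least(21, 23)   # about 3e-5
p_17_21 = p_at_least(17, 21)   # about 0.0036
print(p_21_23, p_17_21)
```

Both runs are individually significant well past the .01 level, which illustrates the point that scores tend to run high once a difference is genuinely audible.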
#19
Blindtest question
(Nousaine) wrote in message news:nWGVa.15987$YN5.14030@sccrnsc01...
(Thomas A) wrote: (Nousaine) wrote in message ... (Thomas A) wrote:

Is there any published DBT of amps, CD players or cables where the number of trials is greater than 500? If the difference is minuscule, it is likely that many "guesses" are wrong, and it would require many trials to reveal any subtle difference?

Thomas

With regard to amplifiers, as of May 1990 there had been such tests. In 1978 QUAD published an experiment with 576 trials. In 1980 Smith, Peterson and Jackson published an experiment with 1104 trials; in 1989 Stereophile published a 3530-trial comparison. In 1986 Clark & Masters published an experiment with 772 trials. All were null.

There's a misconception that blind tests tend to have very small sample sizes. As of 1990 the 23 published amplifier experiments had a mean of 426 and a median of 90 trials. If we exclude the 3530-trial experiment, the mean becomes 285 trials; the median remains unchanged.

Ok thanks. Is it possible to get the numbers for each test? I would like to see if it is possible to do a meta-analysis in the amplifier case. The test by tagmclaren is an additional one: http://www.tagmclaren.com/members/news/news77.asp

Thomas

I did just that in 1990, to answer the nagging question "has sample size and barely audible difference hidden anything?" A summary of these data can be found in The Proceedings of the 1990 AES Conference "The Sound of Audio", May 1990, in the paper "The Great Debate: Is Anyone Winning?" (www.aes.org)

Ok thanks. I'll look it up.

In general, larger sample sizes did not produce more significant results, and there wasn't a relationship of criterion score to sample size.

Were the data from all experiments pooled? That might not be the best way, if some experiments *did* include real audible differences but had sample sizes too small to reveal any statistically significant difference, whereas others did not include a real audible difference. Were any responses measured in the experiments?

Did any of the tests include control tests where the difference was audible but subtle, and then compare, e.g., different subjects? Were the "best scorers" allowed to repeat the experiments in the main experiment? Many questions, but they may be relevant when making a meta-analysis.

In addition, have any of the experiments used test signals in the LF range (around 15-20 Hz) and high-capability subwoofers (120 dB SPL @ 20 Hz)? I'm just curious, since the tests from the Swedish Audio-Technical Society frequently identify amplifiers that roll off in the low end using blind tests. It might not be said to be an audible difference, since the difference is perceived as a difference in vibrations in the body. I think I mentioned this before. Also, for testing CD players, has anybody used a sin^2 pulse in evaluating audible differences?

IME, if there is a true just-audible difference, scores tend to run high. For example, in tests I ran last summer, scores were, as I recall, 21/23 and 17/21 in two successive runs in a challenge where the session leader claimed a transparent transfer. IOW, results go from chance to strongly positive once threshold has been reached.

Yes, I have come to similar conclusions myself in my own system.

You can test this for yourself at www.pcabx.com, where Arny Krueger has training sessions with increasing levels of difficulty. Also, the codec testing sites are a good place to investigate this issue.

I've tried the tests at Arny's site a couple of times, but I feel I need better hardware to do these tests more accurately.
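The simplest pooling Thomas asks about, if per-test correct/total counts were available, is an exact binomial test on the combined counts. A sketch follows; the trial totals are the ones Nousaine cites, but the correct-answer counts are hypothetical placeholders invented for illustration, not data from the cited papers.

```python
from math import comb

def p_at_least(k, n):
    """Exact binomial tail probability under pure guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# (correct, total) per experiment. Totals are from the cited tests;
# the correct counts are HYPOTHETICAL, chosen near chance for illustration.
experiments = [(295, 576), (560, 1104), (392, 772)]

correct = sum(c for c, _ in experiments)
total = sum(n for _, n in experiments)
pooled_p = p_at_least(correct, total)
print(correct, total, pooled_p)
```

Note that pooling raw counts treats every trial as exchangeable across experiments; if the tests differ in listeners or gear, combining per-test p-values (e.g., Stouffer's method) is a common alternative that weights tests rather than trials.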
#21
Blindtest question
"Thomas A" wrote in message
newsBUUa.141509$OZ2.27088@rwcrnsc54

Is there any published DBT of amps, CD players or cables where the number of trials is greater than 500?

I think N = 200+ has been reached.

If the difference is minuscule, it is likely that many "guesses" are wrong, and it would require many trials to reveal any subtle difference?

If you look at theory casually, you might reach that conclusion. However, what invariably happens in tests that produce questionable results with a small number of trials is that adding more trials makes it clearer than ever that the small-sample results were due to random guessing.
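Arny's point can be illustrated numerically: a guesser has a fair chance of a suggestive score in a short run, but almost no chance of sustaining the same hit rate over a long one (a sketch, standard library only):

```python
from math import comb

def p_at_least(k, n):
    """Probability of at least k correct in n trials under pure guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# 70% correct looks suggestive over 10 trials: roughly one guesser
# in six gets there by luck alone...
p_small = p_at_least(7, 10)     # ~0.17
# ...but the same 70% hit rate sustained over 100 trials essentially
# never happens by guessing.
p_large = p_at_least(70, 100)
print(p_small, p_large)
```

So adding trials does exactly what the post says: a lucky small-sample streak either regresses toward 50%, or a real difference pulls clear of chance.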
#22
Blindtest question
"Arny Krueger" wrote in message news:0xnVa.4274$cF.1296@rwcrnsc53...
"Thomas A" wrote in message newsBUUa.141509$OZ2.27088@rwcrnsc54 Is there any published DBT of amps, CD players or cables where the number of trials are greater than 500? I think N = 200+ has been reached. If there difference is miniscule there is likely that many "guesses" are wrong and would require many trials to reveal any subtle difference? If you look at theory casually, you might reach that conclusion. However, what invariably happens in tests that produce questionable results with a small number of trials, is that adding more trials makes it clearer than ever that the small-sample results were due to random guessing. So what happens when the situation is small but just audible? Has any such test situations been set up? Does the result end up with close to 100% correct or e.g. 55% correct? My question is what happens when test subjects are "forced" with differences that approach to the "audible limit". |
#23
Blindtest question
Ref: Blindtest issues...
For what it's worth... use about any criteria you desire regarding cables, amps, etc. Also, if you feel better about it, put a sign on each component with its name in blazing qualities. It possibly will make you feel better about the system and, strangely, the whole thing might well sound better. That is part of this whole experience regarding audio: if your prejudices are deeply set from within, then give in to them and enjoy the music.

Be happy with the most expensive equipment you can afford; it might well be pretty good. Mentally, you might come to accept that fact; music will flourish, bloom, and all will be right with the Universe!!

All this "shadow-boxing" regarding "all is the same" is interesting in this strange dimension that surrounds Audio. Go with your own prejudices and be happy. Very important to your Audio happiness!

Leonard...

_______________________________________________________

On Sun, 27 Jul 2003 18:11:48 +0000, Thomas A wrote:

Is there any published DBT of amps, CD players or cables where the number of trials is greater than 500? If the difference is minuscule, it is likely that many "guesses" are wrong, and it would require many trials to reveal any subtle difference?

Thomas