Blindtest question

  #1   Thomas A

Is there any published DBT of amps, CD players or cables where the
number of trials is greater than 500?

If the difference is miniscule, it is likely that many "guesses"
are wrong; wouldn't it then require many trials to reveal any subtle
difference?

Thomas

  #2   Steven Sullivan

Thomas A wrote:
Is there any published DBT of amps, CD players or cables where the
number of trials is greater than 500?

If the difference is miniscule, it is likely that many "guesses"
are wrong; wouldn't it then require many trials to reveal any subtle
difference?


There are published tests where people claimed they could hear the
difference sighted, but when they were 'blinded' they could not.
In this case the argument that 500 trials are needed would seem
to be weak.

However, a real and miniscule difference would certainly be
discerned more reliably if there was specific training to hear it
beforehand.

--
-S.

  #3   Thomas A

Steven Sullivan wrote in message news:d_XUa.142496$GL4.36308@rwcrnsc53...
Thomas A wrote:
Is there any published DBT of amps, CD players or cables where the
number of trials is greater than 500?

If the difference is miniscule, it is likely that many "guesses"
are wrong; wouldn't it then require many trials to reveal any subtle
difference?


There are published tests where people claimed they could hear the
difference sighted, but when they were 'blinded' they could not.
In this case the argument that 500 trials are needed would seem
to be weak.


Yes, that's for sure. But how are scientific tests of just noticeable
difference set up? A difference, when very small, could introduce more
incorrect answers from the test subjects. Thus I think the question is
interesting.

However, a real and miniscule difference would certainly be
discerned more reliably if there was specific training to hear it
beforehand.


Yes, but still, if the difference is real and miniscule it could
introduce incorrect answers even if there is specific training
beforehand. If it were an all-or-nothing thing, then the result
would always be 100% correct (difference) or 50% (no difference).
What if the answers are 60% correct?
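
A rough way to put numbers on that question: suppose a listener is genuinely
right 60% of the time. The sketch below (Python, assuming scipy is available;
the 60% hit rate, 0.05 significance level and 80% power target are
illustrative assumptions, not figures from any published test) searches for
the number of trials at which such a listener would usually be detected:

    # How many trials to detect a true 60%-correct listener with a
    # one-sided exact binomial test at alpha = 0.05 and 80% power?
    from scipy.stats import binom

    alpha, target_power, p_true = 0.05, 0.80, 0.60
    for n in range(10, 2001):
        crit = int(binom.isf(alpha, n, 0.5)) + 1   # smallest significant score
        power = binom.sf(crit - 1, n, p_true)      # chance listener reaches it
        if power >= target_power:
            print(n, crit, round(power, 2))
            break

Under these assumptions the answer comes out at roughly 150-160 trials: well
under 500, but far more than the dozen or so trials of a typical casual test.
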
  #4   normanstrong

"Thomas A" wrote in message
newsBUUa.141509$OZ2.27088@rwcrnsc54...
Is there any published DBT of amps, CD players or cables where the
number of trials is greater than 500?


I've never seen one. It would be difficult to get a single subject to
do that many trials. So, it would have to be many subjects and they
would have to be isolated to prevent subtle influence from one to the
other.

If the difference is miniscule, it is likely that many "guesses"
are wrong; wouldn't it then require many trials to reveal any subtle
difference?


(Note that the word is spelled "minuscule.")

Norm Strong
  #5   Stewart Pinkerton

On 28 Jul 2003 14:46:01 GMT, (Thomas A)
wrote:

Steven Sullivan wrote in message news:d_XUa.142496$GL4.36308@rwcrnsc53...
Thomas A wrote:
Is there any published DBT of amps, CD players or cables where the
number of trials is greater than 500?

If the difference is miniscule, it is likely that many "guesses"
are wrong; wouldn't it then require many trials to reveal any subtle
difference?


There are published tests where people claimed they could hear the
difference sighted, but when they were 'blinded' they could not.
In this case the argument that 500 trials are needed would seem
to be weak.


Yes, that's for sure. But how are scientific tests of just noticeable
difference set up? A difference, when very small, could introduce more
incorrect answers from the test subjects. Thus I think the question is
interesting.

However, a real and miniscule difference would certainly be
discerned more reliably if there was specific training to hear it
beforehand.


Yes, but still, if the difference is real and miniscule it could
introduce incorrect answers even if there is specific training
beforehand. If it were an all-or-nothing thing, then the result
would always be 100% correct (difference) or 50% (no difference).
What if the answers are 60% correct?


The real problem is that you can *never* say that the difference is
real, just that there is a very high statistical probability that a
difference was detected. After all, it *is* possible to toss a coin
and get 500 heads in a row, it's just *very* unlikely.
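
The arithmetic behind that remark is a one-liner; in Python, purely for
illustration:

    print(0.5 ** 500)   # ~3.05e-151: possible in principle, never in practice
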
--

Stewart Pinkerton | Music is Art - Audio is Engineering


  #8   Steven Sullivan

Thomas A wrote:
Steven Sullivan wrote in message news:d_XUa.142496$GL4.36308@rwcrnsc53...
Thomas A wrote:
Is there any published DBT of amps, CD players or cables where the
number of trials is greater than 500?

If the difference is miniscule, it is likely that many "guesses"
are wrong; wouldn't it then require many trials to reveal any subtle
difference?


There are published tests where people claimed they could hear the
difference sighted, but when they were 'blinded' they could not.
In this case the argument that 500 trials are needed would seem
to be weak.


Yes, that's for sure. But how are scientific tests of just noticeable
difference set up? A difference, when very small, could introduce more
incorrect answers from the test subjects. Thus I think the question is
interesting.

However, a real and miniscule difference would certainly be
discerned more reliably if there was specific training to hear it
beforehand.


Yes, but still, if the difference is real and miniscule it could
introduce incorrect answers even if there is specific training
beforehand. If it were an all-or-nothing thing, then the result
would always be 100% correct (difference) or 50% (no difference).
What if the answers are 60% correct?


What level of certitude are you looking for? Scientists use
statistical tools to calculate probabilities of different
kinds of error in such cases.

--
-S.

  #9   Arny Krueger

"Thomas A" wrote in message
newsBUUa.141509$OZ2.27088@rwcrnsc54

Is there any published DBT of amps, CD players or cables where the
number of trials is greater than 500?


I think N = 200+ has been reached.

If the difference is miniscule, it is likely that many "guesses"
are wrong; wouldn't it then require many trials to reveal any subtle
difference?


If you look at theory casually, you might reach that conclusion. However,
what invariably happens in tests that produce questionable results with a
small number of trials is that adding more trials makes it clearer than
ever that the small-sample results were due to random guessing.

  #11   Thomas A

"Arny Krueger" wrote in message news:0xnVa.4274$cF.1296@rwcrnsc53...
"Thomas A" wrote in message
newsBUUa.141509$OZ2.27088@rwcrnsc54

Is there any published DBT of amps, CD players or cables where the
number of trials is greater than 500?


I think N = 200+ has been reached.

If the difference is miniscule, it is likely that many "guesses"
are wrong; wouldn't it then require many trials to reveal any subtle
difference?


If you look at theory casually, you might reach that conclusion. However,
what invariably happens in tests that produce questionable results with a
small number of trials is that adding more trials makes it clearer than
ever that the small-sample results were due to random guessing.


So what happens when the difference is small but just audible? Have any
such test situations been set up? Does the result end up close to
100% correct or e.g. 55% correct? My question is what happens when
test subjects are "forced" to judge differences that approach the
"audible limit".

  #12   Thomas A

Steven Sullivan wrote in message news:UqnVa.4003$Oz4.1480@rwcrnsc54...
Thomas A wrote:
Steven Sullivan wrote in message news:d_XUa.142496$GL4.36308@rwcrnsc53...
Thomas A wrote:
Is there any published DBT of amps, CD players or cables where the
number of trials is greater than 500?

If the difference is miniscule, it is likely that many "guesses"
are wrong; wouldn't it then require many trials to reveal any subtle
difference?

There are published tests where people claimed they could hear the
difference sighted, but when they were 'blinded' they could not.
In this case the argument that 500 trials are needed would seem
to be weak.


Yes, that's for sure. But how are scientific tests of just noticeable
difference set up? A difference, when very small, could introduce more
incorrect answers from the test subjects. Thus I think the question is
interesting.

However, a real and miniscule difference would certainly be
discerned more reliably if there was specific training to hear it
beforehand.


Yes, but still, if the difference is real and miniscule it could
introduce incorrect answers even if there is specific training
beforehand. If it were an all-or-nothing thing, then the result
would always be 100% correct (difference) or 50% (no difference).
What if the answers are 60% correct?


What level of certitude are you looking for? Scientists use
statistical tools to calculate probabilities of different
kinds of error in such cases.


Well, confidence limits of 95% or 99% are usually applied. The power of
the test is however important when you approach the audible limit.
Also, with sample sizes above 200 you need not use a correction for
continuity in the statistical calculation. I am not sure, but I think
this correction applies when sample sizes are 25-200. Below
25, the correction is not sufficient.
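
For what it's worth, the effect of the continuity correction is easy to see
numerically. A minimal sketch, assuming scipy is available; the sample sizes
and the 55%-correct score are arbitrary illustrations:

    # Exact binomial tail vs. normal approximation, with and without
    # the continuity correction, for a 55%-correct score.
    from math import sqrt
    from scipy.stats import binom, norm

    for n in (20, 100, 500):
        k = round(0.55 * n)                          # observed correct answers
        exact = binom.sf(k - 1, n, 0.5)              # P(X >= k) under guessing
        sd = sqrt(n * 0.25)
        plain = norm.sf((k - n / 2) / sd)            # no correction
        corrected = norm.sf((k - 0.5 - n / 2) / sd)  # with correction
        print(n, f"{exact:.4f}", f"{plain:.4f}", f"{corrected:.4f}")

At n = 20 the uncorrected approximation is visibly off, while the corrected
one tracks the exact tail closely; for small runs, though, the exact binomial
calculation costs nothing and avoids the issue entirely.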

  #13   Harry Lavo

Thomas -

Thanks for the post of the Tag Mclaren test link (and to Tom for the other
references). I've looked at the Tag link and suspect it's going to add to
the controversy here. My comments on the test follow.

From the tone of the web info on this test, one can presume that Tag set out
to show its relatively inexpensive gear was just as good as some
acknowledged industry standards. But... one wonders why Tag chose the 99%
confidence level? Being careful *not* to say that it was prechosen in
advance? Is it because, had they used the more common and almost
universally-used 95% level, it would have shown that:

* When cable A was the "X" it was recognized at a significant level by the
panel (and guess whose cable probably would "lose" in a preference test
versus a universally recognized standard of excellence chosen as "tops" by
both Stereophile and TAS, as well as by other industry publishers)

* One individual differentiated both cable A and combined cables at the
significant level

Results summarized as follows:

Tag Mclaren Published ABX Results

(Columns: sample size; score needed for 99% confidence; score needed
for 95% confidence; actual score; confidence level actually reached.)

          Sample    99%    95%    Actual   Confidence

Total Test

Cables
  A           96     60    53 e      52   94.8% e
  B           84     54    48 e      38   coin toss
  Both       180    107    97 e      90   coin toss

Amps
  A           96     60    53 e      47   coin toss
  B           84     54    48 e      38   coin toss
  Both       180    107    97 e      85   coin toss

Top Individuals

Cables
  A            8      8     7         6   94.5%
  B            7      7     7         5   83.6%
  Both        15     13    11        11   95.8%

Amps
  A            8      8     7         5   83.6%
  B            7      7     7         5   83.6%
  Both        15     13    11        10   90.8%

e = extrapolated based on scores for 100 and 50 sample size

In general, the test, while seemingly objective, has more negatives than
positives when measured against the consensus of the objectivists (and some
subjectivists) in this group as to what constitutes a good abx test:

TEST POSITIVES
*double blind
*level matched

TEST NEGATIVES
*short snippets
*no user control over switching and (apparently) no repeats
*no user control over content
*group test, no safeguards against visual interaction
*no group selection criteria apparent and no pre-training or testing

The results and the summary of positives/negatives above raise some
interesting questions:

*why, for example, should one cable be significantly identified when "x" and
the other fail miserably to be identified. This has to be due to an
interaction between the characteristics of the music samples chosen, the
characteristics of the cables under test, and perhaps aggravated by the use
of short snippets with an inadequate time frame to establish the proper
evaluation context. Did the test itself create the overall null, where
people could not differentiate, based solely on the test not favoring B as
much as A?

* do the differences in people scoring high on the two tests support the
idea that different people react to different attributes of the DUT's. Or
does it again suggest some interaction between the music chosen, the
characteristics of the individual pieces, and perhaps the evaluation time
frame.

* or is it possible that the abx test itself, when used with short snippets,
makes some kinds of differences more apparent and others less apparent, and
thus, by working against exposing *all* kinds of differences, helps create
more *no difference* results than there should be.

* since the panel is not identified and there was no training, do the
results suggest a "dumbing down" of differentiation from the scores of the
more able listeners? I am sure it will be suggested that the two different
high scorers were simply random outliers...I'm not so sure especially since
the individual scoring high on the cable test hears the cable differences
exactly like the general sample but at a higher level (required because of
smaller sample size) and the high scorer on the amp test is in much the same
position.

if some of these arguments sound familiar, they certainly raise echoes of
the issues raised here by subjectivists over the years...and yet these
specifics are rooted in the results of this one test.

I'd like to hear other views on this test.

"Thomas A" wrote in message
news:ahwVa.6957$cF.2308@rwcrnsc53...
(Nousaine) wrote in message

...
(Thomas A) wrote:

Is there any published DBT of amps, CD players or cables where the
number of trials is greater than 500?

If the difference is miniscule, it is likely that many "guesses"
are wrong; wouldn't it then require many trials to reveal any subtle
difference?

Thomas


With regard to amplifiers, as of May 1990 there had been such tests. In 1978
QUAD published an experiment with 576 trials. In 1980 Smith, Peterson and
Jackson published an experiment with 1104 trials; in 1989 Stereophile
published a 3530 trial comparison. In 1986 Clark & Masters published an
experiment with 772 trials. All were null.

There's a misconception that blind tests tend to have very small sample
sizes. As of 1990 the 23 published amplifier experiments had a mean of 426
trials and a median of 90. If we exclude the 3530 trial experiment the mean
becomes 285 trials. The median remains unchanged.


Ok thanks. Is it possible to get the numbers for each test? I would
like to see if it is possible to do a meta-analysis in the amplifier
case. The test by tagmclaren is an additional one:

http://www.tagmclaren.com/members/news/news77.asp

Thomas


  #14   Nousaine

(Thomas A) wrote:

(Nousaine) wrote in message
...
(Thomas A) wrote:

Is there any published DBT of amps, CD players or cables where the
number of trials is greater than 500?

If the difference is miniscule, it is likely that many "guesses"
are wrong; wouldn't it then require many trials to reveal any subtle
difference?

Thomas


With regard to amplifiers, as of May 1990 there had been such tests. In 1978
QUAD published an experiment with 576 trials. In 1980 Smith, Peterson and
Jackson published an experiment with 1104 trials; in 1989 Stereophile
published a 3530 trial comparison. In 1986 Clark & Masters published an
experiment with 772 trials. All were null.

There's a misconception that blind tests tend to have very small sample
sizes. As of 1990 the 23 published amplifier experiments had a mean of 426
trials and a median of 90. If we exclude the 3530 trial experiment the mean
becomes 285 trials. The median remains unchanged.


Ok thanks. Is it possible to get the numbers for each test? I would
like to see if it is possible to do a meta-analysis in the amplifier
case. The test by tagmclaren is an additional one:

http://www.tagmclaren.com/members/news/news77.asp

Thomas


I did just that in 1990 to answer the nagging question "has sample size and
barely audible difference hidden anything?" A summary of these data can be
found in The Proceedings of the 1990 AES Conference "The Sound of Audio" May
1990 in the paper "The Great Debate: Is Anyone Winning?" (www.aes.org)

In general larger sample sizes did not produce more significant results and
there wasn't a relationship of criterion score to sample size.

IME if there is a true just-audible difference scores tend to run high. For
example in tests I ran last summer scores were, as I recall, 21/23 and 17/21 in
two successive runs in a challenge where the session leader claimed a
transparent transfer. IOW results go from chance to strongly positive once
threshold has been reached.
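
For scale, the chance of guessing at least that well can be checked with a
few lines of Python (math.comb is in the standard library):

    from math import comb

    def p_at_least(k, n):
        # P(X >= k correct) in n fair-coin trials
        return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

    print(p_at_least(21, 23))   # ~3.3e-05
    print(p_at_least(17, 21))   # ~0.0036

Both scores sit far out in the tail, which is the chance-to-strongly-positive
pattern described above.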

You can test this for yourself at www.pcabx.com where Arny Krueger has
training sessions with increasing levels of difficulty. Also the codec testing
sites are a good place to investigate this issue.

  #15   Nousaine

"Harry Lavo" wrote:

Thomas -

Thanks for the post of the Tag Mclaren test link (and to Tom for the other
references). I've looked at the Tag link and suspect it's going to add to
the controversy here.


Actually there's no 'controversy' here. No proponent of amp/wire-sound has
ever shown that nominally competent amps or wires have any sound of their own
when played back over loudspeakers.

The only 'controversy' is over whether Arny Krueger's pcabx tests with
headphones and special programs can be extrapolated to commercially available
programs and speakers in a normally reverberant environment.

The Tag-M results are fully within those expected given the more than 2 dozen
published experiments of amps and wires.

My comments on the test follow.

From the tone of the web info on this test, one can presume that Tag set out
to show its relatively inexpensive gear was just as good as some
acknowledged industry standards. But... one wonders why Tag chose the 99%
confidence level?


Why not? But you can analyze it any way you want. That's the wonderful thing
about published results.

Being careful *not* to say that it was prechosen in
advance? Is it because, had they used the more common and almost
universally-used 95% level, it would have shown that:

* When cable A was the "X" it was recognized at a significant level by the
panel (and guess whose cable probably would "lose" in a preference test
versus a universally recognized standard of excellence chosen as "tops" by
both Stereophile and TAS, as well as by other industry publishers)

* One individual differentiated both cable A and combined cables at the
significant level

Results summarized as follows:

Tag Mclaren Published ABX Results

(Columns: sample size; score needed for 99% confidence; score needed
for 95% confidence; actual score; confidence level actually reached.)

          Sample    99%    95%    Actual   Confidence

Total Test

Cables
  A           96     60    53 e      52   94.8% e
  B           84     54    48 e      38   coin toss
  Both       180    107    97 e      90   coin toss

Amps
  A           96     60    53 e      47   coin toss
  B           84     54    48 e      38   coin toss
  Both       180    107    97 e      85   coin toss

Top Individuals

Cables
  A            8      8     7         6   94.5%
  B            7      7     7         5   83.6%
  Both        15     13    11        11   95.8%

Amps
  A            8      8     7         5   83.6%
  B            7      7     7         5   83.6%
  Both        15     13    11        10   90.8%

e = extrapolated based on scores for 100 and 50 sample size

In general, the test, while seemingly objective, has more negatives than
positives when measured against the consensus of the objectivists (and some
subjectivists) in this group as to what constitutes a good abx test:


This is what always happens with 'bad news.' Instead of giving us contradictory
evidence we get endless wishful 'data-dredging' to find any possible reason to
ignore the evidence.

In any other circle when one thinks the results of a given experiment are wrong
they just duplicate it showing the error OR produce a valid one with contrary
evidence.

TEST POSITIVES
*double blind
*level matched

TEST NEGATIVES
*short snippets
*no user control over switching and (apparently) no repeats
*no user control over content
*group test, no safeguards against visual interaction
*no group selection criteria apparent and no pre-training or testing


OK how many of your sighted 'tests' have ignored one or all of these positives
or negatives?

The results and the summary of positives/negatives above raise some
interesting questions:


No, not really. All of the true questions about bias controlled listening tests
have been addressed prior.


*why, for example, should one cable be significantly identified when "x" and
the other fail miserably to be identified. This has to be due to an
interaction between the characteristics of the music samples chosen, the
characteristics of the cables under test, and perhaps aggravated by the use
of short snippets with an inadequate time frame to establish the proper
evaluation context. Did the test itself create the overall null, where
people could not differentiate, based solely on the test not favoring B as
much as A?

* do the differences in people scoring high on the two tests support the
idea that different people react to different attributes of the DUT's. Or
does it again suggest some interaction between the music chosen, the
characteristics of the individual pieces, and perhaps the evaluation time
frame.

* or is it possible that the abx test itself, when used with short snippets,
makes some kinds of differences more apparent and others less apparent, and
thus, by working against exposing *all* kinds of differences, helps create
more *no difference* results than there should be.

* since the panel is not identified and there was no training, do the
results suggest a "dumbing down" of differentiation from the scores of the
more able listeners? I am sure it will be suggested that the two different
high scorers were simply random outliers...I'm not so sure especially since
the individual scoring high on the cable test hears the cable differences
exactly like the general sample but at a higher level (required because of
smaller sample size) and the high scorer on the amp test is in much the same
position.

if some of these arguments sound familiar, they certainly raise echoes of
the issues raised here by subjectivists over the years...and yet these
specifics are rooted in the results of this one test.

I'd like to hear other views on this test.


These results are consistent with the 2 dozen and more other bias controlled
listening tests of power amplifiers and wires.


"Thomas A" wrote in message
news:ahwVa.6957$cF.2308@rwcrnsc53...
(Nousaine) wrote in message
...
(Thomas A) wrote:

Is there any published DBT of amps, CD players or cables where the
number of trials is greater than 500?

If the difference is miniscule, it is likely that many "guesses"
are wrong; wouldn't it then require many trials to reveal any subtle
difference?

Thomas

With regard to amplifiers, as of May 1990 there had been such tests. In 1978
QUAD published an experiment with 576 trials. In 1980 Smith, Peterson and
Jackson published an experiment with 1104 trials; in 1989 Stereophile
published a 3530 trial comparison. In 1986 Clark & Masters published an
experiment with 772 trials. All were null.

There's a misconception that blind tests tend to have very small sample
sizes. As of 1990 the 23 published amplifier experiments had a mean of 426
trials and a median of 90. If we exclude the 3530 trial experiment the mean
becomes 285 trials. The median remains unchanged.


Ok thanks. Is it possible to get the numbers for each test? I would
like to see if it is possible to do a meta-analysis in the amplifier
case. The test by tagmclaren is an additional one:


Thanks for the reference.



  #16   Stewart Pinkerton

On Wed, 30 Jul 2003 03:26:24 GMT, "Harry Lavo"
wrote:

From the tone of the web info on this test, one can presume that Tag set out
to show its relatively inexpensive gear was just as good as some
acknowledged industry standards. But... one wonders why Tag chose the 99%
confidence level? Being careful *not* to say that it was prechosen in
advance? Is it because, had they used the more common and almost
universally-used 95% level, it would have shown that:


Can anyone smell fish? Specifically, red herring?

* When cable A was the "X" it was recognized at a significant level by the
panel (and guess whose cable probably would "lose" in a preference test
versus a universally recognized standard of excellence chosen as "tops" by
both Stereophile and TAS, as well as by other industry publishers)


No Harry, *all* tests fell below the 95% level, except for one single
participant in the cable test, which just scraped in. Given that there
were 12 volunteers, there's less than 2:1 odds against this happening
when tossing coins. Interesting that you also failed to note that the
'best performers' in the cable test did *not* perform well in the
amplifier test, and vice versa.

You do love to cherry-pick in search of your *required* result, don't
you?

* One individual differentiated both cable A and combined cables at the
significant level

Results summarized as follows:

Tag Mclaren Published ABX Results

Sample 99% 95% Actual Confidence

Total Test

Cables
A 96 60 53 e 52 94.8% e
B 84 54 48 e 38 coin toss
Both 180 107 97 e 90 coin toss

Amps
A 96 60 53 e 47 coin toss
B 84 54 48 e 38 coin toss
Both 180 107 97 e 85 coin toss

Top Individuals

Cables
A 8 8 7 6 94.5%
B 7 7 7 5 83.6%
Both 15 13 11 11 95.8%

Amps
A 8 8 7 5 83.6%
B 7 7 7 5 83.6%
Both 15 13 11 10 90.8%

e = extrapolated based on scores for 100 and 50 sample size

In general, the test, while seemingly objective, has more negatives than
positives when measured against the consensus of the objectivists (and some
subjectivists) in this group as to what constitutes a good abx test:

TEST POSITIVES
*double blind
*level matched

TEST NEGATIVES
*short snippets
*no user control over switching and (apparently) no repeats
*no user control over content
*group test, no safeguards against visual interaction
*no group selection criteria apparent and no pre-training or testing

The results and the summary of positives/negatives above raise some
interesting questions:

*why, for example, should one cable be significantly identified when "x" and
the other fail miserably to be identified. This has to be due to an
interaction between the characteristics of the music samples chosen, the
characteristics of the cables under test, and perhaps aggravated by the use
of short snippets with an inadequate time frame to establish the proper
evaluation context.


No it doesn't, Harry, it doesn't *have* to be due to anything but random
chance.

Did the test itself create the overall null, where people could not
differentiate, based solely on the test not favoring B as much as A?

* do the differences in people scoring high on the two tests support the
idea that different people react to different attributes of the DUT's. Or
does it again suggest some interaction between the music chosen, the
characteristics of the individual pieces, and perhaps the evaluation time
frame.


No, since the high scorers on one test were not the high scorers in
the other test. It's called a distribution, Harry, and it is simply
more evidence that there were in fact no audible differences - as any
reasonable person would expect.

http://www.tagmclaren.com/members/news/news77.asp


--

Stewart Pinkerton | Music is Art - Audio is Engineering

  #17   Steven Sullivan

Nousaine wrote:

This is what always happens with 'bad news.' Instead of giving us contradictory
evidence we get endless wishful 'data-dredging' to find any possible reason to
ignore the evidence.


In any other circle when one thinks the results of a given experiment are wrong
they just duplicate it showing the error OR produce a valid one with contrary
evidence.


Not necessarily. It's quite common for questions to be raised during peer
review of a scientific paper; it is then incumbent upon the *experimenter*, not
the critic, to justify his or her choice of protocol, or his/her explanation of
the results. Often this involves doing more experiments to address the reviewer's
concerns. Sometimes it merely involves explaining the results more clearly, or
in more qualified terms. If the experimenter feels the reviewer has ignored some
important point, that comes out too in the reply to the reviews.

I say all this having not yet visited the link, so I'm totally unbiased ;)

  #18   Harry Lavo

"Stewart Pinkerton" wrote in message
news:7jKVa.10234$Oz4.4174@rwcrnsc54...
On Wed, 30 Jul 2003 03:26:24 GMT, "Harry Lavo"
wrote:

From the tone of the web info on this test, one can presume that Tag set out
to show its relatively inexpensive gear was just as good as some
acknowledged industry standards. But... one wonders why Tag chose the 99%
confidence level? Being careful *not* to say that it was prechosen in
advance? Is it because, had they used the more common and almost
universally-used 95% level, it would have shown that:


Can anyone smell fish? Specifically, red herring?


Are you an outlier? Or are you simply sensitive to fish? Or did you not
conceive that thought double-blind and it is just your imagination? :=)

* When cable A was the "X" it was recognized at a significant level by the
panel (and guess whose cable probably would "lose" in a preference test
versus a universally recognized standard of excellence chosen as "tops" by
both Stereophile and TAS, as well as by other industry publishers)


No Harry, *all* tests fell below the 95% level, except for one single
participant in the cable test, which just scraped in. Given that there
were 12 volunteers, there's less than 2:1 odds against this happening
when tossing coins. Interesting that you also failed to note that the
'best performers' in the cable test did *not* perform well in the
amplifier test, and vice versa.


I'm sorry, but when rounded to whole numbers 94.8% is a lot closer to 95%
than one answer higher would be, which is about 96% in the larger panels and
97% in the smaller panels. The standard is 95%. To say that 94.8% doesn't
qualify is splitting hairs. I included the actual numbers needed to pass the
barrier just to satisfy the purists, but you *ARE* splitting hairs here,
Stewart.

You do love to cherry-pick in search of your *required* result, don't
you?


You mean not accepting the "received truth" without doing my own analysis is
cherry picking, is that it Stewart? We are not allowed to point out
anomalies and ask "why"? "how come"? "what could be causing this?"

And would you explain why a significant level was reached on the "A" cable
test with 96 trials? Was that "cherry picking"? C'mon, Stewart, you know
better. In fact the real issue here is: if one cable can be so readily
picked out, why can't the other be? What is it in the test, procedure,
quality of the cables, order bias, or what? Something is rotten in the
beloved state of ABX here!

* One individual differentiated both cable A and combined cables at the
significant level

Results summarized as follows:

Tag Mclaren Published ABX Results

(Columns: sample size; score needed for 99% confidence; score needed
for 95% confidence; actual score; confidence level actually reached.)

          Sample    99%    95%    Actual   Confidence

Total Test

Cables
  A           96     60    53 e      52   94.8% e
  B           84     54    48 e      38   coin toss
  Both       180    107    97 e      90   coin toss

Amps
  A           96     60    53 e      47   coin toss
  B           84     54    48 e      38   coin toss
  Both       180    107    97 e      85   coin toss

Top Individuals

Cables
  A            8      8     7         6   94.5%
  B            7      7     7         5   83.6%
  Both        15     13    11        11   95.8%

Amps
  A            8      8     7         5   83.6%
  B            7      7     7         5   83.6%
  Both        15     13    11        10   90.8%

e = extrapolated based on scores for 100 and 50 sample size

In general, the test, while seemingly objective, has more negatives than
positives when measured against the consensus of the objectivists (and some
subjectivists) in this group as to what constitutes a good abx test:

TEST POSITIVES
*double blind
*level matched

TEST NEGATIVES
*short snippets
*no user control over switching and (apparently) no repeats
*no user control over content
*group test, no safeguards against visual interaction
*no group selection criteria apparent and no pre-training or testing

The results and the summary of positives/negatives above raise some
interesting questions:

*why, for example, should one cable be significantly identified when "x" and
the other fail miserably to be identified. This has to be due to an
interaction between the characteristics of the music samples chosen, the
characteristics of the cables under test, and perhaps aggravated by the use
of short snippets with an inadequate time frame to establish the proper
evaluation context.


No it doesn't, Harry, it doesn't *have* to be due to anything but random
chance.

Did the test itself create the overall null, where people could not
differentiate, based solely on the test not favoring B as much as A?

* do the differences in people scoring high on the two tests support the
idea that different people react to different attributes of the DUT's. Or
does it again suggest some interaction between the music chosen, the
characteristics of the individual pieces, and perhaps the evaluation time
frame.


No, since the high scorers on one test were not the high scorers in
the other test. It's called a distribution, Harry, and it is simply
more evidence that there were in fact no audible differences - as any
reasonable person would expect.

http://www.tagmclaren.com/members/news/news77.asp


--

Stewart Pinkerton | Music is Art - Audio is Engineering


I notice no comment on this latter part, Stewart. That is the *SUBSTANCE*
of the interesting results of the test/techniques used and the questions
raised.


  #19   ludovic mirabel

Steven Sullivan wrote in message et...
Nousaine wrote:

This is what always happens with 'bad news.' Instead of giving us contradictory
evidence we get endless wishful 'data-dredging' to find any possible reason to
ignore the evidence.


In any other circle when one thinks the results of a given experiment are wrong
they just duplicate it showing the error OR produce a valid one with contrary
evidence.


Not necessarily. It's quite common for questions to be raised during peer
review of a scientific paper; it is then incumbent upon the *experimenter*, not
the critic, to justify his or her choice of protocol, or his/her explanation of
the results. Often this involves doing more experiments to address the reviewer's
concerns. Sometimes it merely involves explaining the results more clearly, or
in more qualified terms. If the experimenter feels the reviewer has ignored some
important point, that comes out too in the reply to the reviews.

I say all this having not yet visited the link, so I'm totally unbiased ;)


Bravo Mr. Sullivan. I hope you'll be as pleased to accept my applause
as I am to see your excellent exposure of the frequently-voiced
challenge to the ABX sceptics to "prove" their sceptical questions.
Exposure coming from an unexpected corner.
Perhaps we're seeing a revival of intellectual integrity in debate on
RAHE.

I promise to quote your summary when occasion warrants it.
Ludovic Mirabel

  #20   Steven Sullivan

ludovic mirabel wrote:
Steven Sullivan wrote in message et...
Nousaine wrote:

This is what always happens with 'bad news.' Instead of giving us contradictory
evidence we get endless wishful 'data-dredging' to find any possible reason to
ignore the evidence.


In any other circle when one thinks the results of a given experiment are wrong
they just duplicate it showing the error OR produce a valid one with contrary
evidence.


Not necessarily. It's quite common for questions to be raised during peer
review of a scientific paper; it is then incumbent upon the *experimenter*, not
the critic, to justify his or her choice of protocol, or his/her explanation of
the results. Often this involves doing more experiments to address the reviewer's
concerns. Sometimes it merely involves explaining the results more clearly, or
in more qualified terms. If the experimenter feels the reviewer has ignored some
important point, that comes out too in the reply to the reviews.

I say all this having not yet visited the link, so I'm totally unbiased ;)


Bravo Mr. Sullivan. I hope you'll be as pleased to accept my applause
as I am to see your excellent exposure of the frequently-voiced
challenge to the ABX sceptics to "prove" their sceptical questions.


Actually, ludovic, what tends to happen far more often is that skeptics ask
subjectivists to prove *their* claims, which is quite proper.

Also, as I implied, the mere act of *questioning* does not make the question
well-founded or mean that it requires answering. In your case, I have
observed that they almost never are. In a peer review process, the
poor foundation and/or bad understanding behind such queries would be noted by
the experimenter, who would make his case to the editor, and the points would
not be required to be addressed.

There is no 'exposure' involved, here, except of your own agenda, as usual.

--
-S.



  #21   Thomas A

(Nousaine) wrote in message news:nWGVa.15987$YN5.14030@sccrnsc01...
(Thomas A) wrote:

(Nousaine) wrote in message
...
(Thomas A) wrote:

Is there any published DBT of amps, CD players or cables where the
number of trials is greater than 500?

If the difference is miniscule, it is likely that many "guesses"
are wrong; wouldn't it then require many trials to reveal any subtle
difference?

Thomas

With regard to amplifiers, as of May 1990 there had been such tests. In 1978
QUAD published an experiment with 576 trials. In 1980 Smith, Peterson and
Jackson published an experiment with 1104 trials; in 1989 Stereophile
published a 3530 trial comparison. In 1986 Clark & Masters published an
experiment with 772 trials. All were null.

There's a misconception that blind tests tend to have very small sample
sizes. As of 1990 the 23 published amplifier experiments had a mean of 426
trials and a median of 90. If we exclude the 3530 trial experiment the mean
becomes 285 trials. The median remains unchanged.


Ok thanks. Is it possible to get the numbers for each test? I would
like to see if it is possible to do a meta-analysis in the amplifier
case. The test by tagmclaren is an additional one:

http://www.tagmclaren.com/members/news/news77.asp

Thomas


I did just that in 1990 to answer the nagging question "has sample size and
barely audible difference hidden anything?" A summary of these data can be
found in The Proceedings of the 1990 AES Conference "The Sound of Audio" May
1990 in the paper "The Great Debate: Is Anyone Winning?" (www.aes.org)


Ok thanks. I'll look it up.


In general larger sample sizes did not produce more significant results and
there wasn't a relationship of criterion score to sample size.


Were the data from all experiments pooled? It might not be the best
way, if some experiments *did* include real audible differences but
the sample size was too small to reveal any statistically significant
difference, whereas others did not include a real audible difference.
Any measured responses in the experiments? Did any of the tests
include control tests where the difference was audible but subtle,
and then comparing e.g. different subjects? Were the "best scorers"
allowed to repeat the experiments in the main experiment? Many
questions, but they may be relevant when making a meta-analysis.
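
The naive pooled form of such a meta-analysis just adds correct answers and
trials across studies and tests the total against guessing. A sketch,
assuming scipy; only the trial counts 576, 1104 and 772 come from the
studies cited above, and the "correct" figures are hypothetical placeholders:

    from scipy.stats import binom

    studies = [(290, 576), (560, 1104), (390, 772)]  # (correct, trials)
    k = sum(c for c, _ in studies)                   # pooled correct answers
    n = sum(t for _, t in studies)                   # pooled trials
    print(binom.sf(k - 1, n, 0.5))                   # pooled one-sided p-value

As noted above, pooling like this can wash out a real effect confined to one
well-designed study, which is why the per-study numbers matter too.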

In addition, have any of the experiments used test signals in the LF
range (around 15-20 Hz) and high-capable subwoofers (120 dB SPL @ 20
Hz)? I'm just curious since the tests from the Swedish
Audio-Technical Society frequently identify amplifiers that roll off
in the low end using blind tests. It might not be said to be an
audible difference since the difference is perceived as a difference
in vibrations in the body. I think I mentioned this before. Also for
testing CD players, has anybody used a sin^2 pulse in evaluating
audible differences?


IME if there is a true just-audible difference scores tend to run high. For
example in tests I ran last summer scores were, as I recall, 21/23 and 17/21 in
two successive runs in a challenge where the session leader claimed a
transparent transfer. IOW results go from chance to strongly positive once
threshold has been reached.


Yes, I have come to similar conclusions myself in my own system.


You can test this for yourself at www.pcabx.com where Arny Krueger has
training sessions with increasing levels of difficulty. Also the codec testing
sites are a good place to investigate this issue.


I've tried the tests at Arny's site a couple of times, but I feel I
need better hardware to do these tests more accurately.

  #22   Jim West

In article nR%Va.24664$YN5.23125@sccrnsc01, Harry Lavo wrote:

You mean not accepting the "received truth" without doing my own analysis is
cherry picking, is that it Stewart? We are not allowed to point out
anomalies and ask "why"? "how come"? "what could be causing this?"


You are indeed cherry picking. With 12 individuals the probability that
one would appear to meet the 95% level is fairly high. Remember
that you can expect 1 in 20 to meet that level entirely at random. It
is not acceptable scientific practice to select specific data sub-sets
out of the complete set. Otherwise you could "prove" anything by simply
running enough trials and ignoring those you don't like. Check any peer
reviewed journal.

In any event, 11 out of 15 has a probability of 5.9% of occurring by chance.
That does not meet the 95% confidence level. It would be rejected in
a peer reviewed statistical study. (If that was the only data, more
trials would be called for. But it wasn't the only data.)

And would you explain why a significant level was reached on the "A" cable
test with 96 trials? Was that "cherry picking"? C'mon, Stewart, you know
better. In fact the real issue here is: if one cable can be so readily
picked out, why can't the other be? What is it in the test, procedure,
quality of the cables, order bias, or what? Something is rotten in the
beloved state of ABX here!


Where are you getting your numbers? The data they posted on the
web page showed that there were 52 correct answers in 96 trials.
At least 52 correct answers will occur entirely by chance 23.8 %
of the time. This is far from statistically significant.
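
Both figures check out against an exact binomial tail; in Python, assuming
scipy is available:

    from scipy.stats import binom

    print(binom.sf(10, 15, 0.5))   # P(X >= 11 of 15) ~= 0.059
    print(binom.sf(51, 96, 0.5))   # P(X >= 52 of 96) ~= 0.238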

  #23   Nousaine

(Thomas A) wrote:

.....some snip.....

Were the data from all experiments pooled? It might not be the best
way, if some experiments *did* include real audible differences but
the sample size was too small to reveal any statistically significant
difference, whereas others did not include a real audible difference.


Of the 23 tests only one had a sample size as small as 16. Three had sample
sizes of 40 or fewer.

Any measured responses in the experiments?

It was typical, but not universal, to verify frequency response. The most
common type of significant result involved amplifiers which were found to
have an operating malfunction.

Did any of the tests
include control tests where the difference was audible but subtle,
and then comparing e.g. different subjects?


These were power amplifiers, remember. One of the earlier ones went to
significant effort to track down subtle, hey ...ANY, differences and was
unable to find them.

Where the "best scorers"
allowed to repeat the experiments in the main experiment?


This did not appear to be part of the protocol for any, but subject
analysis was common.

Many
questions, but they may be relevant when making a meta-analysis.


In addition, have any of the experiments used test signals in the LF
range (around 15-20 Hz) and high-capable subwoofers (120 dB SPL @ 20
Hz)?


No. But there are no commercially available subwoofers that will do 120+ dB at
2 meters in a real room. I've tested dozens and dozens and the only ones with
this capability are custom.

I'm just curious since the tests from the Swedish
Audio-Technical Society frequently identify amplifiers that roll off
in the low end using blind tests.


The typical half power point for my stock of a dozen power amplifiers is 6 Hz.
I've not seen the SATS data though.

It might not be said to be an
audible difference since the difference is perceived as a difference
in vibrations in the body. I think I mentioned this before. Also for
testing CD players, has anybody used a sin^2 pulse in evaluating
audible differences?


Not that I know of.

  #24   John Corbett

In article zVGVa.15179$Ho3.2323@sccrnsc03, "Harry Lavo"
wrote:

Thomas -

Thanks for the post of the Tag Mclaren test link (and to Tom for the other
references). I've looked at the Tag link and suspect it's going to add to
the controversy here. My comments on the test follow.

From the tone of the web info on this test, one can presume that Tag set out
to show its relatively inexpensive gear was just as good as some
acknowledged industry standards. But... one wonders why Tag chose the 99%
confidence level? Being careful *not* to say that it was prechosen in
advance? Is it because, had they used the more common and almost
universally-used 95% level, it would have shown that:

* When cable A was the "X" it was recognized at a significant level by the
panel (and guess whose cable probably would "lose" in a preference test
versus a universally recognized standard of excellence chosen as "tops" by
both Stereophile and TAS, as well as by other industry publishers)

* One individual differentiated both cable A and combined cables at the
significant level

Results summarized as follows:

Tag Mclaren Published ABX Results

(Columns: sample size; score needed for 99% confidence; score needed
for 95% confidence; actual score; confidence level actually reached.)

          Sample    99%    95%    Actual   Confidence

Total Test

Cables
  A           96     60    53 e      52   94.8% e
  B           84     54    48 e      38   coin toss
  Both       180    107    97 e      90   coin toss

Amps
  A           96     60    53 e      47   coin toss
  B           84     54    48 e      38   coin toss
  Both       180    107    97 e      85   coin toss

Top Individuals

Cables
  A            8      8     7         6   94.5%
  B            7      7     7         5   83.6%
  Both        15     13    11        11   95.8%

Amps
  A            8      8     7         5   83.6%
  B            7      7     7         5   83.6%
  Both        15     13    11        10   90.8%

e = extrapolated based on scores for 100 and 50 sample size

[snip]

I'd like to hear other views on this test.

Mr. Lavo, here are some comments on your numbers.

The short story: your numbers are bogus.

The long story follows.

I don't know how you came up with critical values for what you
think is a reasonable level of significance.

For n = 96 trials, the critical values are
60 for .01 level of significance
57 for .05 level of significance
53 for .20 level of significance

for n = 84 trials, the critical values are
54 for .01 level of significance
51 for .05 level of significance
47 for .20 level of significance

for n = 180 trials, the critical values are
107 for .01 level of significance
102 for .05 level of significance
97 for .20 level of significance

for n = 8 trials, the critical values are
8 for .01 level of significance
7 for .05 level of significance
6 for .20 level of significance

for n = 7 trials, the critical values are
7 for .01 level of significance
7 for .05 level of significance
6 for .20 level of significance

for n = 15 trials, the critical values are
13 for .01 level of significance
12 for .05 level of significance
10 for .20 level of significance

The values you provide for what you call 95% confidence (i.e., .05 level
of significance) are almost the correct values for 20% significance.
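
Those critical values can be reproduced mechanically: the critical value is
the smallest score whose probability under pure guessing is at or below the
chosen significance level. A sketch in Python, assuming scipy is available:

    from scipy.stats import binom

    def critical_value(n, alpha):
        # smallest c with P(X >= c) <= alpha when p = 0.5
        return int(binom.isf(alpha, n, 0.5)) + 1

    for n in (96, 84, 180, 8, 7, 15):
        print(n, [critical_value(n, a) for a in (0.01, 0.05, 0.20)])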

You make much of an apparently borderline significant result,
where the best individual cable test scores were 11 of 15 correct.

If that had been the entire experiment, we would have a p-value
of .059, reflecting the probability that one would do at least that
well in a single run of 15 trials.
That is, the probability that someone would score 11, 12, 13, 14, or 15
correct just by guessing is .059; also, the probability that the score is
less than 11 would be 1 - .059 = .941 for a single batch of 15 trials.

But what was reported was the best such performance in a dozen sets of
trials. That's not the same as a single run of 15 trials.

The probability of at least one of 12 subjects doing at least as well as
11 of 15 is 1 - [ probability that all 12 do worse than 11 of 15 ].
Thus we get 1 - (.941)^(12), which is about .52.

So, even your star performer is not doing better than chance suggests
he should.
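
The same best-of-twelve adjustment in a few lines of Python (scipy assumed):

    from scipy.stats import binom

    p_single = binom.sf(10, 15, 0.5)    # P(X >= 11 of 15) ~= 0.059
    print(1 - (1 - p_single) ** 12)     # best of 12 listeners ~= 0.52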

Mr. Lavo, your conjecture (that the test organizers have tried
to distort the results to fit an agenda) appears to be without support.

Now for some comments on the TAG McLaren report itself.

There are problems with some numbers provided by TAG McLaren, but they
are confined to background material.
There do not appear to be problems with the actual report of experimental
results.

TAG McLaren claims that you need more than 10 trials to obtain results
significant at the .01 level, but they are wrong. In fact, 7 trials suffice.
With 10 trials you can reach .001.
There is a table just before the results section of their report with some
discussion about small sample sizes. The first several rows of that table have
bogus numbers in the third column, and their sample size claims are based
on those wrong numbers.
However, the values for 11 or more trials are correct.
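
The minimum-trials arithmetic is easy to verify: a perfect run of n correct
out of n has guessing probability 0.5**n.

    for n in (7, 10):
        print(n, 0.5 ** n)   # 7 -> 0.0078 (< .01); 10 -> 0.00098 (< .001)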

As I have already noted, the numbers used in the report itself appear to
be correct.

Some would argue that their conclusion should be that they found no
evidence to support a claim of audible difference, rather than concluding
that there was no audible difference, but that's another issue.

You have correctly noted concerns about the form of ABX presentation (not
the same as the usual ABX scheme discussed on rahe) but that does not
invalidate the experiment.

There are questions about how the sample of listeners was obtained.

For the most part, TAG McLaren seems to have designed a test, carried it
out, run the numbers properly, and then accurately reported what they did.
That's more than can be said for many tests.

JC

  #25   Stewart Pinkerton

On Thu, 31 Jul 2003 03:15:31 GMT, "Harry Lavo"
wrote:

And would you explain why a significant level was reached on the "A" cable
test with 96 trials? Was that "cherry picking"? C'mon, Stewart, you know
better. In fact the real issue here is: if one cable can be so readily
picked out, why can't the other be? What is it in the test, procedure,
quality of the cables, order bias, or what? Something is rotten in the
beloved state of ABX here!


All that the above proves (especially since the results were marginal
at best) is that there's a random distribution in the results which
favoured A *on this occasion*. Run that trial again, and I'll put an
even money bet that you'll have a slight bias in favour of B.

Anyone familiar with Statistical Process Control is well aware that
one swallow doesn't make a summer.
--

Stewart Pinkerton | Music is Art - Audio is Engineering



  #26   Harry Lavo

"Jim West" wrote in message
news:fnaWa.19572$cF.7720@rwcrnsc53...
In article nR%Va.24664$YN5.23125@sccrnsc01, Harry Lavo wrote:

You mean not accepting the "received truth" without doing my own analysis is
cherry picking, is that it Stewart? We are not allowed to point out
anomalies and ask "why"? "how come"? "what could be causing this?"


You are indeed cherry picking. With 12 individuals the probability that
one would appear to meet the 95% level is fairly high. Remember
that you can expect 1 in 20 to meet that level entirely at random. It
is not acceptable scientific practice to select specific data sub-sets
out of the complete set. Otherwise you could "prove" anything by simply
running enough trials and ignoring those you don't like. Check any peer
reviewed journal.


Yep it is more probable than one in twenty. But not so high that we have to
accept your assumption that he/she *IS* the one-in-twenty.

In any event, 11 out of 15 has a probability of 5.9% of occurring by chance.
That does not meet the 95% confidence level. It would be rejected in
a peer reviewed statistical study. (If that was the only data, more
trials would be called for. But it wasn't the only data.)


My mistake... you are correct. So the number is only 94.1%. But 12 out of 15
yields 98.2%. Which is closer to the 95% standard?

And would you explain why a significant level was reached on the "A" cable
test with 96 trials? Was that "cherry picking"? C'mon, Stewart, you know
better. In fact the real issue here is: if one cable can be so readily
picked out, why can't the other be? What is it in the test, procedure,
quality of the cables, order bias, or what? Something is rotten in the
beloved state of ABX here!



Where are you getting your numbers? The data they posted on the
web page showed that there were 52 correct answers in 96 trials.
At least 52 correct answers will occur entirely by chance 23.8 %
of the time. This is far from statistically significant.


Oops! Late at night and the only book of binomial probabilities I had handy
contained only raw data, not the cumulative error tables that I was used to
from my previous work. So I goofed. I agree that the three-to-one odds are
not statistically significant, and so the full panel results are null for
both cables. My apologies.

The issue with the individuals still stands, however.

  #27   Jim West

In article l2mWa.36515$Ho3.6598@sccrnsc03, Harry Lavo wrote:
"Jim West" wrote in message
news:fnaWa.19572$cF.7720@rwcrnsc53...

You are indeed cherry picking. With 12 individuals the probability that
one would appear to meet the 95% level is fairly high. Remember
that you can expect 1 in 20 to meet that level entirely at random. It
is not acceptable scientific practice to select specific data sub-sets
out of the complete set. Otherwise you could "prove" anything by simply
running enough trials and ignoring those you don't like. Check any peer
reviewed journal.


Yep it is more probable than one in twenty. But not so high that we have to
accept your assumption that he/she *IS* the one-in-twenty.


Where did I assume that?

In any event, 11 out of 15 has a probability of 5.9% of occurring by chance.
That does not meet the 95% confidence level. It would be rejected in
a peer reviewed statistical study. (If that was the only data, more
trials would be called for. But it wasn't the only data.)


My mistake... you are correct. So the number is only 94.1%. But 12 out of 15
yields 98.2%. Which is closer to the 95% standard?


Irrelevant. You can't arbitrarily play with the numbers.


Where are you getting your numbers? The data they posted on the
web page showed that there were 52 correct answers in 96 trials.
At least 52 correct answers will occur entirely by chance 23.8 %
of the time. This is far from statistically significant.


Oops! Late at night and the only book of binomial probabilities I had handy
contained only raw data, not the cumulative error tables that I was used to
from my previous work. So I goofed. I agree that the three-to-one odds are
not statistically significant, and so the full panel results are null for
both cables. My apologies.

The issue with the individuals still stands, however.


No it doesn't. Even with cherry picking (which is not valid anyway)
no individual performed at a statistically significant level. Period.

  #28   Arny Krueger

"Harry Lavo" wrote in message
news:93mWa.36669$uu5.4445@sccrnsc04
"normanstrong" wrote in message
news:lfaWa.29024$uu5.3508@sccrnsc04...
Golly, after reading the entire article, I'm impressed. Faced with
these results, it's pretty hard to attack McLaren's article as being
poor science.

http://www.tagmclaren.com/members/news/news77.asp

I see nothing in the results that is inconsistent with the hypothesis
that there are no audible differences in either the cables or the amps.


Then you have not looked closely at and thought about some of the
results/issues I have raised.


I know Norm pretty well and he's very heavy into statistics, courtesy of a
successful career at Fluke as a test equipment designer. He looks at
discussions of statistics with a very critical, practical eye.

As far as the critical issues that have been raised, I think that they speak
for themselves, pretty weakly.

  #29   Stewart Pinkerton

On Fri, 01 Aug 2003 04:32:05 GMT, "Harry Lavo"
wrote:

"normanstrong" wrote in message
news:lfaWa.29024$uu5.3508@sccrnsc04...
Golly, after reading the entire article, I'm impressed. Faced with
these results, it's pretty hard to attack McLaren's article as being
poor science.

http://www.tagmclaren.com/members/news/news77.asp

I see nothing in the results that is inconsistent with the hypothesis
that there are no audible differences in either the cables or the amps.


Then you have not looked closely at and thought about some of the
results/issues I have raised.


They have indeed been closely examined, and your distortions and wild
speculations have been debunked by several posters.

Now, since the TAG results are yet another nail in your 'everything
sounds different' coffin, exactly where is there one single *shred* of
evidence to support your position?

As a sad footnote, it must be observed that TAG-McLaren is one of the
very few 'high end' companies which brought genuine engineering talent
to bear on an attempt to improve the quality of music reproduction in
the home, backed by considerable financial muscle. Regrettably (but
perhaps predictably), they found that honesty and skill were *not* the
way to make money in so-called 'high end' audio. R.I.P.......
--

Stewart Pinkerton | Music is Art - Audio is Engineering
  #30   Report Post  
Thomas A
 
Posts: n/a
Default Blindtest question

(Nousaine) wrote in message news:1KlWa.36390$uu5.4253@sccrnsc04...
(Thomas A) wrote:

....some snip.....

some more snip...

Were the "best scorers"
allowed to repeat the experiments in the main experiment?


This did not appear to be part of the protocol for any, but subject-by-subject
analysis was common.


My experience with blindtests is that results do vary among subjects.
I made a test where the CD players were not level-matched (something
just below 0.5 dB difference). Two persons, including myself, could
verify a difference in blindtest in my home setup (bass-rich music, no
test signals used), whereas two other persons were unable to do it.
Thus a retest with the best-scorers in a test is something which might
be desirable.

Many questions, but they may be relevant when making a meta-analysis.


In addition, have any of the experiments used test signals in the LF
range (around 15-20 Hz) and high-capable subwoofers (120 dB SPL @ 20
Hz)?


No. But there are no commercially available subwoofers that will do 120+ dB at
2 meters in a real room. I've tested dozens and dozens and the only ones with
this capability are custom.


Agree that commercial subs with 120+ dB in room are hard to find.


I'm just curious, since the tests from the Swedish
Audio-Technical Society frequently identify amplifiers that roll off
in the low end using blind tests.


The typical half power point for my stock of a dozen power amplifiers is 6 Hz.
I've not seen the SATS data though.


Have you ever tested yourself? You have a quite bass-capable system if
I remember correctly. You would need music with very deep and
high-quality bass, or test-tones, and perhaps a setup as described in
the link (a reference amp). The reference amp used by SATS for many
years has been the NAD 208. Other amps rated good include, e.g., the Rotel
RB1090. I am not sure at the moment which ones were rated
"not-so-good", but I can look it up. The method they use is a
"before-and-after" test.

http://www.sonicdesign.se/amptest.htm


It might not be said to be an
audible difference since the difference is perceived as a difference
in vibrations in the body. I think I mentioned this before. Also, for
testing CD players, has anybody used a sin^2 pulse in evaluating
audible differences?


Not that I know of.


Ok. Maybe somebody (Arny?) could present scope pictures of sin^2 pulses
of various DACs and CD players? In addition, present them as audio
files on the pcabx site? It would be interesting to see whether those
players or DACs with distorted pulses (they do exist...) could be
revealed in a DBT. Especially players with one-bit and true multi-bit.
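
For anyone who wants to try this, here is a rough Python sketch of how such
a test file might be generated (numpy plus the standard wave module; the
sample rate, pulse width, level, and file name are arbitrary choices of
mine, not anything specified in this thread):

import numpy as np
import wave

fs = 96000                    # sample rate, Hz (assumed)
f0 = 1000.0                   # pulse width parameter: pulse lasts 1/f0 s (assumed)
n = int(fs / f0)
t = np.arange(n) / fs
pulse = np.sin(np.pi * f0 * t) ** 2     # one sin^2 pulse, zero at both ends

buf = np.zeros(fs // 10)                # 100 ms of silence to embed the pulse in
start = len(buf) // 2
buf[start:start + n] = pulse
samples = (0.5 * buf * 32767).astype(np.int16)   # -6 dBFS peak, leaving headroom

with wave.open("sin2_pulse.wav", "wb") as w:
    w.setnchannels(1)                   # mono
    w.setsampwidth(2)                   # 16-bit
    w.setframerate(fs)
    w.writeframes(samples.tobytes())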


  #31   Report Post  
Harry Lavo
 
Posts: n/a
Default Blindtest question

"Stewart Pinkerton" wrote in message
news:iRlWa.36496$uu5.4473@sccrnsc04...
On Thu, 31 Jul 2003 03:15:31 GMT, "Harry Lavo"
wrote:

And would you explain why a significant level was reached on the "A" cable
test with 96 trials? Was that "cherry picking"? C'mon, Stewart, you know
better. In fact the real issue here is: if one cable can be so readily
picked out, why can't the other be? What is it in the test, procedure,
quality of the cables, order bias, or what? Something is rotten in the
beloved state of ABX here!


All that the above proves (especially since the results were marginal
at best) is that there's a random distribution in the results which
favoured A *on this occasion*. Run that trial again, and I'll put an
even money bet that you'll have a slight bias in favour of B.

Anyone familiar with Statistical Process Control is well aware that
one swallow doesn't make a summer.
--


This wasn't one swallow, Stewart...this was 96 swallows in one case and 84
in another. A whole pride of swallows. So the odds of such severe swings
in opposite directions (even if only marginal) are not very high over the
whole pride. We are not talking a sample of two here.
  #32   Report Post  
Harry Lavo
 
Posts: n/a
Default Blindtest question

"John Corbett" wrote in message
news:hQlWa.37169$YN5.32913@sccrnsc01...
In article zVGVa.15179$Ho3.2323@sccrnsc03, "Harry Lavo"
wrote:

Thomas -

Thanks for the post of the Tag Mclaren test link (and to Tom for the other
references). I've looked at the Tag link and suspect it's going to add to
the controversy here. My comments on the test follow.

From the tone of the web info on this test, one can presume that Tag set out
to show its relatively inexpensive gear was just as good as some
acknowledged industry standards. But....wonder why Tag chose the 99%
confidence level? Being careful *not* to say that it was prechosen in
advance? It is because had they used the more common and almost
universally-used 95% level it would have shown that:

* When cable A was the "X" it was recognized at a significant level by the
panel (and guess whose cable probably would "lose" in a preference test
versus a universally recognized standard of excellence chosen as "tops" by
both Stereophile and TAS, as well as by other industry publishers)

* One individual differentiated both cable A and combined cables at the
significant level

Results summarized as follows:

Tag Mclaren Published ABX Results

                 Sample    99%    95%    Actual    Test
                 Total                              Confidence
Cables
  A                96        60    53 e     52      94.8% e
  B                84        54    48 e     38      coin toss
  Both            180       107    97 e     90      coin toss

Amps
  A                96        60    53 e     47      coin toss
  B                84        54    48 e     38      coin toss
  Both            180       107    97 e     85      coin toss

Top Individuals

Cables
  A                 8         8     7        6      94.5%
  B                 7         7     7        5      83.6%
  Both             15        13    11       11      95.8%

Amps
  A                 8         8     7        5      83.6%
  B                 7         7     7        5      83.6%
  Both             15        13    11       10      90.8%

e = extrapolated based on scores for 100 and 50 sample size

[snip]

I'd like to hear other views on this test.

Mr. Lavo, here are some comments on your numbers.

The short story: your numbers are bogus.

The long story follows.

I don't know how you came up with critical values for what you
think is a reasonable level of significance.

For n = 96 trials, the critical values are:
60 for .01 level of significance
57 for .05 level of significance
53 for .20 level of significance

for n = 84 trials, the critical values are:
54 for .01 level of significance
51 for .05 level of significance
47 for .20 level of significance

for n = 180 trials, the critical values are:
107 for .01 level of significance
102 for .05 level of significance
97 for .20 level of significance

for n = 8 trials, the critical values are:
8 for .01 level of significance
7 for .05 level of significance
6 for .20 level of significance

for n = 7 trials, the critical values are:
7 for .01 level of significance
7 for .05 level of significance
6 for .20 level of significance

for n = 15 trials, the critical values are:
13 for .01 level of significance
12 for .05 level of significance
10 for .20 level of significance

The values you provide for what you call 95% confidence (i.e., .05 level
of significance) are almost the correct values for 20% significance.
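
A short Python sketch (standard library only; the helper name is mine) that
reproduces the critical values listed above by direct summation of the
binomial distribution:

from math import comb

def critical_value(n, alpha):
    # Smallest k such that P(X >= k) <= alpha for X ~ Binomial(n, 0.5),
    # i.e. the minimum score needed for significance at level alpha.
    tail = 0.0
    for k in range(n, -1, -1):
        tail += comb(n, k) / 2 ** n
        if tail > alpha:
            return k + 1
    return 0

for n in (96, 84, 180, 8, 7, 15):
    # prints n followed by the critical values at .01, .05, and .20
    print(n, [critical_value(n, a) for a in (0.01, 0.05, 0.20)])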

You make much of an apparently borderline significant result,
where the best individual cable test scores were 11 of 15 correct.

If that had been the entire experiment, we would have a p-value
of .059, reflecting the probability that one would do at least that
well in a single run of 15 trials.
That is, the probability that someone would score 11, 12, 13, 14, or 15
correct just by guessing is .059; also, the probability that the score is
less than 11 would be 1 - .059 = .941 for a single batch of 15 trials.

But what was reported was the best such performance in a dozen sets of
trials. That's not the same as a single run of 15 trials.

The probability of at least one of 12 subjects doing at least as well as
11 of 15 is 1 - [ probability that all 12 do worse than 11 of 15 ].
Thus we get 1 - (.941)^(12), which is about .52.

So, even your star performer is not doing better than chance suggests
he should.
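
That "best of twelve" correction is easy to verify; a minimal Python sketch
(standard library only), assuming 12 independent guessing subjects:

from math import comb

p_single = sum(comb(15, i) for i in range(11, 16)) / 2 ** 15
p_best_of_12 = 1 - (1 - p_single) ** 12

print(round(p_single, 3))      # 0.059 -- one subject scores 11+ of 15
print(round(p_best_of_12, 2))  # 0.52  -- at least one of 12 subjects does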

Mr. Lavo, your conjecture (that the test organizers have tried
to distort the results to fit an agenda) appears to be without support.

Now for some comments on the TAG McLaren report itself.

There are problems with some numbers provided by TAG McLaren, but they
are confined to background material.
There do not appear to be problems with the actual report of experimental
results.

TAG McLaren claims that you need more than 10 trials to obtain results
significant at the .01 level, but they are wrong. In fact, 7 trials suffice.
With 10 trials you can reach .001.
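
(To check the arithmetic: a perfect score on n trials occurs by guessing
with probability (1/2)^n, so 7 straight correct answers has probability
1/128, about .0078, which is below .01, and 10 straight has probability
1/1024, about .001.)
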
There is a table just before the results section of their report with some
discussion about small sample sizes. The first several rows of that table have
bogus numbers in the third column, and their sample size claims are based
on those wrong numbers.
However, the values for 11 or more trials are correct.

As I have already noted, the numbers used in the report itself appear to
be correct.

Some would argue that their conclusion should be that they found no
evidence to support a claim of audible difference, rather than concluding
that there was no audible difference, but that's another issue.

You have correctly noted concerns about the form of ABX presentation (not
the same as the usual ABX scheme discussed on rahe) but that does not
invalidate the experiment.

There are questions about how the sample of listeners was obtained.

For the most part, TAG McLaren seems to have designed a test, carried it
out, run the numbers properly, and then accurately reported what they did.
That's more than can be said for many tests.


John -

I have explained that I made an error, and I thank you for pointing it out.
I also explained how and why, but that is an explanation, not an excuse.

Perhaps you could bring your statistical skills to bear on the Greenhill
test as reported in Stereo Review in 1983. The raw results were posted here
by me and by Ludovic about a year ago. As I recall one of the participants
in that test did very well across several tests...I'd be interested in your
calculation of the probability of his achieving those results by chance.
This is not a troll...the mathematics of it are simply beyond me and I tried
at the time to calculate the odds and apparently failed. The argument was:
outlier, or golden ear.

  #33   Report Post  
Harry Lavo
 
Posts: n/a
Default Blindtest question

"Stewart Pinkerton" wrote in message
...
On Fri, 01 Aug 2003 04:31:13 GMT, "Harry Lavo"
wrote:

"Jim West" wrote in message
news:fnaWa.19572$cF.7720@rwcrnsc53...
In article nR%Va.24664$YN5.23125@sccrnsc01, Harry Lavo wrote:

You mean not accepting the "received truth" without doing my own analysis
is cherry picking, is that it, Stewart?


No, attempting to extract *only* those sub-tests which agree with your
prejudices is 'cherry picking', and even then, you can't make it
stick on the numbers in that series of tests.

We are not allowed to point out
anomalies and ask "why"? "how come"? "what could be causing this?"

You are indeed cherry picking. With 12 individuals the probability that
one would appear to meet the 95% level is fairly high. Remember
that you can expect 1 in 20 to meet that level entirely at random. It
is not acceptable scientific practice to select specific data sub-sets
out of the complete set. Otherwise you could "prove" anything by simply
running enough trials and ignoring those you don't like. Check any peer
reviewed journal.

Yep, it is more probable than one in twenty. But not so high that we have to
accept your assumption that he/she *IS* the one-in-twenty.


Unfortunately for your speculation, that individual did poorly in the
amplifier test. Similarly, the best scorers in the amp test did poorly
on cables. IOW, the results were *random*, and did not show *any* sign
of a genuine audible difference, despite your many and convoluted
attempts to distort the data to fit your agenda.


Ah, the "received truth", better known as dogma. And why, pray tell,
Stewart, is your explanation any more valid than my supposition that the
cable test and the amp test may have revealed different attributes of
reproduction of that particular musical piece at the margin, and that the
two men were each more sensitive to one than the other?

  #34   Report Post  
Harry Lavo
 
Posts: n/a
Default Blindtest question

"Jim West" wrote in message
et...
In article l2mWa.36515$Ho3.6598@sccrnsc03, Harry Lavo wrote:
"Jim West" wrote in message
news:fnaWa.19572$cF.7720@rwcrnsc53...

You are indeed cherry picking. With 12 individuals the probability that
one would appear to meet the 95% level is fairly high. Remember
that you can expect 1 in 20 to meet that level entirely at random. It
is not acceptable scientific practice to select specific data sub-sets
out of the complete set. Otherwise you could "prove" anything by simply
running enough trials and ignoring those you don't like. Check any peer
reviewed journal.


Yep, it is more probable than one in twenty. But not so high that we have to
accept your assumption that he/she *IS* the one-in-twenty.


Where did I assume that?


The dismissal of any single participant scoring at a significant or
near-significant level here is almost always explained away as an outlier.
In some cases this seems true...in others not so true. To say for sure that
somebody is an outlier means one is assuming a certainty of "1", in other
words, 100% sure, as opposed to saying there is a 25% chance that he may be
an outlier, or a 50/50 chance. So implicit in the assertion, if the true
probability for an individual in a given trial is one-in-twenty and you have
12 subjects, is that the "one" who scored well *is* the one-in-twenty. Which
may or may not be the case. But it is certainly no sure thing.

In any event, 11 out of 15 has a probability of 5.9% of occurring by chance.
That does not meet the 95% confidence level. It would be rejected in
a peer reviewed statistical study. (If that was the only data, more
trials would be called for. But it wasn't the only data.)


My mistake...you are correct. So the number is only 94.1%. But 12 out of 15
yields 98.2%. Which is closer to the 95% standard?


Irrelevant. You can't arbitrarily play with the numbers.


I'm not playing with numbers. When you establish a significance level you
are establishing a certainty level that the community considers acceptable
odds. A 95% confidence level reflects one-in-twenty odds. A 98% confidence
level reflects one-in-fifty odds. 94.1% reflects roughly one-in-seventeen
odds. Which seems qualitatively closer to one-in-twenty to you?

With a large sample size, stating the number needed for significance is
okay...exceeding it by one doesn't distort results. But with small sample
sizes such as we are talking about here, there is a big statistical
difference between 11 of 15 and 12 of 15. And the eleven-of-fifteen score
comes close to the 95% standard.
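
(The conversion is simply one-in-N odds with N = 1/(1 - c) for confidence
level c: 1/(1 - .95) = 20, 1/(1 - .98) = 50, and 1/(1 - .941) is roughly 17.)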


Where are you getting your numbers? The data they posted on the
web page showed that there were 52 correct answers in 96 trials.
At least 52 correct answers will occur entirely by chance 23.8 %
of the time. This is far from statistically significant.


Oops! Late at night and the only book of binomial probabilities I had handy
contained only raw data, not the cumulative error tables that I was used to
from my previous work. So I goofed. I agree that the three-to-one odds are
not statistically significant, and so the full panel results are null for
both cables. My apologies.

The issue with the individuals still stands, however.


No it doesn't. Even with cherry picking (which is not valid anyway)
no individual performed at a statistically significant level. Period.


Again, repeat the dogma (and the mantra). Don't bother to think about what
the numbers really translate to. I don't know of any social researcher
requiring odds of a 98% or 99% confidence level. Do you really believe 95% is
"significant" while 94.1% is not, in any meaningful, non-dogmatic way?

  #35   Report Post  
Stewart Pinkerton
 
Posts: n/a
Default Blindtest question

On 2 Aug 2003 00:38:44 GMT, "Harry Lavo" wrote:

"Stewart Pinkerton" wrote in message
news:iRlWa.36496$uu5.4473@sccrnsc04...


All that the above proves (especially since the results were marginal
at best) is that there's a random distribution in the results which
favoured A *on this occasion*. Run that trial again, and I'll put an
even money bet that you'll have a slight bias in favour of B.

Anyone familiar with Statistical Process Control is well aware that
one swallow doesn't make a summer.


This wasn't one swallow, Stewart...this was 96 swallows in one case and 84
in another. A whole pride of swallows.


That's a very poor flocking argument, since a proper evaluation of the
test, using *all* the results, shows that the amps and cables were
most definitely *not* sonically distinguishable. Further, those test
subjects who scored well on cables scored poorly on amps, and vice
versa, proving the point that they could not reliably distinguish any
differences.

So the odds of such severe swings
in opposite directions (even if only marginal) are not very high over the
whole pride. We are not talking a sample of two here.


We are also not talking of any 'swings' that have statistical
significance. This has been pointed out to you several times by
several people, but you still insist on attempting to distort the
results to fit your own prejudices.
--

Stewart Pinkerton | Music is Art - Audio is Engineering


  #36   Report Post  
Stewart Pinkerton
 
Posts: n/a
Default Blindtest question

On Sat, 02 Aug 2003 05:02:21 GMT, "Harry Lavo"
wrote:

"Stewart Pinkerton" wrote in message
...


Unfortunately for your speculation, that individual did poorly in the
amplifier test. Similarly, the best scorers in the amp test did poorly
on cables. IOW, the results were *random*, and did not show *any* sign
of a genuine audible difference, despite your many and convoluted
attempts to distort the data to fit your agenda.


Ah, the "received truth", better known as dogma.


No, the simplest and most likely explanation of the results.

And why, pray tell,
Stewart, is your explanation any more valid than my supposition that the
cable test and the amp test may have revealed different attributes of
reproduction of that particular musical piece at the margin, and that the
two men were each more sensitive to one than the other.?


Occam's Razor. You are attempting to speculate that the dark side of
the Moon may indeed have large sections which are made of green
cheese. While possible, this is unlikely, and most reasonable people
would not put real money on it. The same applies to 'cable sound'.

--

Stewart Pinkerton | Music is Art - Audio is Engineering
  #37   Report Post  
Stewart Pinkerton
 
Posts: n/a
Default Blindtest question

On Sat, 02 Aug 2003 05:02:40 GMT, "Harry Lavo"
wrote:

"Jim West" wrote in message
. net...


Irrelevant. You can't arbitrarily play with the numbers.

I'm not playing with numbers. When you establish a significance level you
are establishing a certainty level that the community considers acceptable
odds. A 95% confidence level reflects one-in-twenty odds. A 98% confidence
level reflects one-in-fifty odds. 94.1% reflects roughly one-in-seventeen
odds. Which seems qualitatively closer to one-in-twenty to you?

With a large sample size, stating the number needed for significance is
okay...exceeding it by one doesn't distort results. But with small sample
sizes such as we are talking about here, there is a big statistical
difference between 11 of 15 and 12 of 15. And the eleven-of-fifteen score
comes close to the 95% standard.


Fine. Now take the fact that there were a dozen volunteers: the chance that
at least one of them hits 11 of 15 or better purely at random is
1 - (0.941)^12, or about 52%. That's not much more than even odds. Now tell
me that this has any significance whatever.

The issue with the individuals still stands, however.


No it doesn't. Even with cherry picking (which is not valid anyway)
no individual performed at a statistically signficant level. Period.

Again, repeat the dogma (and the mantra). Don't bother to think about what
the numbers really translate to. I don't know of any social researcher
requiring odds of 98% or 99% confidence level. Do you really believe 95% is
"significant" while 94.2% is not, in any meaningful, non-dogmatic way?


More importantly, since there were 12 test subjects, the *real*
probability that at least one subject scores 11 out of 15 by chance
alone is not 5.9%, but about 52%. That is simply not significant by *any*
standard, especially when combined with the fact that those test subjects
who did well on cables did poorly on amps, and vice versa. It's all just
random chance, and all the cherry picking in the world won't change that.
--

Stewart Pinkerton | Music is Art - Audio is Engineering
  #38   Report Post  
Nousaine
 
Posts: n/a
Default Blindtest question

(Thomas A) wrote:

(Nousaine) wrote in message
news:1KlWa.36390$uu5.4253@sccrnsc04...
(Thomas A) wrote:

....some snip.....

some more snip...

Were the "best scorers"
allowed to repeat the experiments in the main experiment?


This did not appear to be part of the protocol for any, but subject-by-subject
analysis was common.


My experience with blindtests is that results do vary among subjects.
I made a test where the CD players were not level-matched (something
just below 0.5 dB difference). Two persons, including myself, could
verify a difference in blindtest in my home setup (bass-rich music, no
test signals used), whereas two other persons were unable to do it.
Thus a retest with the best-scorers in a test is something which might
be desirable.


In a situation like this with 2 of 4 persons scoring significantly it's likely
that the overall score would be significant as well.

It was typical for analysis to be conducted on a subject-by-subject basis to
find significant individual scores. In tests where the overall result was null
there did not appear to be cases where individually positive scores were
masked by the totals.

I have offered retests in most of my personally conducted experiments. In these
I have experienced exactly one subject who asked to extend the number of trials
in an experiment, one who retook a test at my request, and one who accepted an
opportunity for a retest.

This covers perhaps a dozen formal tests and several dozen subjects.

Many questions, but they may be relevant when making a meta-analysis.


In addition, have any of the experiments used test signals in the LF
range (around 15-20 Hz) and high-capable subwoofers (120 dB SPL @ 20
Hz)?


No. But there are no commercially available subwoofers that will do 120+ dB at
2 meters in a real room. I've tested dozens and dozens and the only ones with
this capability are custom.


Agree that commercial subs with 120+ dB in room are hard to find.


I'm just curious, since the tests from the Swedish
Audio-Technical Society frequently identify amplifiers that roll off
in the low end using blind tests.


The typical half power point for my stock of a dozen power amplifiers is 6 Hz.
I've not seen the SATS data though.


Have you ever tested yourself? You have a quite bass-capable system if
I remember correctly. You would need music with very deep and
high-quality bass, or test-tones, and perhaps a setup as described in
the link (a reference amp).


I've measured the frequency response of the amplifiers and, yes, my subwoofer
will produce 120+ dB SPL from 12 to 62 Hz with less than 10% distortion.

Perhaps not surprisingly it takes a 5000 watt capable amplifier to make these
SPL levels but I've never felt 'cheated of bass' when the system is driven with
2 channels of a 250 wpc stereo amplifier with ordinary programs, some of which
have frequency content below 10 Hz.

The reference amp used by SATS for many years has been the NAD 208. Other
amps rated good include, e.g., the Rotel RB1090. I am not sure at the moment
which ones were rated "not-so-good", but I can look it up. The method they
use is a "before-and-after" test.

http://www.sonicdesign.se/amptest.htm


It might not be said to be an
audible difference since the difference is perceived as a difference
in vibrations in the body. I think I mentioned this before.


Basically you have to have the woofer displacement/amp power to start. I
haven't conducted a formal test about this but I'm guessing that speaker
displacement is a bigger issue than amplifier bandwidth. IOW I'm guessing that
most modern SS amplifiers have low-frequency bandwidth to cover modern programs
and that the basic limiting factor is the subwoofer transducer(s).

...snip remainder...
  #39   Report Post  
All Ears
 
Posts: n/a
Default Blindtest question

I can recommend blind-testing beer; nobody can taste any difference within
the same type of beer anyway. Lots of money to save....
