John Corbett

Blindtest question

In article zVGVa.15179$Ho3.2323@sccrnsc03, "Harry Lavo"
wrote:

Thomas -

Thanks for the post of the Tag Mclaren test link (and to Tom for the other
references). I've looked at the Tag link and suspect it's going to add to
the controversy here. My comments on the test follow.

From the tone of the web info on this test, one can presume that Tag set out
to show its relatively inexpensive gear was just as good as some
acknowledged industry standards. But I wonder why Tag chose the 99%
confidence level, being careful *not* to say that it was prechosen in
advance? Is it because, had they used the more common and almost
universally-used 95% level, it would have shown that:

* When cable A was the "X" it was recognized at a significant level by the
panel (and guess whose cable probably would "lose" in a preference test
versus a universally recognized standard of excellence chosen as "tops" by
both Stereophile and TAS, as well as by other industry publishers)

* One individual differentiated both cable A and combined cables at the
significant level

Results summarized as follows:

Tag Mclaren Published ABX Results

                 Sample    99%    95%    Actual   Confidence
                  Total                   Test

Cables
  A                 96      60    53 e     52     94.8% e
  B                 84      54    48 e     38     coin toss
  Both             180     107    97 e     90     coin toss

Amps
  A                 96      60    53 e     47     coin toss
  B                 84      54    48 e     38     coin toss
  Both             180     107    97 e     85     coin toss

Top Individuals

Cables
  A                  8       8     7        6     94.5%
  B                  7       7     7        5     83.6%
  Both              15      13    11       11     95.8%

Amps
  A                  8       8     7        5     83.6%
  B                  7       7     7        5     83.6%
  Both              15      13    11       10     90.8%

e = extrapolated based on scores for 100 and 50 sample size

[snip]

I'd like to hear other views on this test.

Mr. Lavo, here are some comments on your numbers.

The short story: your numbers are bogus.

The long story follows.

I don't know how you came up with critical values for what you
think is a reasonable level of significance.

For n = 96 trials, the critical values are:
60 for .01 level of significance
57 for .05 level of significance
53 for .20 level of significance

For n = 84 trials, the critical values are:
54 for .01 level of significance
51 for .05 level of significance
47 for .20 level of significance

For n = 180 trials, the critical values are:
107 for .01 level of significance
102 for .05 level of significance
97 for .20 level of significance

For n = 8 trials, the critical values are:
8 for .01 level of significance
7 for .05 level of significance
6 for .20 level of significance

For n = 7 trials, the critical values are:
7 for .01 level of significance
7 for .05 level of significance
6 for .20 level of significance

For n = 15 trials, the critical values are:
13 for .01 level of significance
12 for .05 level of significance
10 for .20 level of significance
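These critical values can be checked directly against the binomial
distribution with p = 1/2. Here is a short Python sketch (mine, not part of
the original post; the function name is arbitrary):

```python
from math import comb

def critical_value(n, alpha):
    """Smallest k such that P(X >= k) <= alpha when X ~ Binomial(n, 1/2),
    i.e., the minimum number of correct answers needed to beat
    "just guessing" at significance level alpha."""
    total = 2 ** n
    tail = 0
    # accumulate the upper tail P(X >= k) from k = n downward
    for k in range(n, -1, -1):
        tail += comb(n, k)
        if tail / total > alpha:
            return k + 1
    return 0

# Reproduces the lists above, e.g. n = 15 -> [13, 12, 10]
for n in (96, 84, 180, 8, 7, 15):
    print(n, [critical_value(n, a) for a in (0.01, 0.05, 0.20)])
```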

The values you provide for what you call 95% confidence (i.e., .05 level
of significance) are almost the correct values for 20% significance.

You make much of an apparently borderline significant result,
where the best individual cable test scores were 11 of 15 correct.

If that had been the entire experiment, we would have a p-value
of .059, reflecting the probability that one would do at least that
well in a single run of 15 trials.
That is, the probability that someone would score 11, 12, 13, 14, or 15
correct just by guessing is .059; also, the probability that the score is
less than 11 would be 1 - .059 = .941 for a single batch of 15 trials.

But what was reported was the best such performance in a dozen sets of
trials. That's not the same as a single run of 15 trials.

The probability of at least one of 12 subjects doing at least as well as
11 of 15 is 1 - [ probability that all 12 do worse than 11 of 15 ].
Thus we get 1 - (.941)^(12), which is about .52.
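That arithmetic is easy to verify. A short Python sketch (mine, under the
same assumption the post makes of twelve independent listeners):

```python
from math import comb

# Exact probability of scoring 11 or more out of 15 by guessing
p_single = sum(comb(15, k) for k in range(11, 16)) / 2 ** 15
# p_single = 1941/32768, about .059

# Chance that the best of 12 independent guessers reaches 11 of 15
p_best_of_12 = 1 - (1 - p_single) ** 12
# about .52
```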

So, even your star performer is not doing better than chance suggests
he should.

Mr. Lavo, your conjecture (that the test organizers have tried
to distort the results to fit an agenda) appears to be without support.

Now for some comments on the TAG McLaren report itself.

There are problems with some numbers provided by TAG McLaren, but they
are confined to background material.
There do not appear to be problems with the actual report of experimental
results.

TAG McLaren claims that you need more than 10 trials to obtain results
significant at the .01 level, but they are wrong. In fact, 7 trials
suffice: a perfect 7-of-7 score has probability (1/2)^7, about .008,
under guessing. With 10 trials you can reach the .001 level.
There is a table just before the results section of their report with some
discussion about small sample sizes. The first several rows of that table have
bogus numbers in the third column, and their sample size claims are based
on those wrong numbers.
However, the values for 11 or more trials are correct.
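The minimum-trials point is simple to verify: a perfect run of n trials has
p-value 0.5**n, so the smallest adequate n is the first power of one-half at
or below the chosen level. A Python sketch (mine, not from the report):

```python
def min_trials(alpha):
    """Smallest n for which a perfect score (n of n) in a fair-coin
    ABX test is significant at level alpha: p-value = 0.5 ** n."""
    n = 1
    while 0.5 ** n > alpha:
        n += 1
    return n

print(min_trials(0.01))   # 7  (0.5**7  is about .0078)
print(min_trials(0.001))  # 10 (0.5**10 is about .00098)
```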

As I have already noted, the numbers used in the report itself appear to
be correct.

Some would argue that their conclusion should be that they found no
evidence to support a claim of audible difference, rather than concluding
that there was no audible difference, but that's another issue.

You have correctly noted concerns about the form of ABX presentation (not
the same as the usual ABX scheme discussed on rahe) but that does not
invalidate the experiment.

There are questions about how the sample of listeners was obtained.

For the most part, TAG McLaren seems to have designed a test, carried it
out, run the numbers properly, and then accurately reported what they did.
That's more than can be said for many tests.

JC