#1
Blindtest question
Is there any published DBT of amps, CD players or cables where the number of trials is greater than 500? If the difference is miniscule, it is likely that many "guesses" are wrong, and it would require many trials to reveal any subtle difference.

Thomas
#2
Blindtest question
Thomas A wrote:
> Is there any published DBT of amps, CD players or cables where the
> number of trials is greater than 500? If the difference is miniscule,
> it is likely that many "guesses" are wrong, and it would require many
> trials to reveal any subtle difference.

There are published tests where people claimed they could hear the difference sighted, but when they were 'blinded' they could not. In this case the argument that 500 trials are needed would seem to be weak. However, a real and miniscule difference would certainly be discerned more reliably if there were specific training to hear it beforehand.

--
-S.
#3
Blindtest question
Steven Sullivan wrote in message news:d_XUa.142496$GL4.36308@rwcrnsc53...
> Thomas A wrote:
> > Is there any published DBT of amps, CD players or cables where the
> > number of trials is greater than 500? If the difference is miniscule,
> > it is likely that many "guesses" are wrong, and it would require many
> > trials to reveal any subtle difference.
>
> There are published tests where people claimed they could hear the
> difference sighted, but when they were 'blinded' they could not. In this
> case the argument that 500 trials are needed would seem to be weak.

Yes, that's for sure. But how are scientific tests of just noticeable difference set up? A difference, when very small, could introduce more incorrect answers from the test subjects. Thus I think the question is interesting.

> However, a real and miniscule difference would certainly be discerned
> more reliably if there were specific training to hear it beforehand.

Yes, but still, if the difference is real and miniscule it could introduce incorrect answers even if there is specific training beforehand. If it were an all-or-nothing thing, then the result would always be 100% correct (difference) or 50% (no difference). What if the answers are 60% correct?
#4
Blindtest question
#6
Blindtest question
Thomas A wrote:
> Steven Sullivan wrote in message news:d_XUa.142496$GL4.36308@rwcrnsc53...
> > Thomas A wrote:
> > > [snip]
> >
> > There are published tests where people claimed they could hear the
> > difference sighted, but when they were 'blinded' they could not. In
> > this case the argument that 500 trials are needed would seem to be weak.
>
> Yes, that's for sure. But how are scientific tests of just noticeable
> difference set up? A difference, when very small, could introduce more
> incorrect answers from the test subjects. Thus I think the question is
> interesting.
>
> > However, a real and miniscule difference would certainly be discerned
> > more reliably if there were specific training to hear it beforehand.
>
> Yes, but still, if the difference is real and miniscule it could
> introduce incorrect answers even if there is specific training
> beforehand. If it were an all-or-nothing thing, then the result would
> always be 100% correct (difference) or 50% (no difference). What if the
> answers are 60% correct?

What level of certitude are you looking for? Scientists use statistical tools to calculate probabilities of different kinds of error in such cases.

--
-S.
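The "different kinds of error" can be shown concretely: fixing a pass/fail threshold pins down the Type I (false positive) rate, and the Type II rate then depends on how good the listener really is. A Python sketch, with an assumed 50-trial test and a hypothetical listener who is right 60% of the time:

```python
from math import comb

def tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

n = 50
# Smallest passing score whose false-positive rate is at most 5%:
threshold = min(k for k in range(n + 1) if tail(n, k, 0.5) <= 0.05)

alpha = tail(n, threshold, 0.5)   # Type I error: a pure guesser "passes"
power = tail(n, threshold, 0.6)   # chance a true 60% listener passes
beta = 1 - power                  # Type II error: a real difference is missed
```

Even with 50 trials, a listener who really does hear the difference 60% of the time fails this test roughly two times in three, which is why a modest test can report "no difference" without settling much.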
#7
Blindtest question
Steven Sullivan wrote in message news:UqnVa.4003$Oz4.1480@rwcrnsc54...
> Thomas A wrote:
> > [snip]
> > If it were an all-or-nothing thing, then the result would always be
> > 100% correct (difference) or 50% (no difference). What if the answers
> > are 60% correct?
>
> What level of certitude are you looking for? Scientists use statistical
> tools to calculate probabilities of different kinds of error in such
> cases.

Well, confidence limits of 95% or 99% are usually applied. The power of the test is, however, important when you approach the audible limit. Also, with sample sizes over 200 you need not use the correction for continuity in the statistical calculation. I am not sure, but I think this correction applies when sample sizes are 25-200. Below 25, this correction is not sufficient.
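The continuity correction being discussed can be compared directly against the exact binomial tail. A Python sketch (the 32-of-50 score is just an illustrative mid-sized case, not a figure from the thread):

```python
from math import comb, erf, sqrt

def exact_tail(n: int, k: int) -> float:
    """Exact P(X >= k) under guessing (p = 0.5)."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

def normal_tail(n: int, k: int, continuity: bool = True) -> float:
    """Normal approximation to the same tail, with or without the
    correction for continuity (k - 0.5 in place of k)."""
    mu, sigma = n / 2, sqrt(n) / 2
    x = k - 0.5 if continuity else k
    z = (x - mu) / sigma
    return 0.5 * (1 - erf(z / sqrt(2)))

p_exact = exact_tail(50, 32)
p_corrected = normal_tail(50, 32, continuity=True)
p_uncorrected = normal_tail(50, 32, continuity=False)
```

In this middle range the corrected approximation tracks the exact tail closely, while dropping the correction overstates significance; for very small samples neither approximation is trustworthy and the exact tail should be used directly, consistent with the rule of thumb described in the post.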
#8
Blindtest question
"Thomas A" wrote in message
newsBUUa.141509$OZ2.27088@rwcrnsc54... Is there any published DBT of amps, CD players or cables where the number of trials are greater than 500? I've never seen one. It would be difficult to get a single subject to do that many trials. So, it would have to be many subjects and they would have to be isolated to prevent subtle influence from one to the other. If there difference is miniscule there is likely that many "guesses" are wrong and would require many trials to reveal any subtle difference? (Note that the word is spelled "minuscule.") Norm Strong |
#9
Blindtest question
#11
Blindtest question
Thomas -

Thanks for the post of the Tag McLaren test link (and to Tom for the other references). I've looked at the Tag link and suspect it's going to add to the controversy here. My comments on the test follow.

From the tone of the web info on this test, one can presume that Tag set out to show its relatively inexpensive gear was just as good as some acknowledged industry standards. But... wonder why Tag chose the 99% confidence level, being careful *not* to say that it was prechosen in advance? It is because had they used the more common and almost universally-used 95% level it would have shown that:

* When cable A was the "X" it was recognized at a significant level by the panel (and guess whose cable probably would "lose" in a preference test versus a universally recognized standard of excellence chosen as "tops" by both Stereophile and TAS, as well as by other industry publishers)
* One individual differentiated both cable A and the combined cables at the significant level

Results summarized as follows (the 99% and 95% columns are the scores needed to pass at that confidence level):

Tag McLaren Published ABX Results

Test            Total    99%    95%    Actual    Confidence
Cables   A        96      60    53 e     52       94.8% e
         B        84      54    48 e     38       coin toss
         Both    180     107    97 e     90       coin toss
Amps     A        96      60    53 e     47       coin toss
         B        84      54    48 e     38       coin toss
         Both    180     107    97 e     85       coin toss

Top Individuals
Cables   A         8       8     7        6       94.5%
         B         7       7     7        5       83.6%
         Both     15      13    11       11       95.8%
Amps     A         8       8     7        5       83.6%
         B         7       7     7        5       83.6%
         Both     15      13    11       10       90.8%

e = extrapolated based on scores for 100 and 50 sample size

In general, the test, while seemingly objective, has more negatives than positives when measured against the consensus of the objectivists (and some subjectivists) in this group as to what constitutes a good ABX test:

TEST POSITIVES
* double blind
* level matched

TEST NEGATIVES
* short snippets
* no user control over switching and (apparently) no repeats
* no user control over content
* group test, no safeguards against visual interaction
* no group selection criteria apparent and no pre-training or testing

The results and the summary of positives/negatives above raise some interesting questions:

* Why, for example, should one cable be significantly identified when "X" and the other fail miserably to be identified? This has to be due to an interaction between the characteristics of the music samples chosen and the characteristics of the cables under test, perhaps aggravated by the use of short snippets with an inadequate time frame to establish the proper evaluation context. Did the test itself create the overall null, where people could not differentiate, based solely on the test not favoring B as much as A?
* Do the differences in people scoring high on the two tests support the idea that different people react to different attributes of the DUTs? Or does it again suggest some interaction between the music chosen, the characteristics of the individual pieces, and perhaps the evaluation time frame?
* Or is it possible that the ABX test itself, when used with short snippets, makes some kinds of differences more apparent and others less apparent, and thus, by working against exposing *all* kinds of differences, helps create more *no differences* than should be the result?
* Since the panel is not identified and there was no training, do the results suggest a "dumbing down" of differentiation from the scores of the more able listeners?

I am sure it will be suggested that the two different high scorers were simply random outliers... I'm not so sure, especially since the individual scoring high on the cable test hears the cable differences exactly like the general sample but at a higher level (required because of the smaller sample size), and the high scorer on the amp test is in much the same position.

If some of these arguments sound familiar, they certainly raise echoes of the issues raised here by subjectivists over the years... and yet these specifics are rooted in the results of this one test. I'd like to hear other views on this test.

"Thomas A" wrote in message news:ahwVa.6957$cF.2308@rwcrnsc53...
> (Nousaine) wrote in message ...
> > (Thomas A) wrote:
> > > [snip]
> >
> > With regard to amplifiers, as of May 1990 there had been such tests.
> > In 1978 QUAD published an experiment with 576 trials. In 1980 Smith,
> > Peterson and Jackson published an experiment with 1104 trials; in 1989
> > Stereophile published a 3530-trial comparison. In 1986 Clark & Masters
> > published an experiment with 772 trials. All were null.
> >
> > There's a misconception that blind tests tend to have very small
> > sample sizes. As of 1990 the 23 published amplifier experiments had a
> > mean average of 426 and a median of 90 trials. If we exclude the
> > 3530-trial experiment the mean becomes 285 trials. The median remains
> > unchanged.
>
> Ok thanks. Is it possible to get the numbers for each test? I would
> like to see if it is possible to do a meta-analysis in the amplifier
> case. The test by tagmclaren is an additional one:
>
> http://www.tagmclaren.com/members/news/news77.asp
>
> Thomas
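The "Confidence" column in the table above can be recomputed from the exact binomial tail rather than by extrapolation. A Python check (not part of the original post):

```python
from math import comb

def confidence(correct: int, trials: int) -> float:
    """1 - P(X >= correct) under guessing: the exact-binomial analogue
    of the table's 'Confidence' column."""
    tail = sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials
    return 1 - tail

best_individual = confidence(11, 15)   # about 0.941 for the 11-of-15 score
panel_cable_a = confidence(52, 96)     # well below the extrapolated 94.8%
```

The exact figures come out lower than the extrapolated ones in the table (11 of 15 is about 94.1%, not 95.8%), a discrepancy taken up later in the thread.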
#12
Blindtest question
"Harry Lavo" wrote:
Thomas - Thanks for the post of the Tag Mclaren test link (and to Tom for the other references). I've looked at the Tag link and suspect it's going to add to the controversy here. Actually there's no 'controversey' here. No proponent of amp/wire-sound has ever shown that nominally competent amps or wires have any sound of their own when played back over loudspeakers. The only 'controversey' is over whether Arny Kreuger's pcabx tests cab with headphones and special programs can be extrapolated to commerically available programs and speakers in a normally reverberant environment. The Tag-M results are fully within those expected given the more than 2 dozen published experiments of amps and wires. y comments on the test follow. From the tone of the web info on this test, one can presume that Tag set out to show its relatively inexpensive gear was just as good as some acknowledged industry standards. But....wonder why Tag choose the 99% confidence level? Why not? But you can analyze it any way your want. That's the wonderful thing about published results. Being careful *not* to say that it was prechosen in advance? 
It is because had they used the more common and almost universally-used 95% level it would have shown that: * When cable A was the "X" it was recognized at a significant level by the panel (and guess whose cable probably would "lose" in a preference test versus a universally recognized standard of excellence chosen as "tops" by both Stereophile and TAS, as well as by other industry publishers) * One individual differentiated both cable A and combined cables at the significant level Results summarized as follows: Tag Mclaren Published ABX Results Sample 99% 95% Actual Confidence Total Test Cables A 96 60 53 e 52 94.8% e B 84 54 48 e 38 coin toss Both 180 107 97 e 90 coin toss Amps A 96 60 53 e 47 coin toss B 84 54 48 e 38 coin toss Both 180 107 97 e 85 coin toss Top Individuals Cables A 8 8 7 6 94.5% B 7 7 7 5 83.6% Both 15 13 11 11 95.8% Amps A 8 8 7 5 83.6% B 7 7 7 5 83.6% Both 15 13 11 10 90.8% e = extrapolated based on scores for 100 and 50 sample size In general, the test while seemingly objective has more negatives than positives when measured against the consensus of the objectivists (and some subjectivists) in this group as to what constitutes a good abx test: This is what always happens with 'bad news.' Instead of giving us contradictory evidence we get endless wishful 'data-dredging' to find any possible reason to ignore the evidence. In any other circle when one thinks the results of a given experiment are wrong they just duplicate it showing the error OR produce a valid one with contrary evidence. TEST POSITIVES *double blind *level matched TEST NEGATIVES *short snippets *no user control over switching and (apparently) no repeats *no user control over content *group test, no safeguards against visual interaction *no group selection criteria apparent and no pre-training or testing OK how many of your sighted 'tests' have ignored one or all of these positives or negatives? 
The results and the summary of positives/negatives above raise some interesting questions: No, not really. All of the true questions about bias controlled listening tests have been addressed prior. *why, for example, should one cable be significantly identified when "x" and the other fail miserably to be identified. This has to be due and interaction between the characteristics of the music samples chosen, the characteristics of the cables under test, and perhaps aggravated by the use of short snippets with an inadequate time frame to establish the proper evaluation context. Did the test itself create the overall null where people could not differentiate based soley on the test not favoring B as much as A? * do the differences in people scoring high on the two tests support the idea that different people react to different attributes of the DUT's. Or does it again suggest some interaction between the music chosen, the characteristics of the individual pieces, and perhaps the evaluation time frame. * or is it possible that the abx test itself, when used with short snippets, makes some kinds of differences more apparent and others less apparent and thus by working against exposing *all* kinds of differences help create more *no differences* than should be the result. * since the panel is not identified and there was no training, do the results suggest a "dumbing down" of differentiation from the scores of the more able listeners? I am sure it will be suggested that the two different high scorers were simply random outliers...I'm not so sure especially since the individual scoring high on the cable test hears the cable differences exactly like the general sample but at a higher level (required because of smaller sample size) and the high scorer on the amp test is in much the same position. 
if some of these arguments sound familiar, they certainly raises echoes of the issues raised here by subjectivists over the years...and yet these specifics are rooted in the results of this one test. I'd like to hear other views on this test. These results are consistent with the 2 dozen and more other bias controlled listening tests of power amplifiers and wires. "Thomas A" wrote in message news:ahwVa.6957$cF.2308@rwcrnsc53... (Nousaine) wrote in message ... (Thomas A) wrote: Is there any published DBT of amps, CD players or cables where the number of trials are greater than 500? If there difference is miniscule there is likely that many "guesses" are wrong and would require many trials to reveal any subtle difference? Thomas With regard to amplifiers as of May 1990 there had been such tests. In 1978 QUAD published an erxperiment with 576 trials. In 1980 Smith peterson and Jackson published an experiment with 1104 trials; in 1989 Stereophile published a 3530 trial comparison. In 1986 Clark & Masters published an experiment with 772 trials. All were null. There's a misconception that blind tests tend to have very small sample sizes. As of 1990 the 23 published amplifier experiments had a mean average of 426 and a median of 90 trials. If we exclude the 3530 trial experiment the mean becomes 285 trials. The median remains unchanged. Ok thanks. Is it possible to get the numbers for each test? I would like to see if it possible to do a meta-analysis in the amplifier case. The test by tagmclaren is an additional one: Thanks for the reference. |
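A meta-analysis of the kind being asked about would need the correct-response counts, which the posts do not give, but the pooling step itself is simple. A Python sketch of crude fixed-effect pooling, using the four trial totals quoted above with *hypothetical* correct counts plugged in purely for illustration:

```python
from math import erf, sqrt

def pooled_test(results):
    """Pool (correct, trials) pairs across experiments and test the
    combined score against chance (p = 0.5) with a one-sided z-test."""
    correct = sum(c for c, _ in results)
    trials = sum(n for _, n in results)
    z = (correct - trials / 2) / (sqrt(trials) / 2)
    p = 0.5 * (1 - erf(z / sqrt(2)))   # one-sided p-value
    return z, p

# Trial totals from the post (QUAD, Smith/Peterson/Jackson, Stereophile,
# Clark & Masters); the correct counts here are made up for illustration.
studies = [(290, 576), (555, 1104), (1770, 3530), (390, 772)]
z, p = pooled_test(studies)
```

With near-chance counts like these the pooled score stays far from significance, which is what the quoted null results imply; with the real counts the same two lines would give the meta-analytic answer.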
#13
Blindtest question
Nousaine wrote:
> This is what always happens with 'bad news.' Instead of giving us
> contradictory evidence we get endless wishful 'data-dredging' to find
> any possible reason to ignore the evidence. In any other circle, when
> one thinks the results of a given experiment are wrong, they just
> duplicate it showing the error OR produce a valid one with contrary
> evidence.

Not necessarily. It's quite common for questions to be raised during peer review of a scientific paper; it is then incumbent upon the *experimenter*, not the critic, to justify his or her choice of protocol, or his/her explanation of the results. Often this involves doing more experiments to address the reviewer's concerns. Sometimes it merely involves explaining the results more clearly, or in more qualified terms. If the experimenter feels the reviewer has ignored some important point, that comes out too in the reply to the reviews.

I say all this having not yet visited the link, so I'm totally unbiased ;
#14
Blindtest question
On Wed, 30 Jul 2003 03:26:24 GMT, "Harry Lavo" wrote:

> From the tone of the web info on this test, one can presume that Tag
> set out to show its relatively inexpensive gear was just as good as
> some acknowledged industry standards. But... wonder why Tag chose the
> 99% confidence level? Being careful *not* to say that it was prechosen
> in advance? It is because had they used the more common and almost
> universally-used 95% level it would have shown that:

Can anyone smell fish? Specifically, red herring?

> * When cable A was the "X" it was recognized at a significant level by
> the panel (and guess whose cable probably would "lose" in a preference
> test versus a universally recognized standard of excellence chosen as
> "tops" by both Stereophile and TAS, as well as by other industry
> publishers)

No Harry, *all* tests fell below the 95% level, except for one single participant in the cable test, which just scraped in. Given that there were 12 volunteers, there's less than 2:1 odds against this happening when tossing coins. Interesting that you also failed to note that the 'best performers' in the cable test did *not* perform well in the amplifier test, and vice versa. You do love to cherry-pick in search of your *required* result, don't you?

> * One individual differentiated both cable A and the combined cables at
> the significant level
> [table and test positives/negatives snipped]
>
> *why, for example, should one cable be significantly identified when
> "x" and the other fail miserably to be identified. This has to be due
> to an interaction between the characteristics of the music samples
> chosen and the characteristics of the cables under test, perhaps
> aggravated by the use of short snippets with an inadequate time frame
> to establish the proper evaluation context.

No it doesn't, Harry; it doesn't *have* to be due to anything but random chance.

> Did the test itself create the overall null where people could not
> differentiate based solely on the test not favoring B as much as A?
> * do the differences in people scoring high on the two tests support
> the idea that different people react to different attributes of the
> DUTs? Or does it again suggest some interaction between the music
> chosen, the characteristics of the individual pieces, and perhaps the
> evaluation time frame?

No, since the high scorers on one test were not the high scorers in the other test. It's called a distribution, Harry, and it is simply more evidence that there were in fact no audible differences - as any reasonable person would expect.

http://www.tagmclaren.com/members/news/news77.asp

--
Stewart Pinkerton | Music is Art - Audio is Engineering
#15
Blindtest question
"Stewart Pinkerton" wrote in message
news:7jKVa.10234$Oz4.4174@rwcrnsc54... On Wed, 30 Jul 2003 03:26:24 GMT, "Harry Lavo" wrote: From the tone of the web info on this test, one can presume that Tag set out to show its relatively inexpensive gear was just as good as some acknowledged industry standards. But....wonder why Tag choose the 99% confidence level? Being careful *not* to say that it was prechosen in advance? It is because had they used the more common and almost universally-used 95% level it would have shown that: Can anyone smell fish? Specifically, red herring? Are you an outlyer? Or are you simply sensitive to fish? Or did you not conceive that thought double-blind and it is just your imagination? :=) * When cable A was the "X" it was recognized at a significant level by the panel (and guess whose cable probably would "lose" in a preference test versus a universally recognized standard of excellence chosen as "tops" by both Stereophile and TAS, as well as by other industry publishers) No Harry, *all* tests fell below the 95% level, except for one single participant in the cable test, which just scraped in. Given that there were 12 volunteers, there's less than 2:1 odds against this happening when tossing coins. Interesting that you also failed to note that the 'best performers' in the cable test did *not* perform well in the amplifier test, and vice versa. I'm sorry, but when rounded to whole numbers 94.8% is a lot closer than one number higher which would be about 96% in the larger panels and 97% in the smaller panels. The standard is 95%. To say that 94.8% doesn't qualify is splitting hairs. I inclduded the actual numbers needed to pass the barrier just to satisfy the purists, but you *ARE* splitting hairs here, Stewart. You do love to cherry-pick in search of your *required* result, don't you? You mean not accepting the "received truth" without doing my own analysis is cherry picking, is that it Stewart? We are not allowed to point out anonomlies and ask "why"? "how come"? 
"what could be causing this?" And would you explain why a significant level was reached on the "A" cable test with 96 trials? Was that "cherry picking". C'mon, Stewart, you know better. In fact the real issue here is: if one cable can be so readily picked out, why can't the other be? What is it in the test, procedure, quality of the cables, order bias, or what. Something is rotten in the beloved state of ABX here! * One individual differentiated both cable A and combined cables at the significant level Results summarized as follows: Tag Mclaren Published ABX Results Sample 99% 95% Actual Confidence Total Test Cables A 96 60 53 e 52 94.8% e B 84 54 48 e 38 coin toss Both 180 107 97 e 90 coin toss Amps A 96 60 53 e 47 coin toss B 84 54 48 e 38 coin toss Both 180 107 97 e 85 coin toss Top Individuals Cables A 8 8 7 6 94.5% B 7 7 7 5 83.6% Both 15 13 11 11 95.8% Amps A 8 8 7 5 83.6% B 7 7 7 5 83.6% Both 15 13 11 10 90.8% e = extrapolated based on scores for 100 and 50 sample size In general, the test while seemingly objective has more negatives than positives when measured against the consensus of the objectivists (and some subjectivists) in this group as to what constitutes a good abx test: TEST POSITIVES *double blind *level matched TEST NEGATIVES *short snippets *no user control over switching and (apparently) no repeats *no user control over content *group test, no safeguards against visual interaction *no group selection criteria apparent and no pre-training or testing The results and the summary of positives/negatives above raise some interesting questions: *why, for example, should one cable be significantly identified when "x" and the other fail miserably to be identified. This has to be due and interaction between the characteristics of the music samples chosen, the characteristics of the cables under test, and perhaps aggravated by the use of short snippets with an inadequate time frame to establish the proper evaluation context. 
No it doen't Harry, I doesn't *have* to be due to anything but random chance. Did the test itself create the overall null where people could not differentiate based soley on the test not favoring B as much as A? * do the differences in people scoring high on the two tests support the idea that different people react to different attributes of the DUT's. Or does it again suggest some interaction between the music chosen, the characteristics of the individual pieces, and perhaps the evaluation time frame. No, since the high scorers on one test were not the high scorers in the other test. It's called a distrinution, harry, and it is simply more evidence that there were in fact no audible differences - as any reasonable person would expect. http://www.tagmclaren.com/members/news/news77.asp -- Stewart Pinkerton | Music is Art - Audio is Engineering I notice no comment on this latter part, Stewart. That is the *SUBSTANCE* of the interesting results of the test/techniques used and the questions raised. |
#16
Blindtest question
In article zVGVa.15179$Ho3.2323@sccrnsc03, "Harry Lavo" wrote:

> Thomas -
>
> Thanks for the post of the Tag McLaren test link (and to Tom for the
> other references). I've looked at the Tag link and suspect it's going
> to add to the controversy here. My comments on the test follow.
> [comments and table snipped]
> [snip]
> I'd like to hear other views on this test.

Mr. Lavo, here are some comments on your numbers. The short story: your numbers are bogus. The long story follows.

I don't know how you came up with critical values for what you think is a reasonable level of significance. The actual critical values are:

  n      .01 level   .05 level   .20 level
  96        60          57          53
  84        54          51          47
 180       107         102          97
   8         8           7           6
   7         7           7           6
  15        13          12          10

The values you provide for what you call 95% confidence (i.e., .05 level of significance) are almost the correct values for 20% significance.

You make much of an apparently borderline significant result, where the best individual cable-test score was 11 of 15 correct. If that had been the entire experiment, we would have a p-value of .059, reflecting the probability that one would do at least that well in a single run of 15 trials. That is, the probability that someone would score 11, 12, 13, 14, or 15 correct just by guessing is .059; also, the probability that the score is less than 11 would be 1 - .059 = .941 for a single batch of 15 trials.

But what was reported was the best such performance in a dozen sets of trials. That's not the same as a single run of 15 trials. The probability of at least one of 12 subjects doing at least as well as 11 of 15 is 1 - [probability that all 12 do worse than 11 of 15]. Thus we get 1 - (.941)^12, which is about .52. So, even your star performer is not doing better than chance suggests he should.

Mr. Lavo, your conjecture (that the test organizers have tried to distort the results to fit an agenda) appears to be without support.

Now for some comments on the TAG McLaren report itself. There are problems with some numbers provided by TAG McLaren, but they are confined to background material. There do not appear to be problems with the actual report of experimental results. TAG McLaren claims that you need more than 10 trials to obtain results significant at the .01 level, but they are wrong. In fact, 7 trials suffice. With 10 trials you can reach .001. There is a table just before the results section of their report with some discussion about small sample sizes. The first several rows of that table have bogus numbers in the third column, and their sample-size claims are based on those wrong numbers. However, the values for 11 or more trials are correct. As I have already noted, the numbers used in the report itself appear to be correct.

Some would argue that their conclusion should be that they found no evidence to support a claim of audible difference, rather than concluding that there was no audible difference, but that's another issue. You have correctly noted concerns about the form of ABX presentation (not the same as the usual ABX scheme discussed on rahe), but that does not invalidate the experiment. There are questions about how the sample of listeners was obtained. For the most part, TAG McLaren seems to have designed a test, carried it out, run the numbers properly, and then accurately reported what they did. That's more than can be said for many tests.

JC
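The best-of-a-panel correction described in the post above is a two-line calculation. A Python check of the 1 - (.941)^12 arithmetic:

```python
from math import comb

def single_run_tail(correct: int, trials: int) -> float:
    """P(X >= correct) for one listener guessing through `trials` trials."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

p_single = single_run_tail(11, 15)        # about .059 for one listener
# Chance that the *best* of 12 independent guessers scores 11/15 or better:
p_best_of_12 = 1 - (1 - p_single) ** 12   # about .52
```

A result that looks borderline for one listener is better than a coin flip to show up somewhere in a panel of twelve, which is why a top individual score carries so little evidential weight on its own.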
#17
Blindtest question
"John Corbett" wrote in message
news:hQlWa.37169$YN5.32913@sccrnsc01...

In article zVGVa.15179$Ho3.2323@sccrnsc03, "Harry Lavo" wrote:

Thomas - Thanks for the post of the Tag McLaren test link (and to Tom for the other references). I've looked at the Tag link and suspect it's going to add to the controversy here. My comments on the test follow.

From the tone of the web info on this test, one can presume that Tag set out to show its relatively inexpensive gear was just as good as some acknowledged industry standards. But... wonder why Tag chose the 99% confidence level, being careful *not* to say that it was prechosen in advance? It is because, had they used the more common and almost universally-used 95% level, it would have shown that:

* When cable A was the "X", it was recognized at a significant level by the panel (and guess whose cable probably would "lose" in a preference test versus a universally recognized standard of excellence chosen as "tops" by both Stereophile and TAS, as well as by other industry publishers)

* One individual differentiated both cable A and the combined cables at the significant level

Results summarized as follows:

Tag McLaren Published ABX Results

                  Sample   99%    95%    Actual   Confidence
Total Test
  Cables   A        96      60    53e      52     94.8% e
           B        84      54    48e      38     coin toss
           Both    180     107    97e      90     coin toss
  Amps     A        96      60    53e      47     coin toss
           B        84      54    48e      38     coin toss
           Both    180     107    97e      85     coin toss
Top Individuals
  Cables   A         8       8     7        6     94.5%
           B         7       7     7        5     83.6%
           Both     15      13    11       11     95.8%
  Amps     A         8       8     7        5     83.6%
           B         7       7     7        5     83.6%
           Both     15      13    11       10     90.8%

e = extrapolated based on scores for 100 and 50 sample sizes

[snip]

I'd like to hear other views on this test.

Mr. Lavo, here are some comments on your numbers. The short story: your numbers are bogus. The long story follows.

I don't know how you came up with critical values for what you think is a reasonable level of significance.
For the trial counts in question, the critical values (minimum number correct, one-tailed) are:

      n    .01 level   .05 level   .20 level
     96       60          57          53
     84       54          51          47
    180      107         102          97
      8        8           7           6
      7        7           7           6
     15       13          12          10

The values you provide for what you call 95% confidence (i.e., the .05 level of significance) are almost the correct values for the .20 level of significance.

You make much of an apparently borderline-significant result, where the best individual cable-test score was 11 of 15 correct. If that had been the entire experiment, we would have a p-value of .059, reflecting the probability that one would do at least that well in a single run of 15 trials. That is, the probability that someone would score 11, 12, 13, 14, or 15 correct just by guessing is .059; equivalently, the probability that the score is less than 11 is 1 - .059 = .941 for a single batch of 15 trials.

But what was reported was the best such performance in a dozen sets of trials. That's not the same as a single run of 15 trials. The probability of at least one of 12 subjects doing at least as well as 11 of 15 is 1 - [probability that all 12 do worse than 11 of 15]. Thus we get 1 - (.941)^12, which is about .52. So even your star performer is not doing better than chance suggests he should.

Mr. Lavo, your conjecture (that the test organizers have tried to distort the results to fit an agenda) appears to be without support.

Now for some comments on the TAG McLaren report itself. There are problems with some numbers provided by TAG McLaren, but they are confined to background material; there do not appear to be problems with the actual report of experimental results. TAG McLaren claims that you need more than 10 trials to obtain results significant at the .01 level, but they are wrong. In fact, 7 trials suffice, and with 10 trials you can reach .001. There is a table just before the results section of their report, with some discussion about small sample sizes. The first several rows of that table have bogus numbers in the third column, and their sample-size claims are based on those wrong numbers. However, the values for 11 or more trials are correct. As I have already noted, the numbers used in the report itself appear to be correct.

Some would argue that their conclusion should be that they found no evidence to support a claim of audible difference, rather than concluding that there was no audible difference, but that's another issue. You have correctly noted concerns about the form of ABX presentation (not the same as the usual ABX scheme discussed on rahe), but that does not invalidate the experiment. There are also questions about how the sample of listeners was obtained. For the most part, TAG McLaren seems to have designed a test, carried it out, run the numbers properly, and then accurately reported what they did. That's more than can be said for many tests.

John - I have explained that I made an error, and I thank you for pointing it out. I also explained how and why, but that is an explanation, not an excuse. Perhaps you could bring your statistical skills to bear on the Greenhill test as reported in Stereo Review in 1983. The raw results were posted here by me and by Ludovic about a year ago.
As I recall, one of the participants in that test did very well across several tests... I'd be interested in your calculation of the probability of his achieving those results by chance. This is not a troll... the mathematics of it are simply beyond me, and I tried at the time to calculate the odds and apparently failed. The argument was: outlier, or golden ear.
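Corbett's critical-value table can be checked exactly rather than by normal approximation. A short sketch (Python standard library only): the critical value is the smallest score whose exact one-tailed binomial tail probability does not exceed the chosen significance level.

```python
from math import comb

def tail(k, n):
    """P(X >= k) for X ~ Binomial(n, 1/2), computed exactly."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

def critical_value(n, alpha):
    """Smallest score k such that guessing reaches k or more
    with probability at most alpha (one-tailed)."""
    return next(k for k in range(n + 1) if tail(k, n) <= alpha)

# Spot-check a few rows of the table quoted above.
print(critical_value(96, 0.01))   # 60
print(critical_value(15, 0.05))   # 12
print(critical_value(8, 0.20))    # 6
```

The same function also confirms Corbett's side remark: with n = 7 trials, a perfect 7/7 already has tail probability 1/128, below the .01 level.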
#18
Blindtest question
(Thomas A) wrote:
(Nousaine) wrote in message ... (Thomas A) wrote:

Is there any published DBT of amps, CD players or cables where the number of trials is greater than 500? If the difference is minuscule, it is likely that many "guesses" are wrong, and it would require many trials to reveal any subtle difference?

Thomas

With regard to amplifiers, as of May 1990 there had been such tests. In 1978 QUAD published an experiment with 576 trials. In 1980 Smith, Peterson and Jackson published an experiment with 1104 trials; in 1989 Stereophile published a 3530-trial comparison. In 1986 Clark & Masters published an experiment with 772 trials. All were null.

There's a misconception that blind tests tend to have very small sample sizes. As of 1990 the 23 published amplifier experiments had a mean of 426 and a median of 90 trials. If we exclude the 3530-trial experiment, the mean becomes 285 trials; the median remains unchanged.

Ok thanks. Is it possible to get the numbers for each test? I would like to see if it is possible to do a meta-analysis in the amplifier case. The test by tagmclaren is an additional one: http://www.tagmclaren.com/members/news/news77.asp

Thomas

I did just that in 1990, to answer the nagging question "has sample size and barely audible difference hidden anything?" A summary of these data can be found in The Proceedings of the 1990 AES Conference "The Sound of Audio", May 1990, in the paper "The Great Debate: Is Anyone Winning?" (www.aes.org)

In general, larger sample sizes did not produce more significant results, and there wasn't a relationship of criterion score to sample size.

IME, if there is a true just-audible difference, scores tend to run high. For example, in tests I ran last summer, scores were, as I recall, 21/23 and 17/21 in two successive runs in a challenge where the session leader claimed a transparent transfer. IOW, results go from chance to strongly positive once threshold has been reached.
You can test this for yourself at www.pcabx.com where Arny Krueger has training sessions with increasing levels of difficulty. Also the codec testing sites are a good place to investigate this issue. |
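The two scores Nousaine recalls (21/23 and 17/21) can be checked against chance with the same exact binomial arithmetic (a sketch, standard library only):

```python
from math import comb

def p_at_least(k, n):
    """Probability of at least k correct in n trials under pure guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# The two runs recalled above.
p_21_23 = p_at_least(21, 23)   # about 3e-5
p_17_21 = p_at_least(17, 21)   # about 0.0036
print(p_21_23, p_17_21)
```

Both runs are individually significant well past the .01 level, which illustrates the point that scores tend to run high once a difference is genuinely audible.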
#19
Blindtest question
(Nousaine) wrote in message news:nWGVa.15987$YN5.14030@sccrnsc01...
(Thomas A) wrote: (Nousaine) wrote in message ... (Thomas A) wrote:

Is there any published DBT of amps, CD players or cables where the number of trials is greater than 500? If the difference is minuscule, it is likely that many "guesses" are wrong, and it would require many trials to reveal any subtle difference?

Thomas

With regard to amplifiers, as of May 1990 there had been such tests. In 1978 QUAD published an experiment with 576 trials. In 1980 Smith, Peterson and Jackson published an experiment with 1104 trials; in 1989 Stereophile published a 3530-trial comparison. In 1986 Clark & Masters published an experiment with 772 trials. All were null.

There's a misconception that blind tests tend to have very small sample sizes. As of 1990 the 23 published amplifier experiments had a mean of 426 and a median of 90 trials. If we exclude the 3530-trial experiment, the mean becomes 285 trials; the median remains unchanged.

Ok thanks. Is it possible to get the numbers for each test? I would like to see if it is possible to do a meta-analysis in the amplifier case. The test by tagmclaren is an additional one: http://www.tagmclaren.com/members/news/news77.asp

Thomas

I did just that in 1990, to answer the nagging question "has sample size and barely audible difference hidden anything?" A summary of these data can be found in The Proceedings of the 1990 AES Conference "The Sound of Audio", May 1990, in the paper "The Great Debate: Is Anyone Winning?" (www.aes.org)

Ok thanks. I'll look it up.

In general, larger sample sizes did not produce more significant results, and there wasn't a relationship of criterion score to sample size.

Were the data from all experiments pooled? That might not be the best way, if some experiments *did* include real audible differences but had sample sizes too small to reveal any statistically significant difference, whereas others did not include a real audible difference. Were any responses measured in the experiments?

Did any of the tests include control tests where the difference was audible but subtle, and then compare, e.g., different subjects? Were the "best scorers" allowed to repeat the experiments in the main experiment? Many questions, but they may be relevant when making a meta-analysis.

In addition, have any of the experiments used test signals in the LF range (around 15-20 Hz) and high-capability subwoofers (120 dB SPL @ 20 Hz)? I'm just curious, since the tests from the Swedish Audio-Technical Society frequently identify amplifiers that roll off in the low end using blind tests. It might not be said to be an audible difference, since the difference is perceived as a difference in vibrations in the body. I think I mentioned this before. Also, for testing CD players, has anybody used a sin^2 pulse in evaluating audible differences?

IME, if there is a true just-audible difference, scores tend to run high. For example, in tests I ran last summer, scores were, as I recall, 21/23 and 17/21 in two successive runs in a challenge where the session leader claimed a transparent transfer. IOW, results go from chance to strongly positive once threshold has been reached.

Yes, I have come to similar conclusions myself in my own system.

You can test this for yourself at www.pcabx.com, where Arny Krueger has training sessions with increasing levels of difficulty. Also, the codec testing sites are a good place to investigate this issue.

I've tried the tests at Arny's site a couple of times, but I feel I need better hardware to do these tests more accurately.
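The simplest pooling Thomas asks about, if per-test correct/total counts were available, is an exact binomial test on the combined counts. A sketch follows; the trial totals are the ones Nousaine cites, but the correct-answer counts are hypothetical placeholders invented for illustration, not data from the cited papers.

```python
from math import comb

def p_at_least(k, n):
    """Exact binomial tail probability under pure guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# (correct, total) per experiment. Totals are from the cited tests;
# the correct counts are HYPOTHETICAL, chosen near chance for illustration.
experiments = [(295, 576), (560, 1104), (392, 772)]

correct = sum(c for c, _ in experiments)
total = sum(n for _, n in experiments)
pooled_p = p_at_least(correct, total)
print(correct, total, pooled_p)
```

Note that pooling raw counts treats every trial as exchangeable across experiments; if the tests differ in listeners or gear, combining per-test p-values (e.g., Stouffer's method) is a common alternative that weights tests rather than trials.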
#21
Blindtest question
"Thomas A" wrote in message
newsBUUa.141509$OZ2.27088@rwcrnsc54

Is there any published DBT of amps, CD players or cables where the number of trials is greater than 500?

I think N = 200+ has been reached.

If the difference is minuscule, it is likely that many "guesses" are wrong, and it would require many trials to reveal any subtle difference?

If you look at theory casually, you might reach that conclusion. However, what invariably happens in tests that produce questionable results with a small number of trials is that adding more trials makes it clearer than ever that the small-sample results were due to random guessing.
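Arny's point can be illustrated numerically: a guesser has a fair chance of a suggestive score in a short run, but almost no chance of sustaining the same hit rate over a long one (a sketch, standard library only):

```python
from math import comb

def p_at_least(k, n):
    """Probability of at least k correct in n trials under pure guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# 70% correct looks suggestive over 10 trials: roughly one guesser
# in six gets there by luck alone...
p_small = p_at_least(7, 10)     # ~0.17
# ...but the same 70% hit rate sustained over 100 trials essentially
# never happens by guessing.
p_large = p_at_least(70, 100)
print(p_small, p_large)
```

So adding trials does exactly what the post says: a lucky small-sample streak either regresses toward 50%, or a real difference pulls clear of chance.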
#22
Blindtest question
"Arny Krueger" wrote in message news:0xnVa.4274$cF.1296@rwcrnsc53...
"Thomas A" wrote in message newsBUUa.141509$OZ2.27088@rwcrnsc54 Is there any published DBT of amps, CD players or cables where the number of trials are greater than 500? I think N = 200+ has been reached. If there difference is miniscule there is likely that many "guesses" are wrong and would require many trials to reveal any subtle difference? If you look at theory casually, you might reach that conclusion. However, what invariably happens in tests that produce questionable results with a small number of trials, is that adding more trials makes it clearer than ever that the small-sample results were due to random guessing. So what happens when the situation is small but just audible? Has any such test situations been set up? Does the result end up with close to 100% correct or e.g. 55% correct? My question is what happens when test subjects are "forced" with differences that approach to the "audible limit". |
#23
Blindtest question
Ref: Blindtest issues...
For what it's worth... use about any criteria you desire regarding cables, amps, etc. Also, if you feel better about it, put a sign on each component with its name in blazing qualities. It possibly will make you feel better about the system and, strangely, the whole thing might well sound better. That is part of this whole experience regarding audio: if your prejudices are deeply set from within, then give in to them and enjoy the music.

Be happy with the most expensive equipment you can afford; it might well be pretty good. Mentally, you might come to accept that fact; music will flourish, bloom, and all will be right with the Universe!!

All this "shadow-boxing" regarding "all is the same" is interesting in this strange dimension that surrounds Audio. Go with your own prejudices and be happy. Very important to your Audio happiness!

Leonard...

_______________________________________________________

On Sun, 27 Jul 2003 18:11:48 +0000, Thomas A wrote:

Is there any published DBT of amps, CD players or cables where the number of trials is greater than 500? If the difference is minuscule, it is likely that many "guesses" are wrong, and it would require many trials to reveal any subtle difference?

Thomas