#1
Is there any published DBT of amps, CD players or cables where the number of trials is greater than 500? If the difference is miniscule, isn't it likely that many "guesses" are wrong, so that many trials would be required to reveal any subtle difference?

Thomas
#2
Thomas A wrote:
> Is there any published DBT of amps, CD players or cables where the number
> of trials are greater than 500? If there difference is miniscule there is
> likely that many "guesses" are wrong and would require many trials to
> reveal any subtle difference?

There are published tests where people claimed they could hear the difference sighted, but when they were 'blinded' they could not. In this case the argument that 500 trials are needed would seem to be weak.

However, a real and miniscule difference would certainly be discerned more reliably if there was specific training to hear it beforehand.

--
-S.
#3
Steven Sullivan wrote in message news:d_XUa.142496$GL4.36308@rwcrnsc53...
> Thomas A wrote:
>> Is there any published DBT of amps, CD players or cables where the number
>> of trials are greater than 500? If there difference is miniscule there is
>> likely that many "guesses" are wrong and would require many trials to
>> reveal any subtle difference?
>
> There are published tests where people claimed they could hear the
> difference sighted, but when they were 'blinded' they could not. In this
> case the argument that 500 trials are needed would seem to be weak.

Yes, that's for sure. But how are scientific tests of just noticeable difference set up? A difference, when very small, could introduce more incorrect answers from the test subjects. Thus I think the question is interesting.

> However, a real and miniscule difference would certainly be discerned more
> reliably if there was specific training to hear it beforehand.

Yes, but still, if the difference is real and miniscule it could introduce incorrect answers even if there is specific training beforehand. If it were an all-or-nothing thing, then the result would always be 100% correct (difference) or 50% (no difference). What if the answers are 60% correct?
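(An aside on the 60% case: how many trials it takes to separate a listener who is genuinely right 60% of the time from a coin-flipper can be sketched in a few lines. This is only an illustration under assumed numbers, using Python with scipy, and is not part of the original exchange.)

    from scipy.stats import binom

    def critical_score(n, alpha=0.05):
        """Smallest score k with P(X >= k | pure guessing) <= alpha."""
        for k in range(n + 1):
            if binom.sf(k - 1, n, 0.5) <= alpha:
                return k

    def trials_needed(p_true=0.60, alpha=0.05, power=0.80):
        """Smallest trial count at which a listener with hit rate p_true
        passes the alpha criterion with probability >= power."""
        for n in range(10, 2001):
            k = critical_score(n, alpha)
            if binom.sf(k - 1, n, p_true) >= power:
                return n, k

    print(trials_needed())   # on the order of 150 trials for 60% vs. 50%

On those assumptions a genuine 60% listener needs on the order of 150 trials to pass reliably: far more than a 16-trial session, but also well short of 500.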
#4
"Thomas A" wrote in message
news:...
> Is there any published DBT of amps, CD players or cables where the number
> of trials are greater than 500?

I've never seen one. It would be difficult to get a single subject to do that many trials. So, it would have to be many subjects, and they would have to be isolated to prevent subtle influence from one to the other.

> If there difference is miniscule there is likely that many "guesses" are
> wrong and would require many trials to reveal any subtle difference?

(Note that the word is spelled "minuscule.")

Norm Strong
#8
Thomas A wrote:
Steven Sullivan wrote in message news:d_XUa.142496$GL4.36308@rwcrnsc53... Thomas A wrote: Is there any published DBT of amps, CD players or cables where the number of trials are greater than 500? If there difference is miniscule there is likely that many "guesses" are wrong and would require many trials to reveal any subtle difference? There are published tests where people claimed they could hear the difference sighted , but when they were 'blinded' they could not. In this case the argument that 500 trials are needed would seem to be weak. Yes, that's for sure. But how are scientific tests of just noticable difference set up? A difference, when very small, could introduce more incorrect answers from the test subjects. Thus I think the question is interesting. However, a real and miniscule difference would certainly be discerned more reliably if there was specific training to hear it beforehand. Yes, but still, if the difference is real and miniscule it could introduce incorrect answers even if there is specific training beforehand. If there would be an all or nothing thing, then the result would always be 100% correct (difference) or 50% (no difference). What if the answers are 60% correct? What level of certitude are you looking for? Scientists use statistical tools to calculate probabilities of different kinds of error in such cases. -- -S. |
#9
"Thomas A" wrote in message
news:...
> Is there any published DBT of amps, CD players or cables where the number
> of trials are greater than 500?

I think N = 200+ has been reached.

> If there difference is miniscule there is likely that many "guesses" are
> wrong and would require many trials to reveal any subtle difference?

If you look at theory casually, you might reach that conclusion. However, what invariably happens in tests that produce questionable results with a small number of trials is that adding more trials makes it clearer than ever that the small-sample results were due to random guessing.
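(A throwaway illustration of that last point: a pure guesser's running score drifts back toward 50% as trials accumulate, whatever the first handful of trials happened to show. The trial counts below are arbitrary.)

    import random

    random.seed(1)                   # any seed; the drift toward 50% is the point
    correct = 0
    for trial in range(1, 501):
        correct += random.random() < 0.5       # coin-flip "identification"
        if trial in (10, 16, 50, 100, 250, 500):
            print(f"{trial:3d} trials: {correct:3d} correct ({correct / trial:.0%})")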
#11
"Arny Krueger" wrote in message news:0xnVa.4274$cF.1296@rwcrnsc53...
"Thomas A" wrote in message news ![]() Is there any published DBT of amps, CD players or cables where the number of trials are greater than 500? I think N = 200+ has been reached. If there difference is miniscule there is likely that many "guesses" are wrong and would require many trials to reveal any subtle difference? If you look at theory casually, you might reach that conclusion. However, what invariably happens in tests that produce questionable results with a small number of trials, is that adding more trials makes it clearer than ever that the small-sample results were due to random guessing. So what happens when the situation is small but just audible? Has any such test situations been set up? Does the result end up with close to 100% correct or e.g. 55% correct? My question is what happens when test subjects are "forced" with differences that approach to the "audible limit". |
#12
Steven Sullivan wrote in message news:UqnVa.4003$Oz4.1480@rwcrnsc54...
> Thomas A wrote:
> [earlier exchange snipped]
>> Yes, but still, if the difference is real and miniscule it could introduce
>> incorrect answers even if there is specific training beforehand. If there
>> would be an all or nothing thing, then the result would always be 100%
>> correct (difference) or 50% (no difference). What if the answers are 60%
>> correct?
>
> What level of certitude are you looking for? Scientists use statistical
> tools to calculate probabilities of different kinds of error in such cases.

Well, confidence limits of 95% or 99% are usually applied. The power of the test is, however, important when you approach the audible limit. Also, in sample sizes above 200 you need not use the correction for continuity in the statistical calculation. I am not sure, but I think this correction applies when sample sizes are 25-200. Below 25, this correction is not sufficient.
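(A quick sketch of the continuity-correction point, assuming Python with scipy; the scores used are arbitrary examples, not results from any test in this thread. The half-count correction matters most at small trial counts, where the plain normal approximation understates the exact binomial tail.)

    import math
    from scipy.stats import binom, norm

    def one_sided_tails(k, n):
        """P(at least k correct of n by guessing): exact binomial, plain normal
        approximation, and normal approximation with continuity correction."""
        exact = binom.sf(k - 1, n, 0.5)
        mu, sd = n / 2.0, math.sqrt(n) / 2.0
        plain = norm.sf((k - mu) / sd)
        corrected = norm.sf((k - 0.5 - mu) / sd)
        return exact, plain, corrected

    for k, n in ((13, 20), (57, 96), (115, 200)):
        exact, plain, corrected = one_sided_tails(k, n)
        print(f"{k}/{n}: exact {exact:.3f}  normal {plain:.3f}  corrected {corrected:.3f}")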
#13
Thomas -
Thanks for the post of the Tag McLaren test link (and to Tom for the other references). I've looked at the Tag link and suspect it's going to add to the controversy here. My comments on the test follow.

From the tone of the web info on this test, one can presume that Tag set out to show its relatively inexpensive gear was just as good as some acknowledged industry standards. But.... wonder why Tag chose the 99% confidence level? Being careful *not* to say that it was prechosen in advance? It is because, had they used the more common and almost universally-used 95% level, it would have shown that:

* When cable A was the "X" it was recognized at a significant level by the panel (and guess whose cable probably would "lose" in a preference test versus a universally recognized standard of excellence chosen as "tops" by both Stereophile and TAS, as well as by other industry publishers)

* One individual differentiated both cable A and combined cables at the significant level

Results summarized as follows:

  Tag McLaren Published ABX Results

                        Sample
  Test                  Total    99%    95%    Actual   Confidence
  Cables      A           96      60    53 e     52     94.8% e
              B           84      54    48 e     38     coin toss
              Both       180     107    97 e     90     coin toss
  Amps        A           96      60    53 e     47     coin toss
              B           84      54    48 e     38     coin toss
              Both       180     107    97 e     85     coin toss

  Top Individuals
  Cables      A            8       8     7        6     94.5%
              B            7       7     7        5     83.6%
              Both        15      13    11       11     95.8%
  Amps        A            8       8     7        5     83.6%
              B            7       7     7        5     83.6%
              Both        15      13    11       10     90.8%

  e = extrapolated based on scores for 100 and 50 sample size

In general, the test, while seemingly objective, has more negatives than positives when measured against the consensus of the objectivists (and some subjectivists) in this group as to what constitutes a good abx test:

TEST POSITIVES
* double blind
* level matched

TEST NEGATIVES
* short snippets
* no user control over switching and (apparently) no repeats
* no user control over content
* group test, no safeguards against visual interaction
* no group selection criteria apparent and no pre-training or testing

The results and the summary of positives/negatives above raise some interesting questions:

* Why, for example, should one cable be significantly identified when "x" and the other fail miserably to be identified? This has to be due to an interaction between the characteristics of the music samples chosen, the characteristics of the cables under test, and perhaps aggravated by the use of short snippets with an inadequate time frame to establish the proper evaluation context. Did the test itself create the overall null where people could not differentiate, based solely on the test not favoring B as much as A?

* Do the differences in people scoring high on the two tests support the idea that different people react to different attributes of the DUTs? Or does it again suggest some interaction between the music chosen, the characteristics of the individual pieces, and perhaps the evaluation time frame?

* Or is it possible that the abx test itself, when used with short snippets, makes some kinds of differences more apparent and others less apparent, and thus by working against exposing *all* kinds of differences helps create more *no differences* than should be the result?

* Since the panel is not identified and there was no training, do the results suggest a "dumbing down" of differentiation from the scores of the more able listeners?
I am sure it will be suggested that the two different high scorers were simply random outliers...I'm not so sure especially since the individual scoring high on the cable test hears the cable differences exactly like the general sample but at a higher level (required because of smaller sample size) and the high scorer on the amp test is in much the same position. if some of these arguments sound familiar, they certainly raises echoes of the issues raised here by subjectivists over the years...and yet these specifics are rooted in the results of this one test. I'd like to hear other views on this test. "Thomas A" wrote in message news:ahwVa.6957$cF.2308@rwcrnsc53... (Nousaine) wrote in message ... (Thomas A) wrote: Is there any published DBT of amps, CD players or cables where the number of trials are greater than 500? If there difference is miniscule there is likely that many "guesses" are wrong and would require many trials to reveal any subtle difference? Thomas With regard to amplifiers as of May 1990 there had been such tests. In 1978 QUAD published an erxperiment with 576 trials. In 1980 Smith peterson and Jackson published an experiment with 1104 trials; in 1989 Stereophile published a 3530 trial comparison. In 1986 Clark & Masters published an experiment with 772 trials. All were null. There's a misconception that blind tests tend to have very small sample sizes. As of 1990 the 23 published amplifier experiments had a mean average of 426 and a median of 90 trials. If we exclude the 3530 trial experiment the mean becomes 285 trials. The median remains unchanged. Ok thanks. Is it possible to get the numbers for each test? I would like to see if it possible to do a meta-analysis in the amplifier case. The test by tagmclaren is an additional one: http://www.tagmclaren.com/members/news/news77.asp Thomas |
#14
(Thomas A) wrote:
> (Nousaine) wrote in message ...
>> (Thomas A) wrote:
>>> Is there any published DBT of amps, CD players or cables where the number
>>> of trials are greater than 500? If there difference is miniscule there is
>>> likely that many "guesses" are wrong and would require many trials to
>>> reveal any subtle difference? Thomas
>>
>> With regard to amplifiers, as of May 1990 there had been such tests. In
>> 1978 QUAD published an experiment with 576 trials. In 1980 Smith, Peterson
>> and Jackson published an experiment with 1104 trials; in 1989 Stereophile
>> published a 3530-trial comparison. In 1986 Clark & Masters published an
>> experiment with 772 trials. All were null.
>>
>> There's a misconception that blind tests tend to have very small sample
>> sizes. As of 1990 the 23 published amplifier experiments had a mean
>> average of 426 and a median of 90 trials. If we exclude the 3530-trial
>> experiment the mean becomes 285 trials. The median remains unchanged.
>
> Ok thanks. Is it possible to get the numbers for each test? I would like to
> see if it is possible to do a meta-analysis in the amplifier case. The test
> by tagmclaren is an additional one:
> http://www.tagmclaren.com/members/news/news77.asp
> Thomas

I did just that in 1990 to answer the nagging question "has sample size and barely audible difference hidden anything?" A summary of these data can be found in the Proceedings of the 1990 AES Conference "The Sound of Audio", May 1990, in the paper "The Great Debate: Is Anyone Winning?" (www.aes.org). In general, larger sample sizes did not produce more significant results, and there wasn't a relationship of criterion score to sample size.

IME if there is a true just-audible difference, scores tend to run high. For example, in tests I ran last summer scores were, as I recall, 21/23 and 17/21 in two successive runs in a challenge where the session leader claimed a transparent transfer. IOW results go from chance to strongly positive once threshold has been reached.

You can test this for yourself at www.pcabx.com where Arny Krueger has training sessions with increasing levels of difficulty. Also the codec testing sites are a good place to investigate this issue.
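(The quoted summary statistics are internally consistent, which is easy to check; a trivial sketch, assuming only that the "mean average of 426" refers to all 23 experiments.)

    # 23 experiments with mean 426 imply ~9798 total trials; dropping the
    # 3530-trial Stereophile comparison should then give the quoted mean of ~285.
    n_experiments = 23
    mean_all = 426
    biggest = 3530
    total_trials = n_experiments * mean_all
    mean_without_biggest = (total_trials - biggest) / (n_experiments - 1)
    print(total_trials, round(mean_without_biggest))   # 9798, 285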
#15
"Harry Lavo" wrote:
Thomas - Thanks for the post of the Tag Mclaren test link (and to Tom for the other references). I've looked at the Tag link and suspect it's going to add to the controversy here. Actually there's no 'controversey' here. No proponent of amp/wire-sound has ever shown that nominally competent amps or wires have any sound of their own when played back over loudspeakers. The only 'controversey' is over whether Arny Kreuger's pcabx tests cab with headphones and special programs can be extrapolated to commerically available programs and speakers in a normally reverberant environment. The Tag-M results are fully within those expected given the more than 2 dozen published experiments of amps and wires. y comments on the test follow. From the tone of the web info on this test, one can presume that Tag set out to show its relatively inexpensive gear was just as good as some acknowledged industry standards. But....wonder why Tag choose the 99% confidence level? Why not? But you can analyze it any way your want. That's the wonderful thing about published results. Being careful *not* to say that it was prechosen in advance? It is because had they used the more common and almost universally-used 95% level it would have shown that: * When cable A was the "X" it was recognized at a significant level by the panel (and guess whose cable probably would "lose" in a preference test versus a universally recognized standard of excellence chosen as "tops" by both Stereophile and TAS, as well as by other industry publishers) * One individual differentiated both cable A and combined cables at the significant level Results summarized as follows: Tag Mclaren Published ABX Results Sample 99% 95% Actual Confidence Total Test Cables A 96 60 53 e 52 94.8% e B 84 54 48 e 38 coin toss Both 180 107 97 e 90 coin toss Amps A 96 60 53 e 47 coin toss B 84 54 48 e 38 coin toss Both 180 107 97 e 85 coin toss Top Individuals Cables A 8 8 7 6 94.5% B 7 7 7 5 83.6% Both 15 13 11 11 95.8% Amps A 8 8 7 5 83.6% B 7 7 7 5 83.6% Both 15 13 11 10 90.8% e = extrapolated based on scores for 100 and 50 sample size In general, the test while seemingly objective has more negatives than positives when measured against the consensus of the objectivists (and some subjectivists) in this group as to what constitutes a good abx test: This is what always happens with 'bad news.' Instead of giving us contradictory evidence we get endless wishful 'data-dredging' to find any possible reason to ignore the evidence. In any other circle when one thinks the results of a given experiment are wrong they just duplicate it showing the error OR produce a valid one with contrary evidence. TEST POSITIVES *double blind *level matched TEST NEGATIVES *short snippets *no user control over switching and (apparently) no repeats *no user control over content *group test, no safeguards against visual interaction *no group selection criteria apparent and no pre-training or testing OK how many of your sighted 'tests' have ignored one or all of these positives or negatives? The results and the summary of positives/negatives above raise some interesting questions: No, not really. All of the true questions about bias controlled listening tests have been addressed prior. *why, for example, should one cable be significantly identified when "x" and the other fail miserably to be identified. 
This has to be due and interaction between the characteristics of the music samples chosen, the characteristics of the cables under test, and perhaps aggravated by the use of short snippets with an inadequate time frame to establish the proper evaluation context. Did the test itself create the overall null where people could not differentiate based soley on the test not favoring B as much as A? * do the differences in people scoring high on the two tests support the idea that different people react to different attributes of the DUT's. Or does it again suggest some interaction between the music chosen, the characteristics of the individual pieces, and perhaps the evaluation time frame. * or is it possible that the abx test itself, when used with short snippets, makes some kinds of differences more apparent and others less apparent and thus by working against exposing *all* kinds of differences help create more *no differences* than should be the result. * since the panel is not identified and there was no training, do the results suggest a "dumbing down" of differentiation from the scores of the more able listeners? I am sure it will be suggested that the two different high scorers were simply random outliers...I'm not so sure especially since the individual scoring high on the cable test hears the cable differences exactly like the general sample but at a higher level (required because of smaller sample size) and the high scorer on the amp test is in much the same position. if some of these arguments sound familiar, they certainly raises echoes of the issues raised here by subjectivists over the years...and yet these specifics are rooted in the results of this one test. I'd like to hear other views on this test. These results are consistent with the 2 dozen and more other bias controlled listening tests of power amplifiers and wires. "Thomas A" wrote in message news:ahwVa.6957$cF.2308@rwcrnsc53... (Nousaine) wrote in message ... (Thomas A) wrote: Is there any published DBT of amps, CD players or cables where the number of trials are greater than 500? If there difference is miniscule there is likely that many "guesses" are wrong and would require many trials to reveal any subtle difference? Thomas With regard to amplifiers as of May 1990 there had been such tests. In 1978 QUAD published an erxperiment with 576 trials. In 1980 Smith peterson and Jackson published an experiment with 1104 trials; in 1989 Stereophile published a 3530 trial comparison. In 1986 Clark & Masters published an experiment with 772 trials. All were null. There's a misconception that blind tests tend to have very small sample sizes. As of 1990 the 23 published amplifier experiments had a mean average of 426 and a median of 90 trials. If we exclude the 3530 trial experiment the mean becomes 285 trials. The median remains unchanged. Ok thanks. Is it possible to get the numbers for each test? I would like to see if it possible to do a meta-analysis in the amplifier case. The test by tagmclaren is an additional one: Thanks for the reference. |
#16
On Wed, 30 Jul 2003 03:26:24 GMT, "Harry Lavo"
wrote: From the tone of the web info on this test, one can presume that Tag set out to show its relatively inexpensive gear was just as good as some acknowledged industry standards. But....wonder why Tag choose the 99% confidence level? Being careful *not* to say that it was prechosen in advance? It is because had they used the more common and almost universally-used 95% level it would have shown that: Can anyone smell fish? Specifically, red herring? * When cable A was the "X" it was recognized at a significant level by the panel (and guess whose cable probably would "lose" in a preference test versus a universally recognized standard of excellence chosen as "tops" by both Stereophile and TAS, as well as by other industry publishers) No Harry, *all* tests fell below the 95% level, except for one single participant in the cable test, which just scraped in. Given that there were 12 volunteers, there's less than 2:1 odds against this happening when tossing coins. Interesting that you also failed to note that the 'best performers' in the cable test did *not* perform well in the amplifier test, and vice versa. You do love to cherry-pick in search of your *required* result, don't you? * One individual differentiated both cable A and combined cables at the significant level Results summarized as follows: Tag Mclaren Published ABX Results Sample 99% 95% Actual Confidence Total Test Cables A 96 60 53 e 52 94.8% e B 84 54 48 e 38 coin toss Both 180 107 97 e 90 coin toss Amps A 96 60 53 e 47 coin toss B 84 54 48 e 38 coin toss Both 180 107 97 e 85 coin toss Top Individuals Cables A 8 8 7 6 94.5% B 7 7 7 5 83.6% Both 15 13 11 11 95.8% Amps A 8 8 7 5 83.6% B 7 7 7 5 83.6% Both 15 13 11 10 90.8% e = extrapolated based on scores for 100 and 50 sample size In general, the test while seemingly objective has more negatives than positives when measured against the consensus of the objectivists (and some subjectivists) in this group as to what constitutes a good abx test: TEST POSITIVES *double blind *level matched TEST NEGATIVES *short snippets *no user control over switching and (apparently) no repeats *no user control over content *group test, no safeguards against visual interaction *no group selection criteria apparent and no pre-training or testing The results and the summary of positives/negatives above raise some interesting questions: *why, for example, should one cable be significantly identified when "x" and the other fail miserably to be identified. This has to be due and interaction between the characteristics of the music samples chosen, the characteristics of the cables under test, and perhaps aggravated by the use of short snippets with an inadequate time frame to establish the proper evaluation context. No it doen't Harry, I doesn't *have* to be due to anything but random chance. Did the test itself create the overall null where people could not differentiate based soley on the test not favoring B as much as A? * do the differences in people scoring high on the two tests support the idea that different people react to different attributes of the DUT's. Or does it again suggest some interaction between the music chosen, the characteristics of the individual pieces, and perhaps the evaluation time frame. No, since the high scorers on one test were not the high scorers in the other test. It's called a distrinution, harry, and it is simply more evidence that there were in fact no audible differences - as any reasonable person would expect. 
http://www.tagmclaren.com/members/news/news77.asp -- Stewart Pinkerton | Music is Art - Audio is Engineering |
#17
Nousaine wrote:
This is what always happens with 'bad news.' Instead of giving us contradictory evidence we get endless wishful 'data-dredging' to find any possible reason to ignore the evidence. In any other circle when one thinks the results of a given experiment are wrong they just duplicate it showing the error OR produce a valid one with contrary evidence. Not necessarily. It's quite common for questions to be raised during peer review of a scientific paper; it is then incumbent upon the *experimenter*, not the critic, to justify his or her choice of protocol, or his/her explanation of the results. Often this involves doing more experiments to address the reviewer's concerns. Sometimes it merely involved explaining the results more clearly, or in more qualified terms. If the experimenter feels the reviewer has ignored some important point, that comes out too in the reply to the reviews. I say all this having not yet visited the link, so I'm totally unbiased ; |
#18
"Stewart Pinkerton" wrote in message
news:7jKVa.10234$Oz4.4174@rwcrnsc54...
> On Wed, 30 Jul 2003 03:26:24 GMT, "Harry Lavo" wrote:
>> From the tone of the web info on this test, one can presume that Tag set
>> out to show its relatively inexpensive gear was just as good as some
>> acknowledged industry standards. But.... wonder why Tag chose the 99%
>> confidence level? Being careful *not* to say that it was prechosen in
>> advance? It is because had they used the more common and almost
>> universally-used 95% level it would have shown that:
>
> Can anyone smell fish? Specifically, red herring?

Are you an outlier? Or are you simply sensitive to fish? Or did you not conceive that thought double-blind, and it is just your imagination? :=)

>> * When cable A was the "X" it was recognized at a significant level by the
>> panel (and guess whose cable probably would "lose" in a preference test
>> versus a universally recognized standard of excellence chosen as "tops" by
>> both Stereophile and TAS, as well as by other industry publishers)
>
> No Harry, *all* tests fell below the 95% level, except for one single
> participant in the cable test, which just scraped in. Given that there were
> 12 volunteers, there's less than 2:1 odds against this happening when
> tossing coins. Interesting that you also failed to note that the 'best
> performers' in the cable test did *not* perform well in the amplifier test,
> and vice versa.

I'm sorry, but when rounded to whole numbers 94.8% is a lot closer than one number higher, which would be about 96% in the larger panels and 97% in the smaller panels. The standard is 95%. To say that 94.8% doesn't qualify is splitting hairs. I included the actual numbers needed to pass the barrier just to satisfy the purists, but you *ARE* splitting hairs here, Stewart.

> You do love to cherry-pick in search of your *required* result, don't you?

You mean not accepting the "received truth" without doing my own analysis is cherry picking, is that it Stewart? We are not allowed to point out anomalies and ask "why"? "how come"? "what could be causing this?"

And would you explain why a significant level was reached on the "A" cable test with 96 trials? Was that "cherry picking"? C'mon, Stewart, you know better. In fact the real issue here is: if one cable can be so readily picked out, why can't the other be? What is it in the test, procedure, quality of the cables, order bias, or what? Something is rotten in the beloved state of ABX here!
* One individual differentiated both cable A and combined cables at the significant level Results summarized as follows: Tag Mclaren Published ABX Results Sample 99% 95% Actual Confidence Total Test Cables A 96 60 53 e 52 94.8% e B 84 54 48 e 38 coin toss Both 180 107 97 e 90 coin toss Amps A 96 60 53 e 47 coin toss B 84 54 48 e 38 coin toss Both 180 107 97 e 85 coin toss Top Individuals Cables A 8 8 7 6 94.5% B 7 7 7 5 83.6% Both 15 13 11 11 95.8% Amps A 8 8 7 5 83.6% B 7 7 7 5 83.6% Both 15 13 11 10 90.8% e = extrapolated based on scores for 100 and 50 sample size In general, the test while seemingly objective has more negatives than positives when measured against the consensus of the objectivists (and some subjectivists) in this group as to what constitutes a good abx test: TEST POSITIVES *double blind *level matched TEST NEGATIVES *short snippets *no user control over switching and (apparently) no repeats *no user control over content *group test, no safeguards against visual interaction *no group selection criteria apparent and no pre-training or testing The results and the summary of positives/negatives above raise some interesting questions: *why, for example, should one cable be significantly identified when "x" and the other fail miserably to be identified. This has to be due and interaction between the characteristics of the music samples chosen, the characteristics of the cables under test, and perhaps aggravated by the use of short snippets with an inadequate time frame to establish the proper evaluation context. No it doen't Harry, I doesn't *have* to be due to anything but random chance. Did the test itself create the overall null where people could not differentiate based soley on the test not favoring B as much as A? * do the differences in people scoring high on the two tests support the idea that different people react to different attributes of the DUT's. Or does it again suggest some interaction between the music chosen, the characteristics of the individual pieces, and perhaps the evaluation time frame. No, since the high scorers on one test were not the high scorers in the other test. It's called a distrinution, harry, and it is simply more evidence that there were in fact no audible differences - as any reasonable person would expect. http://www.tagmclaren.com/members/news/news77.asp -- Stewart Pinkerton | Music is Art - Audio is Engineering I notice no comment on this latter part, Stewart. That is the *SUBSTANCE* of the interesting results of the test/techniques used and the questions raised. |
#19
Steven Sullivan wrote in message et...
Nousaine wrote: This is what always happens with 'bad news.' Instead of giving us contradictory evidence we get endless wishful 'data-dredging' to find any possible reason to ignore the evidence. In any other circle when one thinks the results of a given experiment are wrong they just duplicate it showing the error OR produce a valid one with contrary evidence. Not necessarily. It's quite common for questions to be raised during peer review of a scientific paper; it is then incumbent upon the *experimenter*, not the critic, to justify his or her choice of protocol, or his/her explanation of the results. Often this involves doing more experiments to address the reviewer's concerns. Sometimes it merely involved explaining the results more clearly, or in more qualified terms. If the experimenter feels the reviewer has ignored some important point, that comes out too in the reply to the reviews. I say all this having not yet visited the link, so I'm totally unbiased ; Bravo Mr. Sullivan. I hope you'll be as pleased to accept my applause as I am to see your excellent exposure of the frequently-voiced challenge to the ABX sceptics to "prove" their sceptical questions. Exposure coming from an unexpected corner. Perhaps we're seeing a revival of intellectual integrity in debate on RAHE. I promise to quote your summary when occasion warrants it. Ludovic Mirabel |
#20
ludovic mirabel wrote:
> Steven Sullivan wrote in message et...
>> Nousaine wrote:
>> [earlier exchange on peer review snipped]
>
> Bravo Mr. Sullivan. I hope you'll be as pleased to accept my applause as I
> am to see your excellent exposure of the frequently-voiced challenge to the
> ABX sceptics to "prove" their sceptical questions.

Actually, ludovic, what tends to happen far more often is that skeptics ask subjectivists to prove *their* claims, which is quite proper.

Also, as I implied, the mere act of *questioning* does not make the question well-founded or mean that it requires answering. In your case, I have observed that they almost never are. In a peer-review process, the poor foundation and/or bad understanding behind such queries would be noted by the experimenter, who would make his case to the editor, and the points would not be required to be addressed.

There is no 'exposure' involved here, except of your own agenda, as usual.

--
-S.
#21
(Nousaine) wrote in message news:nWGVa.15987$YN5.14030@sccrnsc01...
(Thomas A) wrote: (Nousaine) wrote in message ... (Thomas A) wrote: Is there any published DBT of amps, CD players or cables where the number of trials are greater than 500? If there difference is miniscule there is likely that many "guesses" are wrong and would require many trials to reveal any subtle difference? Thomas With regard to amplifiers as of May 1990 there had been such tests. In 1978 QUAD published an erxperiment with 576 trials. In 1980 Smith peterson and Jackson published an experiment with 1104 trials; in 1989 Stereophile published a 3530 trial comparison. In 1986 Clark & Masters published an experiment with 772 trials. All were null. There's a misconception that blind tests tend to have very small sample sizes. As of 1990 the 23 published amplifier experiments had a mean average of 426 and a median of 90 trials. If we exclude the 3530 trial experiment the mean becomes 285 trials. The median remains unchanged. Ok thanks. Is it possible to get the numbers for each test? I would like to see if it possible to do a meta-analysis in the amplifier case. The test by tagmclaren is an additional one: http://www.tagmclaren.com/members/news/news77.asp Thomas I did just that in 1990 to answer the nagging question "has sample size and barely audible difference hidden anything?" A summary of these data can be found in The Proceedings of the 1990 AES Conference "The Sound of Audio" May 1990 in the paper "The Great Debate: Is Anyone Winning?" (www.aes.org) Ok thanks. I'll look it up. In general larger sample sizes did not produce more significant results and there wasn't a relationship of criterion score to sample size. Where the data from all experiments pooled? It might not be the best way, if some experiments *did* include real audible differences but in which the sample size was too small to reveal any statistically significant difference whereas other did not include real audible difference. Any measured responses in the experiments? Did any of test include control tests where the difference was audible but subtle and then comparing e.g. different subjects? Where the "best scorers" allowed to repeat the experiments in the main experiment? Many questions but they may be relevant when making a meta-analysis. In addition, have any of the experiments used test signals in the LF range (around 15-20 Hz) and high-capable subwoofers (120 dB SPL @ 20 Hz)? I've just curious since the tests from the Swedish Audio-Technical Society frequently identifies amplfiers than roll of in the low end using blind tests. It might not be said to be an audible difference since the difference is percieved as a difference in vibrations in the body. I think I mentioned this before. Also for testing CD players, have anybody used a sin2 pulse in evaluating audible differences? IME if there is a true just-audible difference scores tend to run high. For example in tests I ran last summer scores were, as I recall, 21/23 and 17/21 in two successive runs in a challenge where the session leader claimed a transparent transfer. IOW results go from chance to strongly positive once threshold has been reached. Yes, I have come to similar conclusions myself in my own system. You can test this for yourself at www.pcabx.com where Arny Krueger has training sessions with increasing levels of difficulty. Also the codec testing sites are a good place to investigate this issue. I've tried the tests at Arnys site a couple of times, but I feel I need better hardware to do these tests more accurate. |
#22
In article nR%Va.24664$YN5.23125@sccrnsc01, Harry Lavo wrote:
> You mean not accepting the "received truth" without doing my own analysis
> is cherry picking, is that it Stewart? We are not allowed to point out
> anomalies and ask "why"? "how come"? "what could be causing this?"

You are indeed cherry picking. With 12 individuals the probability that one would appear to meet the 95% level is fairly high. Remember that you can expect 1 in 20 to meet that level entirely at random. It is not acceptable scientific practice to select specific data sub-sets out of the complete set. Otherwise you could "prove" anything by simply running enough trials and ignoring those you don't like. Check any peer-reviewed journal.

In any event, 11 out of 15 has a probability of 5.9% of occurring by chance. That does not meet the 95% confidence level. It would be rejected in a peer-reviewed statistical study. (If that were the only data, more trials would be called for. But it wasn't the only data.)

> And would you explain why a significant level was reached on the "A" cable
> test with 96 trials? Was that "cherry picking"? C'mon, Stewart, you know
> better. In fact the real issue here is: if one cable can be so readily
> picked out, why can't the other be? What is it in the test, procedure,
> quality of the cables, order bias, or what? Something is rotten in the
> beloved state of ABX here!

Where are you getting your numbers? The data they posted on the web page showed that there were 52 correct answers in 96 trials. At least 52 correct answers will occur entirely by chance 23.8% of the time. This is far from statistically significant.
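(Both of those tail probabilities are straightforward to reproduce; a two-line check, assuming Python with scipy available.)

    from scipy.stats import binom

    # P(at least k correct of n when guessing) = binom.sf(k - 1, n, 0.5)
    print(binom.sf(10, 15, 0.5))   # >= 11 of 15: about 0.059
    print(binom.sf(51, 96, 0.5))   # >= 52 of 96: about 0.24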
#24
In article zVGVa.15179$Ho3.2323@sccrnsc03, "Harry Lavo"
wrote:
> Thomas -
>
> Thanks for the post of the Tag Mclaren test link (and to Tom for the other
> references). I've looked at the Tag link and suspect it's going to add to
> the controversy here. My comments on the test follow.
>
> [results table and test critique snipped; see the post above]
>
> I'd like to hear other views on this test.

Mr. Lavo, here are some comments on your numbers. The short story: your numbers are bogus. The long story follows.

I don't know how you came up with critical values for what you think is a reasonable level of significance.

For n = 96 trials, the critical values are:
  60 for .01 level of significance
  57 for .05 level of significance
  53 for .20 level of significance

For n = 84 trials, the critical values are:
  54 for .01 level of significance
  51 for .05 level of significance
  47 for .20 level of significance

For n = 180 trials, the critical values are:
  107 for .01 level of significance
  102 for .05 level of significance
   97 for .20 level of significance

For n = 8 trials, the critical values are:
  8 for .01 level of significance
  7 for .05 level of significance
  6 for .20 level of significance

For n = 7 trials, the critical values are:
  7 for .01 level of significance
  7 for .05 level of significance
  6 for .20 level of significance

For n = 15 trials, the critical values are:
  13 for .01 level of significance
  12 for .05 level of significance
  10 for .20 level of significance

The values you provide for what you call 95% confidence (i.e., .05 level of significance) are almost the correct values for 20% significance.

You make much of an apparently borderline significant result, where the best individual cable test scores were 11 of 15 correct. If that had been the entire experiment, we would have a p-value of .059, reflecting the probability that one would do at least that well in a single run of 15 trials. That is, the probability that someone would score 11, 12, 13, 14, or 15 correct just by guessing is .059; also, the probability that the score is less than 11 would be 1 - .059 = .941 for a single batch of 15 trials. But what was reported was the best such performance in a dozen sets of trials. That's not the same as a single run of 15 trials.
Thus we get 1 - (.941)^(12), which is about .52. So, even your star performer is not doing better than chance suggests he should. Mr. Lavo, your conjecture (that that the test organizers have tried to distort the results to fit an agenda) appears to be without support. Now for some comments on the TAG McLaren report itself. There are problems with some numbers provided by TAG McLaren, but they are confined to background material. There do not appear to be problems with the actual report of experimental results. TAG McLaren claims that you need more than 10 trials to obtain results significant at the .01 level, but they are wrong. In fact, 7 trials suffice. With 10 trials you can reach .001. There is a table just before the results section of their report with some discussion about small sample sizes. The first several rows of that table have bogus numbers in the third column, and their sample size claims are based on those wrong numbers. However, the values for 11 or more trials are correct. As I have already noted, the numbers used in the report itself appear to be correct. Some would argue that their conclusion should be that they found no evidence to support a claim of audible difference, rather than concluding that there was no audible difference, but that's another issue. You have correctly noted concerns about the form of ABX presentation (not the same as the usual ABX scheme discussed on rahe) but that does not invalidate the experiment. There are questions about how the sample of listeners was obtained. For the most part, TAG McLaren seems to have designed a test, carried it out, run the numbers properly, and then accurately reported what they did. That's more than can be said for many tests. JC |
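(For anyone who wants to reproduce the two calculations above, here is a short sketch assuming Python with scipy; the trial counts and the 12-listener adjustment are taken from the post itself, nothing new is added.)

    from scipy.stats import binom

    def critical_score(n, alpha):
        """Smallest score k such that P(X >= k | pure guessing) <= alpha."""
        for k in range(n + 1):
            if binom.sf(k - 1, n, 0.5) <= alpha:
                return k

    for n in (96, 84, 180, 8, 7, 15):
        print(n, [critical_score(n, a) for a in (0.01, 0.05, 0.20)])

    # "Best of 12 listeners" adjustment for the 11-of-15 individual score:
    p_single = binom.sf(10, 15, 0.5)            # ~0.059 for a single listener
    p_best_of_12 = 1 - (1 - p_single) ** 12     # ~0.52, as computed in the post
    print(round(p_single, 3), round(p_best_of_12, 2))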
#25
On Thu, 31 Jul 2003 03:15:31 GMT, "Harry Lavo"
wrote: And would you explain why a significant level was reached on the "A" cable test with 96 trials? Was that "cherry picking". C'mon, Stewart, you know better. In fact the real issue here is: if one cable can be so readily picked out, why can't the other be? What is it in the test, procedure, quality of the cables, order bias, or what. Something is rotten in the beloved state of ABX here! All that the above proves (especially since the results were marginal at best) is that there's a random distribution in the results which favoured A *on this occasion*. Run that trial again, and I'll put an even money bet that you'll have a slight bias in favour of B. Anyone familiar with Statistical Process Control is well aware that one swallow doesn't make a summer. -- Stewart Pinkerton | Music is Art - Audio is Engineering |
#26
"Jim West" wrote in message
news:fnaWa.19572$cF.7720@rwcrnsc53... In article nR%Va.24664$YN5.23125@sccrnsc01, Harry Lavo wrote: You mean not accepting the "received truth" without doing my own analysis is cherry picking, is that it Stewart? We are not allowed to point out anonomlies and ask "why"? "how come"? "what could be causing this?" You are indeed cherry picking. With 12 individuals the probability that one would would appear to meet the 95% level is fairly high. Remember that you can expect 1 in 20 to meet that level entirely by random. It is not acceptable scientific practice to select specific data sub-sets out of the complete set. Otherwise you could "prove" anything by simply running enough trials and ignoring those you don't like. Check any peer reviewed journal. Yep it is more probable than one in twenty. But not so high that we have to accept your assumption that he/she *IS* the one-in-twenty. In any event, 11 out 15 has a probability of 5.9 % of occuring by chance. That does not meet the 95 % confidence level. It would be rejected in a peer reviewed statistical study. (If that was the only data more trials would called for. But it wasn't the only data.) My mistake..you are correct. So the number is only 94.1%. But 12 out of 15 yields 98.2%. Which is closer to the 95% standard? And would you explain why a significant level was reached on the "A" cable test with 96 trials? Was that "cherry picking". C'mon, Stewart, you know better. In fact the real issue here is: if one cable can be so readily picked out, why can't the other be? What is it in the test, procedure, quality of the cables, order bias, or what. Something is rotten in the beloved state of ABX here! Where are you getting your numbers? The data they posted on the web page showed that there were 52 correct answers in 96 trials. At least 52 correct answers will occur entirely by chance 23.8 % of the time. This is far from statistically significant. Oops! Late at night and the only book of binomial probabilites I had handy contained only raw data, not the cumulative error tables that I was used to from my previous work. So I goofed. I agree that the three-to-one odds are not statistically significant, and so the full panel results are null for both cables. My apologies. The issue with the individuals still stands, however. |
#27
In article l2mWa.36515$Ho3.6598@sccrnsc03, Harry Lavo wrote:
"Jim West" wrote in message news:fnaWa.19572$cF.7720@rwcrnsc53... You are indeed cherry picking. With 12 individuals the probability that one would would appear to meet the 95% level is fairly high. Remember that you can expect 1 in 20 to meet that level entirely by random. It is not acceptable scientific practice to select specific data sub-sets out of the complete set. Otherwise you could "prove" anything by simply running enough trials and ignoring those you don't like. Check any peer reviewed journal. Yep it is more probable than one in twenty. But not so high that we have to accept your assumption that he/she *IS* the one-in-twenty. Where did I assume that? In any event, 11 out 15 has a probability of 5.9 % of occuring by chance. That does not meet the 95 % confidence level. It would be rejected in a peer reviewed statistical study. (If that was the only data more trials would called for. But it wasn't the only data.) My mistake..you are correct. So the number is only 94.1%. But 12 out of 15 yields 98.2%. Which is closer to the 95% standard? Irrelevant. You can't arbitrarily play with the numbers. Where are you getting your numbers? The data they posted on the web page showed that there were 52 correct answers in 96 trials. At least 52 correct answers will occur entirely by chance 23.8 % of the time. This is far from statistically significant. Oops! Late at night and the only book of binomial probabilites I had handy contained only raw data, not the cumulative error tables that I was used to from my previous work. So I goofed. I agree that the three-to-one odds are not statistically significant, and so the full panel results are null for both cables. My apologies. The issue with the individuals still stands, however. No it doesn't. Even with cherry picking (which is not valid anyway) no individual performed at a statistically signficant level. Period. |
#28
"Harry Lavo" wrote in message
news:93mWa.36669$uu5.4445@sccrnsc04 "normanstrong" wrote in message news:lfaWa.29024$uu5.3508@sccrnsc04... Golly, after reading the entire article, I'm impressed. Faced with these results, it's pretty hard to attack McLaren's article as being poor science. http://www.tagmclaren.com/members/news/news77.asp I see nothing in the results that is inconsistent with the hypothesis that there are no audible differences in both the cables and amps. Then you have not looked closely at and thought about some of this results/issues I have raised. I know Norm pretty well and he's very heavy into statistics, courtesy of a successful career at Fluke as a test equipment designer. He looks at discussions of statistics with a very critical, practical eye. As far as the critical issues that have been raised, I think that they speak for themselves, pretty weakly. |
#29
On Fri, 01 Aug 2003 04:32:05 GMT, "Harry Lavo"
wrote: "normanstrong" wrote in message news:lfaWa.29024$uu5.3508@sccrnsc04... Golly, after reading the entire article, I'm impressed. Faced with these results, it's pretty hard to attack McLaren's article as being poor science. http://www.tagmclaren.com/members/news/news77.asp I see nothing in the results that is inconsistent with the hypothesis that there are no audible differences in both the cables and amps. Then you have not looked closely at and thought about some of this results/issues I have raised. They have indeed been closely examined, and your distortions and wild speculations have been debunked by several posters. Now, since the TAG results are yet another nail in your 'everything sounds different' coffin, exactly where is there one single *shred* of evidence to support your position? As a sad footnote, it must be observed that TAG-McLaren is one of the very few 'high end' companies which brought genuine engineering talent to bear on an attempt to improve the quality of music reproduction in the home, backed by considerable financial muscle. Regrettably (but perhaps predictably), they found that honesty and skill were *not* the way to make money in so-called 'high end' audio. R.I.P....... -- Stewart Pinkerton | Music is Art - Audio is Engineering |
#30
(Nousaine) wrote in message news:1KlWa.36390$uu5.4253@sccrnsc04...
> (Thomas A) wrote:
> ....some snip..... some more snip...
>> Were the "best scorers" allowed to repeat the experiments in the main
>> experiment?
>
> This did not appear to be part of the protocol for any, but subject
> analysis was common.

My experience with blind tests is that results do vary among subjects. I made a test where the CD players were not level-matched (something just below 0.5 dB difference). Two persons, including myself, could verify a difference in a blind test in my home setup (bass-rich music, no test signals used), whereas two other persons were unable to do it. Thus a retest with the best scorers in a test is something which might be desirable.

>> Many questions, but they may be relevant when making a meta-analysis. In
>> addition, have any of the experiments used test signals in the LF range
>> (around 15-20 Hz) and high-capable subwoofers (120 dB SPL @ 20 Hz)?
>
> No. But there are no commercially available subwoofers that will do 120+ dB
> at 2 meters in a real room. I've tested dozens and dozens and the only ones
> with this capability are custom.

Agree that commercial subs with +120 dB in room are hard to find.

>> I'm just curious since the tests from the Swedish Audio-Technical Society
>> frequently identify amplifiers that roll off in the low end using blind
>> tests.
>
> The typical half-power point for my stock of a dozen power amplifiers is
> 6 Hz. I've not seen the SATS data though.

Have you ever tested yourself? You have a quite bass-capable system if I remember correctly. You would need music with very deep and high-quality bass, or test tones, and perhaps a setup as described in the link (a reference amp). The reference amp used by SATS has for many years been the NAD 208. Other amps rated good include e.g. the Rotel RB1090. I am not sure at the moment which ones were rated "not-so-good", but I can look it up. The method they use is a "before-and-after" test. http://www.sonicdesign.se/amptest.htm

>> It might not be said to be an audible difference since the difference is
>> perceived as a difference in vibrations in the body. I think I mentioned
>> this before. Also for testing CD players, has anybody used a sin2 pulse in
>> evaluating audible differences?
>
> Not that I know of.

Ok. Maybe somebody (Arny?) could present scope pictures of sin2 pulses of various DACs and CD players? In addition, present them as audio files on the pcabx site? It would be interesting to see whether those players or DACs with distorted pulses (they do exist...) could be revealed in a DBT. Especially players with one-bit and true multi-bit.
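(For readers unfamiliar with the sin2 pulse being discussed: it is a single raised-cosine burst. A minimal sketch that writes one to a WAV file using only the Python standard library; the 1 ms width and 48 kHz rate are arbitrary illustrative choices, not a standard.)

    import math, struct, wave

    rate = 48000                 # sample rate, Hz (assumed)
    width = 0.001                # pulse width, seconds (~1 ms, assumed)
    n = int(rate * width)
    pulse = [math.sin(math.pi * i / n) ** 2 for i in range(n + 1)]   # sin^2 shape
    silence = [0.0] * rate       # one second of silence on either side

    frames = b"".join(struct.pack("<h", int(s * 32000))
                      for s in silence + pulse + silence)

    with wave.open("sin2_pulse.wav", "wb") as w:
        w.setnchannels(1)        # mono
        w.setsampwidth(2)        # 16-bit samples
        w.setframerate(rate)
        w.writeframes(frames)

Played through a device and captured back, the reproduced pulse shape (ringing, asymmetry, overshoot) is roughly the kind of scope comparison being suggested above; whether such differences survive a DBT is the open question in the post.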
#31
"Stewart Pinkerton" wrote in message
news:iRlWa.36496$uu5.4473@sccrnsc04... On Thu, 31 Jul 2003 03:15:31 GMT, "Harry Lavo" wrote: And would you explain why a significant level was reached on the "A" cable test with 96 trials? Was that "cherry picking". C'mon, Stewart, you know better. In fact the real issue here is: if one cable can be so readily picked out, why can't the other be? What is it in the test, procedure, quality of the cables, order bias, or what. Something is rotten in the beloved state of ABX here! All that the above proves (especially since the results were marginal at best) is that there's a random distribution in the results which favoured A *on this occasion*. Run that trial again, and I'll put an even money bet that you'll have a slight bias in favour of B. Anyone familiar with Statistical Process Control is well aware that one swallow doesn't make a summer. -- This wasn't one swallow, Stewart...this was 96 swallows in one case and 84 in another. A whole pride of swallows. So the odds of such severe swings in opposite directions (even if only marginal) is not very high over the whole pride. We are not talking a sample of two here. |
#32
"John Corbett" wrote in message
news:hQlWa.37169$YN5.32913@sccrnsc01...

In article zVGVa.15179$Ho3.2323@sccrnsc03, "Harry Lavo" wrote:

Thomas - Thanks for the post of the Tag McLaren test link (and to Tom for the other references). I've looked at the Tag link and suspect it's going to add to the controversy here. My comments on the test follow.

From the tone of the web info on this test, one can presume that Tag set out to show its relatively inexpensive gear was just as good as some acknowledged industry standards. But....wonder why Tag chose the 99% confidence level, while being careful *not* to say that it was prechosen in advance? It is because, had they used the more common and almost universally-used 95% level, it would have shown that:

* When cable A was the "X" it was recognized at a significant level by the panel (and guess whose cable probably would "lose" in a preference test versus a universally recognized standard of excellence chosen as "tops" by both Stereophile and TAS, as well as by other industry publishers)
* One individual differentiated both cable A and the combined cables at the significant level

Results summarized as follows:

Tag McLaren Published ABX Results

                    Sample    99%    95%      Actual   Confidence
Total Test
  Cables   A          96       60     53 e      52      94.8% e
           B          84       54     48 e      38      coin toss
           Both      180      107     97 e      90      coin toss
  Amps     A          96       60     53 e      47      coin toss
           B          84       54     48 e      38      coin toss
           Both      180      107     97 e      85      coin toss
Top Individuals
  Cables   A           8        8      7         6      94.5%
           B           7        7      7         5      83.6%
           Both       15       13     11        11      95.8%
  Amps     A           8        8      7         5      83.6%
           B           7        7      7         5      83.6%
           Both       15       13     11        10      90.8%

e = extrapolated based on scores for 100 and 50 sample size

[snip]

I'd like to hear other views on this test.

Mr. Lavo, here are some comments on your numbers. The short story: your numbers are bogus. The long story follows.

I don't know how you came up with critical values for what you think is a reasonable level of significance.

For n = 96 trials, the critical values are 60 at the .01 level of significance, 57 at the .05 level, and 53 at the .20 level.
For n = 84 trials: 54 at .01, 51 at .05, 47 at .20.
For n = 180 trials: 107 at .01, 102 at .05, 97 at .20.
For n = 8 trials: 8 at .01, 7 at .05, 6 at .20.
For n = 7 trials: 7 at .01, 7 at .05, 6 at .20.
For n = 15 trials: 13 at .01, 12 at .05, 10 at .20.

The values you provide for what you call 95% confidence (i.e., .05 level of significance) are almost the correct values for 20% significance.

You make much of an apparently borderline significant result, where the best individual cable test score was 11 of 15 correct. If that had been the entire experiment, we would have a p-value of .059, reflecting the probability that one would do at least that well in a single run of 15 trials. That is, the probability that someone would score 11, 12, 13, 14, or 15 correct just by guessing is .059; also, the probability that the score is less than 11 would be 1 - .059 = .941 for a single batch of 15 trials. But what was reported was the best such performance in a dozen sets of trials. That's not the same as a single run of 15 trials.

The probability of at least one of 12 subjects doing at least as well as 11 of 15 is 1 - [probability that all 12 do worse than 11 of 15]. Thus we get 1 - (.941)^12, which is about .52. So even your star performer is not doing better than chance suggests he should.

Mr. Lavo, your conjecture (that the test organizers have tried to distort the results to fit an agenda) appears to be without support.

Now for some comments on the TAG McLaren report itself. There are problems with some numbers provided by TAG McLaren, but they are confined to background material. There do not appear to be problems with the actual report of experimental results.

TAG McLaren claims that you need more than 10 trials to obtain results significant at the .01 level, but they are wrong. In fact, 7 trials suffice. With 10 trials you can reach .001. There is a table just before the results section of their report with some discussion about small sample sizes. The first several rows of that table have bogus numbers in the third column, and their sample size claims are based on those wrong numbers. However, the values for 11 or more trials are correct. As I have already noted, the numbers used in the report itself appear to be correct.

Some would argue that their conclusion should be that they found no evidence to support a claim of audible difference, rather than concluding that there was no audible difference, but that's another issue. You have correctly noted concerns about the form of ABX presentation (not the same as the usual ABX scheme discussed on rahe), but that does not invalidate the experiment. There are questions about how the sample of listeners was obtained. For the most part, TAG McLaren seems to have designed a test, carried it out, run the numbers properly, and then accurately reported what they did. That's more than can be said for many tests.

John - I have explained that I made an error, and I thank you for pointing it out. I also explained how and why, but that is an explanation, not an excuse.

Perhaps you could bring your statistical skills to bear on the Greenhill test as reported in Stereo Review in 1983. The raw results were posted here by me and by Ludovic about a year ago. As I recall, one of the participants in that test did very well across several tests..I'd be interested in your calculation of the probability of his achieving those results by chance. This is not a troll..the mathematics of it is simply beyond me, and I tried at the time to calculate the odds and apparently failed. The argument was: outlier, or golden ear.
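The binomial arithmetic above can be checked with a few lines of Python; a minimal sketch (standard library only, function names illustrative), assuming each trial is an independent 50/50 guess:

    # Reproduce the exact binomial critical values and the "best of 12 listeners"
    # adjustment discussed above, assuming pure guessing (p = 0.5 per trial).
    from math import comb

    def tail_prob(n, k, p=0.5):
        """P(X >= k) for X ~ Binomial(n, p): chance of k or more correct by guessing."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    def critical_value(n, alpha, p=0.5):
        """Smallest score whose guessing probability is at most alpha."""
        for k in range(n + 1):
            if tail_prob(n, k, p) <= alpha:
                return k
        return None

    # Should reproduce the critical values quoted above for the .01, .05, .20 levels:
    for n in (96, 84, 180, 8, 7, 15):
        print(n, [critical_value(n, a) for a in (0.01, 0.05, 0.20)])

    # Single-listener p-value for 11 of 15, then the chance that the best of
    # 12 independent guessers does at least that well:
    p_single = tail_prob(15, 11)                # about .059
    p_best_of_12 = 1 - (1 - p_single) ** 12     # about .52
    print(round(p_single, 3), round(p_best_of_12, 2))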
#33
"Stewart Pinkerton" wrote in message
... On Fri, 01 Aug 2003 04:31:13 GMT, "Harry Lavo" wrote:

"Jim West" wrote in message news:fnaWa.19572$cF.7720@rwcrnsc53...

In article nR%Va.24664$YN5.23125@sccrnsc01, Harry Lavo wrote:

You mean not accepting the "received truth" without doing my own analysis is cherry picking, is that it, Stewart?

No, attempting to extract *only* those sub-tests which agree with your prejudices is 'cherry picking', and even then, you can't make it stick on the numbers in that series of tests.

We are not allowed to point out anomalies and ask "why"? "how come"? "what could be causing this?"

You are indeed cherry picking. With 12 individuals the probability that one would appear to meet the 95% level is fairly high. Remember that you can expect 1 in 20 to meet that level entirely at random. It is not acceptable scientific practice to select specific data sub-sets out of the complete set. Otherwise you could "prove" anything by simply running enough trials and ignoring those you don't like. Check any peer-reviewed journal.

Yep, it is more probable than one in twenty. But not so high that we have to accept your assumption that he/she *IS* the one-in-twenty.

Unfortunately for your speculation, that individual did poorly in the amplifier test. Similarly, the best scorers in the amp test did poorly on cables. IOW, the results were *random*, and did not show *any* sign of a genuine audible difference, despite your many and convoluted attempts to distort the data to fit your agenda.

Ah, the "received truth", better known as dogma. And why, pray tell, Stewart, is your explanation any more valid than my supposition that the cable test and the amp test may have revealed different attributes of reproduction of that particular musical piece at the margin, and that the two men were each more sensitive to one than the other?
#34
"Jim West" wrote in message
et... In article l2mWa.36515$Ho3.6598@sccrnsc03, Harry Lavo wrote:

"Jim West" wrote in message news:fnaWa.19572$cF.7720@rwcrnsc53...

You are indeed cherry picking. With 12 individuals the probability that one would appear to meet the 95% level is fairly high. Remember that you can expect 1 in 20 to meet that level entirely at random. It is not acceptable scientific practice to select specific data sub-sets out of the complete set. Otherwise you could "prove" anything by simply running enough trials and ignoring those you don't like. Check any peer-reviewed journal.

Yep, it is more probable than one in twenty. But not so high that we have to accept your assumption that he/she *IS* the one-in-twenty.

Where did I assume that?

The dismissal of any single recipient scoring at a significant or near-significant level is almost always described here as an outlier. In some cases this seems true...in others not so true. To say for sure that somebody is an outlier means one is assuming a certainty of "1", in other words, 100% sure. As opposed to saying there is a 25% chance that he may be an outlier. Or a 50/50 chance. So implicit in the assertion, if the true probability for an individual in a given trial is one-in-twenty and you have 12 subjects, is that that "one" is the one-in-twenty, is the one-in-twelve. Which may or may not be the case. But it is certainly no sure thing.

In any event, 11 out of 15 has a probability of 5.9% of occurring by chance. That does not meet the 95% confidence level. It would be rejected in a peer-reviewed statistical study. (If that was the only data, more trials would be called for. But it wasn't the only data.)

My mistake..you are correct. So the number is only 94.1%. But 12 out of 15 yields 98.2%. Which is closer to the 95% standard?

Irrelevant. You can't arbitrarily play with the numbers.

I'm not playing with numbers. When you establish a significance level you are establishing a certainty level that the community considers acceptable odds. A 95% confidence level reflects one-in-twenty odds. A 98% confidence level reflects one-in-fifty odds. 94.1% reflects one-in-eighteen-and-a-half odds. Which seems qualitatively closer to one-in-twenty to you? With a large sample size, stating the number needed for significance is okay..exceeding it by one doesn't distort results. But with small sample sizes such as we are talking about here, there is a big statistical difference between 11 of 15 and 12 of 15. And the eleven-of-fifteen best approaches the 95% standard.

Where are you getting your numbers? The data they posted on the web page showed that there were 52 correct answers in 96 trials. At least 52 correct answers will occur entirely by chance 23.8% of the time. This is far from statistically significant.

Oops! Late at night, and the only book of binomial probabilities I had handy contained only raw data, not the cumulative error tables that I was used to from my previous work. So I goofed. I agree that the three-to-one odds are not statistically significant, and so the full panel results are null for both cables. My apologies. The issue with the individuals still stands, however.

No it doesn't. Even with cherry picking (which is not valid anyway) no individual performed at a statistically significant level. Period.

Again, repeat the dogma (and the mantra). Don't bother to think about what the numbers really translate to. I don't know of any social researcher requiring odds of 98% or 99% confidence level. Do you really believe 95% is "significant" while 94.2% is not, in any meaningful, non-dogmatic way?
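For reference, the specific figures being traded here (5.9%, 98.2%, 23.8%) all fall straight out of the binomial tail; a minimal check, again assuming pure 50/50 guessing:

    from math import comb

    def p_at_least(n, k):
        """Chance of k or more correct out of n by guessing alone (p = 0.5)."""
        return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

    print(p_at_least(15, 11))   # about 0.059 -> the "94.1%" figure for 11 of 15
    print(p_at_least(15, 12))   # about 0.018 -> the "98.2%" figure for 12 of 15
    print(p_at_least(96, 52))   # about 0.238 -> 52 of 96 for the full panel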
#35
On 2 Aug 2003 00:38:44 GMT, "Harry Lavo" wrote:
"Stewart Pinkerton" wrote in message news:iRlWa.36496$uu5.4473@sccrnsc04... All that the above proves (especially since the results were marginal at best) is that there's a random distribution in the results which favoured A *on this occasion*. Run that trial again, and I'll put an even money bet that you'll have a slight bias in favour of B. Anyone familiar with Statistical Process Control is well aware that one swallow doesn't make a summer. This wasn't one swallow, Stewart...this was 96 swallows in one case and 84 in another. A whole pride of swallows. That's a very poor flocking argument, since a proper evaluation of the test, using *all* the results, shows that the amps and cables were most definitely *not* sonically distinguishable. Further, those test subjects who scored well on cables scored poorly on amps, and vice versa, proving the point that they could not reliably distinguish any differences. So the odds of such severe swings in opposite directions (even if only marginal) is not very high over the whole pride. We are not talking a sample of two here. We are also not talking of any 'swings' that have statistical significance. This has been pointed out to you several times by several people, but you still insist on attempting to distort the results to fit your own prejudices. -- Stewart Pinkerton | Music is Art - Audio is Engineering |
#36
On Sat, 02 Aug 2003 05:02:21 GMT, "Harry Lavo"
wrote: "Stewart Pinkerton" wrote in message ... Unfortunately for your speculation, that individual did poorly in the amplifier test. Similarly, the best scorers in the amp test did poorly on cables. IOW, the results were *random*, and did not show *any* sign of a genuine audible difference, despite your many and convoluted attempts to distort the data to fit your agenda. Ah, the "received truth", better known as dogma. No, the simplest and most likely explanation of the results. And why, pray tell, Stewart, is your explanation any more valid than my supposition that the cable test and the amp test may have revealed different attributes of reproduction of that particular musical piece at the margin, and that the two men were each more sensitive to one than the other.? Occam's Razor. You are attempting to speculate that the dark side of the Moon may indeed have large sections which are made of green cheese. While possible, this is unlikely, and most reasonable people would not put real money on it. The same applies to 'cable sound'. -- Stewart Pinkerton | Music is Art - Audio is Engineering |
#37
On Sat, 02 Aug 2003 05:02:40 GMT, "Harry Lavo"
wrote: "Jim West" wrote in message . net... Irrelevant. You can't arbitrarily play with the numbers. I'm not playing with numbers. When you establish a significance level you are establishing a certainly level that the community considers accepatble odds. 95% confidence level reflects one-in-twenty odds. A 98% confidence level reflects a one-in-fifty odds. 94.1% reflects a one-in-eighteen and a half odds. Which seems qualitatively closer to one-in-twenty to you. With a large sample size, stating the number needed for significance is okay..exceeding it by one doesn't distort results. But with small sample sizes such as we are talking about here, there is a big statistical difference beween 11 of 15 and 12 of 15. And the eleven of fifteen best approaches the 95% standard. Fine. Now take the fact that there were a dozen volunteers, and you find that you have a 12 out of 18.5 chance of achieving that result by random chance. That's not much more than even odds. Now tell me that this has any significance whatever. The issue with the individuals still stands, however. No it doesn't. Even with cherry picking (which is not valid anyway) no individual performed at a statistically signficant level. Period. Again, repeat the dogma (and the mantra). Don't bother to think about what the numbers really translate to. I don't know of any social researcher requiring odds of 98% or 99% confidence level. Do you really believe 95% is "significant" while 94.2% is not, in any meaningful, non-dogmatic way? More importantly, since there were 12 test subjects, the *real* statistical probability of one subject scoring 11 out of 15 is not 94.2%, but 65%. That is simply not significant by *any* standard, especially when combined with the fact that those test subjects who did well on cables did poorly on amps, and vice versa. It's all just random chance, and all the cherry picking in the world won't change that. -- Stewart Pinkerton | Music is Art - Audio is Engineering |
#38
(Thomas A) wrote:
(Nousaine) wrote in message news:1KlWa.36390$uu5.4253@sccrnsc04...

(Thomas A) wrote:

....some snip.....

some more snip...

Were the "best scorers" allowed to repeat the experiments in the main experiment?

This did not appear to be part of the protocol for any, but subject-by-subject analysis was common.

My experience with blind tests is that results do vary among subjects. I made a test where the CD players were not level-matched (something just below a 0.5 dB difference). Two persons, including myself, could verify a difference in a blind test in my home setup (bass-rich music, no test signals used), whereas two other persons were unable to do it. Thus a retest with the best scorers in a test is something which might be desirable. In a situation like this, with 2 of 4 persons scoring significantly, it's likely that the overall score would be significant as well.

It was typical for analysis to be conducted on a subject-by-subject basis to find significant individual scores. In tests where the overall result was null there did not appear to be cases where individually positive scores were covered by totals. I have offered retests in most of my personally conducted experiments. In these I have experienced exactly one subject who asked to extend the number of trials in an experiment, one who retook a test at my request, and one who accepted an opportunity for a retest. This covers perhaps a dozen formal tests and several dozen subjects.

Many questions, but they may be relevant when making a meta-analysis. In addition, have any of the experiments used test signals in the LF range (around 15-20 Hz) and high-capable subwoofers (120 dB SPL @ 20 Hz)?

No. But there are no commercially available subwoofers that will do 120+ dB at 2 meters in a real room. I've tested dozens and dozens and the only ones with this capability are custom.

Agree that commercial subs with +120 dB in room are hard to find. I'm just curious, since the tests from the Swedish Audio-Technical Society frequently identify amplifiers that roll off in the low end using blind tests.

The typical half-power point for my stock of a dozen power amplifiers is 6 Hz. I've not seen the SATS data though.

Have you ever tested yourself? You have a quite bass-capable system if I remember correctly. You would need music with very deep and high-quality bass, or test tones, and perhaps a setup as described in the link (a reference amp).

I've measured the frequency response of the amplifiers and, yes, my subwoofer will produce 120 dB+ SPL from 12 to 62 Hz with less than 10% distortion. Perhaps not surprisingly, it takes a 5000-watt-capable amplifier to make these SPL levels, but I've never felt 'cheated of bass' when the system is driven with 2 channels of a 250 wpc stereo amplifier with ordinary programs, some of which have frequency content below 10 Hz.

The reference amp used by SATS for many years has been the NAD 208. Other amps rated good include e.g. the Rotel RB1090. I am not sure at the moment which ones were rated "not-so-good", but I can look it up. The method they use is a "before-and-after" test. http://www.sonicdesign.se/amptest.htm It might not be said to be an audible difference, since the difference is perceived as a difference in vibrations in the body.

I think I mentioned this before. Basically you have to have the woofer displacement/amp power to start. I haven't conducted a formal test about this, but I'm guessing that speaker displacement is a bigger issue than amplifier bandwidth.

IOW, I'm guessing that most modern SS amplifiers have low-frequency bandwidth to cover modern programs and that the basic limiting factor is the subwoofer transducer(s).

...snip remainder...
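As a rough illustration of why a 6 Hz half-power point leaves little room for audible bass rolloff: if the amplifier's low end behaves like a simple first-order high-pass (an assumption for the sketch, not a measurement from the thread), the loss at audible bass frequencies is a fraction of a dB.

    import math

    def highpass_loss_db(f, f3=6.0):
        """Attenuation in dB at frequency f for a first-order high-pass with -3 dB at f3."""
        return 20 * math.log10(f / math.sqrt(f**2 + f3**2))

    for f in (10, 15, 20, 30, 40):
        print(f, round(highpass_loss_db(f), 2))
    # about -1.3 dB at 10 Hz, -0.6 dB at 15 Hz, -0.4 dB at 20 Hz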
#39
I can recommend blind testing beer; nobody can taste any difference within the same type of beer anyway. Lots of money to save....