#1
Just so that we don't misunderstand exactly what is happening during one of these comparative tests: note the implementation of ABC/Hr being used: http://ff123.net/abchr/abchr.html

The reference is not being presented as another codec. In the process of rating each codec, the listener must choose from the hidden reference. This is different from MUSHRA, where there is a hidden reference, but which is typically presented as another codec. I think we're both on the same page when talking about ABC/Hr, but I just want to make sure.

When I speak of a bias, I am mostly concerned that one codec is not falsely preferred over another. In the ABC/Hr implementation we're discussing, suppose a listener downrates the hidden reference for one or more codecs. Such a listener is not an *outlier* -- he is clearly *incorrect*. Then, the procedure used would be to discard that particular listener's entire set of results for all codecs.

After checking the test description, I see that the test has a built-in assumption that the reference is best, and asks the subject to rate the codec in terms of the amount of "annoying degradation". If a subject is consistently rating a hidden reference as degraded under those conditions, I would think that what's happening is this: the subject hears that particular codec as euphonic. That doesn't fit the test assumption, and there isn't any way to rate the codec as better than the reference. The easy way out of that dilemma, rather than thinking about it enough to recognize what has happened, is to assume that the reference is the better one, as the tester says it should be, and that if the subject thinks otherwise they must have misidentified it. So they pick the better one as the reference and report degradation on the other one.

If you discard the data from the people who did this, rather than recognizing what it means, wouldn't that create a severe bias in this testing procedure against any codec that sounds euphonic to some people?

Bob
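For illustration, a rough Python sketch of the discard rule described above. The data layout and the 1.0-5.0 rating scale are assumptions made for the example, not details taken from the ABC/Hr tool linked above.

    # Each trial yields two ratings on an assumed 1.0-5.0 scale
    # (5.0 = no audible degradation): one for the hidden reference,
    # one for the codec under test.

    def listener_is_discarded(trials):
        """trials: list of (reference_rating, codec_rating) pairs for one listener.

        Under the procedure described above, a listener who rates the hidden
        reference as more degraded than the codec on any trial is treated as
        having misidentified the reference, and their *entire* result set,
        for all codecs, is discarded.
        """
        return any(ref < codec for ref, codec in trials)

    # Hypothetical listener who found one codec euphonic and downrated the
    # reference once; all of their results, even for the other codecs, go.
    listener = [(5.0, 3.2), (4.1, 4.8), (5.0, 2.9)]
    print(listener_is_discarded(listener))  # True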
#2
Bob wrote in message:

> After checking the test description, I see that the test has a built-in assumption that the reference is best, and asks the subject to rate the codec in terms of the amount of "annoying degradation".

I think it is safe to say that the ABC/hr test is based on the widely-accepted concept of "sonic accuracy".

> If a subject is consistently rating a hidden reference as degraded under those conditions, I would think that what's happening is this: the subject hears that particular codec as euphonic. That doesn't fit the test assumption, and there isn't any way to rate the codec as better than the reference. The easy way out of that dilemma, rather than thinking about it enough to recognize what has happened, is to assume that the reference is the better one, as the tester says it should be, and that if the subject thinks otherwise they must have misidentified it. So they pick the better one as the reference and report degradation on the other one.

Within the context of the concept of sonic accuracy, there is no way that anything can sound better than perfect reproduction of the reference.

> If you discard the data from the people who did this, rather than recognizing what it means, wouldn't that create a severe bias in this testing procedure against any codec that sounds euphonic to some people?

Within the context of the sonic accuracy model, there is no such thing as euphonic coloration. All audible changes represent inaccurate reproduction, which is undesirable. This is reasonable because there is no universal agreement about what constitutes euphonic coloration.
#3
I said:

Wrong. In the Dave Clark test, listener #2 got 30/48 correct, with a statistical reliability of hearing a difference of 94%. Listener #6 got 26/48, with a statistical probability of 84% of hearing differences. Listener #15 got 15/21 correct, with an 81% chance of hearing a difference.

Keith hughs said:

I'd like to know how these probabilities were calculated. For example, in the case where the listener heard a difference in 30 of 48 trials, one can do a simple F-test between populations (i.e. 30/48 versus 24/48 - the expectation if results are due to random guessing). For this example, F-critical is 1.623, with F-calculated of 1.067, i.e. the population variances are not different at the .05 level. One can then compare the means of the populations using a simple t-test for populations with equal variances. The t-critical is then 1.99 - two-tailed, or 1.66 - one-tailed, with t-calculated = 1.23. Thus the means (i.e. RESULTS of the listener vs. random guessing) are not statistically different at the 0.05 (95% probability) level. Nor are they significant at the 0.10 (90% confidence) level.

I got the numbers from the article.

Bob said:

I don't recall this article, but this conclusion seems to be well supported by the data you cite. If the best performance of the group wasn't statistically significant at a 95% confidence level, then it's perfectly reasonable to say that no listener was able to identify the amps in the test. (Note: saying they couldn't is not the same as saying they can't. As I noted above, we can never say definitively that they can't; we can only surmise from their--and everybody else's--inability to do so.)

I said:

That is ridiculous. If all of them scored 94%, would it be reasonable to say this?

Kieth said:

I believe Bob is talking about 95% confidence interval, *not* 95% scores.

Yes, I know.

Kieth said:

And yes, it is very common to require a 95% confidence level.

How often are 94% confidence results regarded as a null when one is seeking 95% confidence? That is, in fact, a subjective choice. Bottom line is that the tests were inconclusive in this particular case.

I said:

No. It all depends on how they fall into the bell curve. But even this is problematic for two reasons. 1. The listeners were never tested for sensitivity to subtle differences. The abilities of the participants will profoundly affect any bell curve.

Kieth said:

No, the variance in those abilities is the *cause* of the bell curve.

I was using the predicted bell curve outcome one would get if there are no audible differences. That bell curve is dictated by the number of samples.

Kieth said:

That's why sample size (panel size in this context) is so important, and why the weight given to one or two individuals' performances, in any test, must be limited.

I took that into consideration when I claimed the test results were inconclusive. That is why I suggested that the thing that should have been done was further testing of those individuals and the equipment that scored near or beyond what the bell curve predicts. That is why I don't run around claiming this test proves that some people hear differences. Let's also not forget that we have no tests on listener or system sensitivity to subtle audible differences. So we have unknown variables. Further, the use of many amps in no obvious pattern for comparisons introduces another variable that could profoundly affect the bell curve in unknown ways if some amps sound like each other and some don't. All of this leaves the interpretation of those results wide open.
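One plausible way to arrive at a figure like 94% for 30 correct out of 48 is a one-sided exact binomial test against chance guessing; a minimal Python sketch follows. This is only an assumption about how such a number could be produced, not a statement of the method the article actually used.

    from math import comb

    def binomial_tail(hits, trials, p=0.5):
        """Exact probability of getting at least `hits` correct out of
        `trials` purely by guessing (each trial an independent 50/50 call)."""
        return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
                   for k in range(hits, trials + 1))

    tail = binomial_tail(30, 48)
    print(f"P(>= 30 correct of 48 by chance) = {tail:.4f}")            # about 0.056
    print(f"implied confidence of a real difference = {1 - tail:.1%}")  # about 94%
    # Just short of the 95% level (p < 0.05) that the thread treats as the
    # conventional criterion.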
Kieth said:

Also why such high criteria, such as a 95% CI, are applied. Having a limited population, one cannot always assume a bell curve, making outliers more difficult to identify.

Sorry, but this is in some ways an arbitrary number. If a dozen people took such tests and half of them scored between 90% and 94% confidence, then this 95% limit would be an unreasonable one. OTOH, if a hundred people took the test and one scored at 95% confidence, I think you could argue that this does fall within the predicted bell curve of a null result. As it is, as far as I can tell, some of the results of the Clark test are close enough to the edge of the bell curve prediction, or beyond it, to warrant further investigation and to preclude anyone from drawing definitive conclusions. Statistical analysis is not my strength in math, but that is how it looks to me.

I said:

2. Many different amps were used. We have no way of knowing that we didn't have a mix of some amps sounding different and some sounding the same. The Counterpoint amp was not only identified with a probability of 94%; given there were 8 different combinations, it fell outside the predicted bell curve, if my math is right.

Kieth said:

What math did you use? Not having the data you're using to hand, I'm curious how you calculated such a high probability.

I used the numbers given in the article. The article gave a 94% confidence level on the Counterpoint amp compared to an NAD amp. There were a total of 8 comparisons between different amps. I was rounding, by the way. The number is actually 94.4%.

I said:

Bottom line is that you cannot draw definitive conclusions either way. If the one listener had made one more correct ID he would have been well above 94%. I doubt that one can simply make 95% probability a barrier of truth.

Kieth said:

You should read more scientific literature then. It is the most commonly used confidence level IME. Especially with limited population size.

I will look into it, but I will be quite surprised if that number does not heavily depend on the sample sizes. It has to vary with sample size.

I said:

It is ridiculous. Bell curves don't work that way.

Kieth said:

Nonsense, of course they do. The fact that there are two-tailed bell curves, in most population responses, is the genesis for use of high confidence intervals. Because there *are* tails - way out on the edges of the population - allowance for these tails must be made when comparing populations for significant heterogeneity.

Maybe you didn't get what I was trying to say. What does and does not fall within the predictions of a bell curve depends heavily on the number of samples.

I said:

No foot stamping is needed. The test was quite inconclusive.

Kieth said:

From what's been presented here, I'd say it was quite conclusive. The data do not support rejecting the null hypothesis (i.e. that the amps sound the same), for the test population, under the test protocol and conditions used. Not knowing the population size, or protocol used, I wouldn't venture an opinion on whether it may or may not be applicable to the broader population.

So you are drawing definitive conclusions from one test without even knowing the sample size? I think you are leaping without looking. You are entitled to your opinions. I don't find your arguments convincing so far.

I said:

Had the test been put in front of a scientific peer review with the same data and the same conclusions, that panel would have sent it back for corrections. The analysis was wrong, scientifically speaking.
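Two of the numeric claims in this exchange can be checked directly with an exact binomial calculation. A small Python sketch; the independence of the 8 amp comparisons is my assumption for the illustration, not something stated in the article.

    from math import comb

    def chance_of_at_least(hits, trials):
        """P(at least `hits` correct out of `trials` by 50/50 guessing)."""
        return sum(comb(trials, k) for k in range(hits, trials + 1)) / 2**trials

    # Claim checked: "If the one listener had made one more correct ID he
    # would have been well above 94%."
    for hits in (30, 31):
        conf = 1 - chance_of_at_least(hits, 48)
        print(f"{hits}/48 correct -> confidence {conf:.1%}")
    # 30/48 is roughly 94%; 31/48 is roughly 97%, so a single trial does move
    # the result across the conventional 95% line.

    # Separate check: with 8 different amp comparisons, how often would at
    # least one of them reach ~94% confidence by luck alone, assuming the
    # comparisons are independent?
    p_single = chance_of_at_least(30, 48)        # about 0.056 per comparison
    p_any = 1 - (1 - p_single) ** 8
    print(f"Chance of at least one lucky ~94% result in 8 comparisons: {p_any:.0%}")
    # Comes out above one in three, which bears on whether a single such
    # result "falls out of the predicted bell curve".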
Kieth said:

Doesn't appear to be, based on what's been posted here. You appear to think that definitive results are required, in support of some postulate, for the conclusion to be valid (or acceptable for a review board). This is simply incorrect. Oftentimes, the data obtained from a test are not useful, or positive, relative to supporting a postulate, but that does not invalidate the data. Clearly, failure to reject the null hypothesis, in any experiment, does not invalidate the test, nor is it something requiring "correction". It is merely part of the database that the author, and others, build on in future research.

Where do I claim definitive results are required in support of some postulate? I would say definitive results are required for one to make claims of definitive results. Certainly such a test, as a part of a body of evidence, could be seen as supportive, but even that I think is dodgy, given that some of the results, as far as I can see, would call for further testing, and the absence of listener sensitivity testing alone makes it impossible to make any specific claims about the results. I never said the test was not valid due to its failure to reject the null. I simply said I think the results are such that they call for further investigation and are on the border between a null and a positive. Not the sort of thing one can base definitive conclusions on.

Kieth said:

It's instructive to also note that peer reviewed journals publish, not infrequently, two or more studies that contradict one another. You seem to believe this can't be the case, because the "wrong" ones would be sent back for "correction".

Nonsense. What I think will be sent back is a poor analysis. If the analysis fits the data, it won't be sent back. If someone is making claims of definitive conclusions based on this test, one is jumping the gun. Scientific research papers with huge sample sizes and apparently definitive results are usually carefully worded to not make definitive claims. That is a part of proper scientific prudence.

Kieht said:

ones would be sent back for "correction". A look through some select peer reviewed journals, such as the "Journal of Pharmaceutical Science and Technology", will show that to be mistaken.

Really? I'd like to see any peer reviewed published studies that make definitive claims based on sample sizes of 25 participants, especially when one had a 94% confidence score.

Kieth said:

Often in the case of biological studies, conflicting data are obtained, often due to unknown (at the time) and/or uncontrolled variables (often inherent to the specific population under study). These data, while contradictory, are still of great value for future studies.

I never suggested the data should be ignored or were valueless, only that they were inconclusive. I think the many uncontrolled variables in this specific test are problematic. Don't you?
#5
S888Wheel wrote:
snip

> I said: That is ridiculous. If all of them scored 94%, would it be reasonable to say this?
>
> Kieth said: I believe Bob is talking about 95% confidence interval, *not* 95% scores.
>
> Yes, I know.

Then talking about "scoring 94%" is meaningless.

> Kieth said: And yes, it is very common to require a 95% confidence level.
>
> How often are 94% confidence results regarded as a null when one is seeking 95% confidence? That is, in fact, a subjective choice. Bottom line is that the tests were inconclusive in this particular case.

You clearly don't understand statistical analysis. There is a null hypothesis that states, basically, that the populations (in this case, say the sound of Amp A and the sound of Amp B) are not different. You set your confidence level *prior* to your analysis, and yes, 95% is ubiquitous, and you either meet that level or you don't. There is no gray area. Irrespective of how "close" you may get to reaching the confidence level, you're either in, or out. When out, the results *ARE* conclusive - always. That conclusion is "you cannot reject the null hypothesis at the 0.05 level", *by definition*. Period. You cannot say, for example, "well, it was close, so they're probably different". You *can* calculate the confidence level within which you *can* reject the null hypothesis, if you want (it's approximately 78%, by my calculations, for a result of 30/48 versus baseline expectations of 24/48). A number that would be considered insignificant by all statisticians I've worked with. But again, by expanding the CI, you risk erroneously rejecting the null hypothesis unless the population is very large (i.e. sufficiently so to approximate a two-tailed bell curve).
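The "approximately 78%" figure above can be reproduced with the two-sample procedure described earlier in the thread (a variance-ratio check, then a pooled equal-variance t-test of 30/48 correct against the 24/48 chance expectation). A short Python sketch using SciPy, treating each trial as a 0/1 outcome; this mirrors that description and is not a calculation taken from the article itself.

    import numpy as np
    from scipy import stats

    # The listener's 48 trials as 0/1 outcomes (30 correct), versus the 24/48
    # expectation under pure guessing, treated as a second "population".
    listener = np.array([1] * 30 + [0] * 18)
    guessing = np.array([1] * 24 + [0] * 24)

    # Variance ratio: the sample variances are close, so a pooled
    # (equal-variance) t-test is the appropriate comparison of the means.
    f_calc = guessing.var(ddof=1) / listener.var(ddof=1)
    print(f"F-calculated ~= {f_calc:.3f}")                        # about 1.067

    t_calc, p_two_tailed = stats.ttest_ind(listener, guessing, equal_var=True)
    print(f"t-calculated ~= {t_calc:.2f}")                        # about 1.23
    print(f"two-tailed confidence ~= {1 - p_two_tailed:.0%}")     # about 78%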
snip

> No, the variance in those abilities is the *cause* of the bell curve.
>
> I was using the predicted bell curve outcome one would get if there are no audible differences. That bell curve is dictated by the number of samples.

The "Bell" curve is dictated by varying response levels. The degree to which the specific curve approximates a normal distribution *is* a function of population size (or number of samples, if you prefer). Hence the need for a very high CI when using small populations (like 15 panelists, for example), as you do *not* have a high degree of confidence that the population has a normal distribution.

> Kieth said: That's why sample size (panel size in this context) is so important, and why the weight given to one or two individuals' performances, in any test, must be limited.
>
> I took that into consideration when I claimed the test results were inconclusive.

Well, again, I think you're missing the common usage of "conclusive" relative to statistical analysis. As stated previously, there are only two outcomes of statistical analysis (i.e. typical ANOVA): reject or accept the null hypothesis. Either is conclusive, as this one apparently was. The data don't allow you to reject the null hypothesis at even the 0.1 level. So, you cannot say a difference was shown. You can't say there was no difference either. This appears to be the genesis of your "inconclusive" appellation. But this is an incorrect interpretation, as detailed above.

> That is why I suggested that the thing that should have been done was further testing of those individuals and the equipment that scored near or beyond what the bell curve predicts. That is why I don't run around claiming this test proves that some people hear differences. Let's also not forget that we have no tests on listener or system sensitivity to subtle audible differences.

Actually, we *assume* variation in listener abilities and responses. Without such, there would be no bell curve, and statistical analysis would be impossible. The *only* requirement is that the test population is sufficiently large such that they approximate the population as a whole, relative to response to the stimuli under test. Again, the smaller the test population, the tighter the CI *must* be, due to the lower confidence in the test population having the same distribution as the total population.

> So we have unknown variables.

Yes, always. Given appropriate controls and statistics, the effects of most are accounted for in the analysis.

> Further, the use of many amps in no obvious pattern for comparisons introduces another variable that could profoundly affect the bell curve in unknown ways if some amps sound like each other and some don't.

Well, not having the article, I can't comment one way or another.

> All of this leaves the interpretation of those results wide open.

Well, no, as stated previously. It may, however, leave the *question* of audibility open to further study.

snip

> You should read more scientific literature then. It is the most commonly used confidence level IME. Especially with limited population size.
>
> I will look into it, but I will be quite surprised if that number does not heavily depend on the sample sizes. It has to vary with sample size.

Not usually. It varies with the criticality of the results. The confidence one has in extrapolating the results to the general population, irrespective of CI, increases as a function of sample size.
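The per-listener sample-size point can be made concrete with a small Python sketch. The 60% "true" hit rate below is an assumed number chosen purely for illustration; nothing in the article or this thread states any listener's real sensitivity.

    from math import comb

    def chance_of_at_least(hits, trials):
        """P(at least `hits` correct out of `trials` by 50/50 guessing)."""
        return sum(comb(trials, k) for k in range(hits, trials + 1)) / 2**trials

    # Illustration: a listener who truly hears a difference on 60% of trials.
    # How many trials before a typical 60% score rejects guessing at p < 0.05?
    true_rate = 0.60
    for n in (16, 48, 100, 200):
        typical_hits = round(true_rate * n)
        p = chance_of_at_least(typical_hits, n)
        verdict = "significant" if p < 0.05 else "not significant"
        print(f"{typical_hits:3d}/{n:3d} correct: p = {p:.3f} ({verdict})")
    # With few trials, even a genuinely sensitive listener is indistinguishable
    # from guessing; the same listener shows up clearly once enough trials are run.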
snip

> Nonsense, of course they do. The fact that there are two-tailed bell curves, in most population responses, is the genesis for use of high confidence intervals. Because there *are* tails - way out on the edges of the population - allowance for these tails must be made when comparing populations for significant heterogeneity.
>
> Maybe you didn't get what I was trying to say. What does and does not fall within the predictions of a bell curve depends heavily on the number of samples.

You don't seem to understand what a Bell curve is. ALL data is in the curve, so I don't know what you're trying to say by "does and does not fall within the predictions of a bell curve". Bell curves don't predict anything, they merely incorporate all data points, and the shape is the result of a two-tailed normal distribution. You can, with a normal population (your Bell curve), predict the % of the population that fits within any X standard deviations from the mean. That's what we're doing with CIs: we're *assuming* a normal distribution - a requirement for statistics to work (unfortunately).

snip

> So you are drawing definitive conclusions from one test without even knowing the sample size?

Only from the data you presented (as I've said, I don't have the article). You picked individual results (such as the 30/48 vs baseline of 24/48) as indicative of a significant outcome. I've shown you, through the most common statistical tool used, that the 30/48 and 24/48 results are not significantly different. Based on the data you provided, the conclusion of not rejecting the null hypothesis seems perfectly sound.

> I think you are leaping without looking. You are entitled to your opinions. I don't find your arguments convincing so far.

Maybe you need a better understanding of statistical analysis, and its limitations.

> Where do I claim definitive results are required in support of some postulate?

Everywhere, as far as I can tell.

> I would say definitive results are required for one to make claims of definitive results.

The definitive result of failing to reject the null hypothesis, to a chosen confidence level, requires no more data than was apparently presented in the article. You seem to be confusing "no difference was shown at the x.xx level" with "there are no differences". The latter can never be said based on failure to reject the null.

> Certainly such a test, as a part of a body of evidence, could be seen as supportive, but even that I think is dodgy, given that some of the results, as far as I can see, would call for further testing, and the absence of listener sensitivity testing alone makes it impossible to make any specific claims about the results.

As long as the panel sensitivity is representative of the general population, sensitivity is irrelevant. No matter the test protocol, one *must* assume the sensitivities are representative, unless the sensitivity of the whole population is known, and a panel is then chosen that represents a normal distribution congruent with the overall population. And this never happens.

> I never said the test was not valid due to its failure to reject the null. I simply said I think the results are such that they call for further investigation and are on the border between a null and a positive. Not the sort of thing one can base definitive conclusions on.

It is clearly definitive for the panel, under the conditions of the test.

> It's instructive to also note that peer reviewed journals publish, not infrequently, two or more studies that contradict one another. You seem to believe this can't be the case, because the "wrong" ones would be sent back for "correction".
>
> Nonsense. What I think will be sent back is a poor analysis.

Which you've yet to illustrate in this case.

> If the analysis fits the data, it won't be sent back. If someone is making claims of definitive conclusions based on this test, one is jumping the gun. Scientific research papers with huge sample sizes and apparently definitive results are usually carefully worded to not make definitive claims. That is a part of proper scientific prudence.

True. Does this article say "there are no sonic differences between amps", or does it say (paraphrasing, of course) "differences between amps couldn't be demonstrated"? I understood it to be the latter.

> Kieht said: snip
>
> Kieth said: snip
>
> I never suggested the data should be ignored or were valueless, only that they were inconclusive. I think the many uncontrolled variables in this specific test are problematic. Don't you?

I don't have the article, so I don't know how many variables were uncontrolled. The sensitivity issue is, I believe, a red herring however. BTW, if you're going to repeatedly use my name in replying, common courtesy would be to spell it right - at least once.

Keith Hughes