#1
Harry Lavo
A comparative versus evaluative, double-blind vs. sighted control test

Hi RAHE'rs -

I've had many inquiries and some interest in my proposal that before
comparative dbt'ng is crowned "the" test for audio evaluation, it needs to
be validated by a control test. While I have sketched such a test in
several different posts/threads, there seems to be enough confusion over
what I have said that it is worth outlining here in a definitive post on the
subject.

In addition, at the end I will respond to Tom's offer to join together in
such a test.

WHAT IS THE ISSUE?

As I have analyzed my own and others' arguments here for and against
comparative dbt'ng, it seems to me that the issue has much less to do with
being blind than it does with being comparative. In other words, does a
test "forcing" a choice under uncertainty duplicate the results that would
be obtained by listening and evaluating components at home in a relaxed
atmosphere, whether blind or sighted? I have accordingly proposed that the
only way to validate the comparative dbt as the definitive tool is to remove
this question mark. And it could be done, with enough time and resources
devoted to it.

As such, the control test must separate out and test two variables -

* evaluative (blind) vs. comparative (blind) ... a test of evaluative
testing versus comparative testing
* evaluative (blind) vs. evaluative (sighted) ... a test of blind vs.
sighted testing

With the answers to these two comparisons, it should be possible to answer the
following questions:

* Does blinding give better bias control? (presumably yes)
* How close can open-ended, relaxed, sighted evaluative testing (the
traditional home "sighted" tests which are believed worthless by the
objectivists) come to duplicating the results of open-ended, relaxed, but
blinded evaluative testing? Same test technique, but blinded, which
objectivists presumably would support.
* Do traditional comparative dbt tests give identical results to more
relaxed and evaluative dbt tests? (answer simply not known, but postulated
by subjectivists as "no", thinking that the test itself is different enough
to get in the way).

Essentially, the blinded (dbt), relaxed, evaluative test is "the missing
link" between the current dbt camp and the current subjectivist camp as it
helps resolve both the "blind" issue and the "comparative vs. evaluative"
issue. It would use components playing music, not artifacts or pink noise.

GENERAL TEST CONDITIONS
* Participants must take part in all three tests...open-ended sighted,
open-ended blind, and comparative blind.
* There have to be enough trials of each type to allow statistical
evaluation.
* Musical selections and media must be agreed to in advance by all parties
as being sufficiently varied to reveal all types of significant audio
reproduction qualities. (Dynamic range, soundstaging, depth,
dimensionality, bass quality, treble quality, midrange quality, etc.)
* Equipment under test must be believed by most participants to sound
different from one another under sighted conditions, while drawing some
degree of objectivist skepticism about same.
* Equipment under test, everything else being equal, should make testing
under home/similar to home conditions as simple as possible, including
time-synched switching.
* Tests must either be done in the homes of participants, or at a site accessible
to participants over long periods of time on a sighted basis before test
ratings are collected.

EVALUATIVE TEST CONDITIONS
* Open-ended home listening must supplant informal note taking with formal
rating of components on an evaluative scale, in order to be able to
statistically correlate with blind evaluative testing.
* Evaluative scale should draw from and reflect all significant variables
suggested by RAHE participants, reduced to a manageable number by
consolidating very similar qualities.

COMPARATIVE TEST CONDITIONS
* Test should be a-b, rather than a-b-x, in order to better approximate the
evaluative tests.
* Test should ask for overall preference and preference on comparative
version of evaluative scales (at least those found significantly different
in the evaluative testing.)

BLIND TEST CONDITIONS
* Participants should be allowed substantial "warm up" time on sighted basis
to listen to the test equipment using the musical selections to be used in
the test.
* Participants should be allowed to control switching of test.
* Participants should ideally be left alone in the room during the test, and
should "turn in" ratings to an out-of-room proctor who has also recorded the
actual a-b assignment for each trial.
* a-b assignments shall be based on random drawings and then adjusted slightly,
if needed, to assure equal positioning and no chance of order bias (a sketch of
one way to do this follows).
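To make that randomization concrete, here is a minimal sketch in Python; the
function name, the 16-trial count, and the seed are my own illustrative
assumptions, not requirements of the proposal. It draws a presentation order in
which "A first" and "B first" each occur equally often, then shuffles it.

    import random

    def balanced_ab_order(n_trials=16, seed=None):
        # Half the trials present A first, half present B first,
        # then the order is shuffled so no pattern is predictable.
        assert n_trials % 2 == 0, "need an even trial count to balance order"
        rng = random.Random(seed)
        order = ["AB"] * (n_trials // 2) + ["BA"] * (n_trials // 2)
        rng.shuffle(order)
        return order

    print(balanced_ab_order(16, seed=1))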

* * * * * * * * * * * * * * * *

With those general conditions established, I would like to discuss actual
test implementation practicalities. This is where it gets complicated.

THE OPEN-ENDED SIGHTED EVALUATIVE TEST
Essentially, as I described in an earlier post, the typical audiophile puts
a new piece of equipment in the system, listens open-ended for a while,
switches back, does the same, and by doing this a few times over several
selections of music begins to hone in on what characteristics the new
equipment has in his system versus the old. These may be improvements; they
may be deficiencies. He continues to do this until a) he has to return the
equipment, or b) he reaches a definitive preference for one or the other (a
preference growing organically out of the evaluation and the emergence of
defining audio characteristics).

How to best approximate this test on a slightly more structured basis, so
that results may be compared to later tests?

The first and probably only thing required, it seems to me, is to substitute
formal evaluation rating scales for the informal notes done during this
process. My suggestion is that the evaluator would have perhaps
half-a-dozen interim rating sheets that he/she would use, let's say for six
weeks. Then at the end, he/she would review those sheets and put together a
"final" rating for the two pieces of equipment. These would be on an
absolute scale for the two pieces. For example, both might be rated high on
"throw a wide soundstage beyond the outside edge of the speakers". One
would be rated "5" and the other "4" on a "1" to "5" scale. So this score
can be used both as a numeric rating and as a comparative rating, e.g. both
same, one higher (different, higher) on that characteristic. There would
also be a similar rating "preference overall" that might be "4" and "3"
(different, better). Or perhaps "4" and "4" (no preference).
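A minimal sketch of that scoring rule (Python; the function and variable names
are my own illustrations, not part of the rating sheets themselves):

    def comparative_verdict(rating_a, rating_b):
        # Collapse two absolute 1-5 ratings into the comparative reading
        # described above: "same", or "different" with the higher unit noted.
        if rating_a == rating_b:
            return "same"
        winner = "A" if rating_a > rating_b else "B"
        return "different, " + winner + " higher"

    print(comparative_verdict(5, 4))   # different, A higher
    print(comparative_verdict(4, 4))   # same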

However, one can immediately see one problem. With a sighted test, there is
no such thing as doing 16 independent trials, since presumably once the
person "locks in", his future ratings would be very similar because he knows
which equipment is which. Even allowing for differences in moods, climates,
etc., these would not be sixteen independent tests.

The implications of this are that for the "relaxed, evaluative, sighted"
versus "relaxed, evaluative, blind" tests, more than one person must be
tested...probably at least twenty. In the food industry, we used to
consider 100 the smallest reasonable test size. This adds
enormously to the cost, time, and complexity of running such a test if one
is to do it in-home.

It would be a little more manageable doing it out-of-home at a central
facility, and having sixteen audiophiles do it. But this is fraught with
problems...an unfamiliar system probably requiring more time to reach a
final evaluation for each respondent, the need to maintain the setup for
several weeks to allow all respondents to have multiple exposures before
doing so, etc.

Problems, problems.

THE OPEN-ENDED, BLIND EVALUATIVE TEST
This test would be very similar to the open-ended sighted test, but
double-blind. Once a warm-up period of perhaps a few hours was over,
however, the respondent would take a trial, rate, turn it in, take a break,
start another trial, etc., up to four in a row. If repeated four days or
four weeks in a row, this could result in sixteen trials, enough to
determine the significance of differences in ratings. The ratings would be the
same as those used in the sighted testing. The results of this test would be: were
differences between the equipment found, and were they statistically
significant at the 95% confidence level? What characteristics, if any, came
through as significantly different?
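As one illustration of the kind of significance check that could be applied (a
sketch only; the proposal does not fix a particular statistic, and the sample
numbers below are invented), a simple two-sided sign test on the sixteen paired
A-minus-B ratings for one characteristic:

    from math import comb

    def sign_test_p(diffs):
        # Two-sided sign test: how surprising is the observed imbalance of
        # positive vs. negative rating differences if there is no real difference?
        nonzero = [d for d in diffs if d != 0]
        n = len(nonzero)
        k = sum(1 for d in nonzero if d > 0)      # trials where A was rated higher
        tail = min(k, n - k)
        p_one_sided = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
        return min(1.0, 2 * p_one_sided)

    diffs = [1, 1, 0, 2, 1, 1, 0, 1, 1, 2, 1, 0, 1, 1, 1, 1]   # invented data
    print("different at the 95% confidence level?", sign_test_p(diffs) < 0.05)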

Once a respondent's results were determined (different, same) overall and for
each characteristic, they could be compared to the open-ended sighted scores and a
correlation established (or not). Since the open-ended sighted test only
had one score, it would be hard to evaluate significance for an individual
person on these correlations, but if done across 20-100 people, a
statistical correlation could be established. For this to be a true
"scientific" test, it would have to be done across a substantial population
of audiophiles, as has already been pointed out.
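A sketch of that cross-respondent correlation step (Python; the listener scores
below are invented purely to show the mechanics):

    def pearson(xs, ys):
        # Plain Pearson correlation between two equal-length score lists.
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    sighted = [5, 4, 4, 3, 5, 4, 2, 4, 3, 5]   # one final sighted rating per person
    blind = [4, 4, 3, 3, 5, 3, 2, 4, 3, 4]     # mean blind rating per person
    print("r =", round(pearson(sighted, blind), 3))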

THE COMPARATIVE, BLIND TEST
The main blind (a-b) test would use the evaluative factors of the sighted
and blind evaluative tests, but on a comparative basis (e.g. which did you
prefer overall, which had the widest soundstage, etc.). The comparative
evaluation test could be directly correlated with the blind evaluative test,
as well as within itself over sixteen trials. Again, these probably should
be done in groups of four since they require a fair number of ratings.

Not essential, but of possible interest, would be to do a traditional a-b-x
test as well, to see if it correlated with the overall preference a-b test
(% of respondents noting a difference in each/statistical significance of
same).
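For the optional a-b-x leg, the usual analysis is a one-sided binomial test on
the number of correct X identifications; a minimal sketch follows (the
13-of-16 figure is an invented example, not a prediction):

    from math import comb

    def abx_p_value(correct, trials=16):
        # Probability of doing at least this well by guessing (p = 0.5 per trial).
        return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

    print(abx_p_value(13, 16))   # about 0.011, i.e. significant at the 95% level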

* * * * * * * * * * * * * * * *

IMPLICATIONS

As noted, to truly be significant, this test has to be done across a sample
of audiophiles, probably at least two dozen in-home evaluations and
subsequent test follow-ups. This would keep Tom and me busy for a year.

From a practical standpoint, the blind comparative vs. blind evaluative
tests are easier to do, since multiple trials allow for internal
statistical validity. I would be willing to develop such a test and be its
initial testee, along with Tom, whom I would also ask to do the same, and
perhaps a "neutral" third party.

I would also do the sighted test, but the results would be strictly
"anecdotal" until an appropriate database of RAHE participants was built up,
and would request that Tom and the "neutral" do the same.

I would also suggest that a good and most interesting vehicle for this test
would be a SACD player using stereo mix SACD and CD layer, on disks and
tracks judged appropriate and "identical" in mix. The test would be easy to
run...two identical side-by-side SACD players into a preamp input, with
control box switching or manual switching, automatically volume matched, no
impedance problems a la speaker cables, and perhaps some ultimate insight
into "is there a difference in SACD vs CD". I have a SACD player; Tom
would have to buy one or borrow one; same for neutral third party.

If SACD is judged impractical, then I would suggest a CD test between two CD
players judged to be likely audibly different...say an Arcam 27 versus a
Sony $300 job. However, the equipment would have to be on long term loan,
since it would take probably at least six months to complete the testing.

We would also need neutral proctors to run the test and record scores.

* * * * * * * * * * * * * * * *

CONCLUSION
There would be a fair amount of work needed to get this off the ground, but
it is doable. In particular, I would want broad agreement within RAHE that
it was worthwhile doing, and I would want input from members on appropriate
test SACDs or CDs and tracks for testing, and I would want myself, Tom, and
the other participant to agree on the selections to be used.

Your comments and suggestions and questions are hereby solicited.

Harry Lavo
"it don't mean a thing if it ain't got that swing" - Duke Ellington

#2
chung
A comparative versus evaluative, double-blind vs. sighted control test

Harry Lavo wrote:

Hi RAHE'rs -

I've had many inquires and some interest in my proposal that before
comparative dbt'ng is crowned "the" test for audio evaluation, it needs to
be validated by a control test. While I have sketched such a test in
several different posts/threads, there seems to be enough confusion over
what I have said that it is worth outlining here in a definitive post on the
subject.

In addition, at the end I will respond to Tom's offer to join together in
such a test.

WHAT IS THE ISSUE?

As I have analyzed my own and others arguments here for and against
comparative dbt'ng, it seems to me that the issue has much less to be with
being blind than it does with being comparative. In other words, does a
test "forcing" a choice under uncertainty duplicate the results that would
be obtained by listening and evaluating components at home in a relaxed
atmosphere, whether blind or sighted.


Why would DBT duplicate the results of sighted tests?

I have accordingly proposed that the
only way to validate the comparative dbt as the definitive tool is to remove
this question mark.


Sorry, I don't see the question mark at all.

And it could be done, with enough time and resources
devoted to it.

As such, the control test must separate out and test two variables -

* evaluative (blind) vs. comparative (blind) ,,, a test of evaluative
testing versus comparative testing


You evaluate and compare. They are not mutually exclusive.

* evaluative (blind) vs. evaluative (sighted) ,,, a test of blind vs.
sighted testing

With the answers to these two comparisons, it should be able to answer the
following questions?

* Does blinding give better bias control? (presumably yes)


And you still think that it has not been answered?

* How close can open-ended, relaxed, sighted evaluative testing (the
traditional home "sighted" tests which are believed worthless by the
objectivists) come to duplicating the results of open-ended, relaxed, but
blinded evaluative testing.


Not a question worth answering. We all know that sighted and blind can
give different results. How close is irrelevant.

Same test technique, but blinded, which
objectivist presumably would support.
* Do traditional comparative dbt tests give identical results to more
relaxed and evaluative dbt tests? (answer simply not known, but postulated
by subjectivists as "no", thinking that the test itself is different enough
to get in the way).


Not a question worth answering since the comparative test, as you put
it, can be as relaxed and evaluative as you make it.


Essentially, the blinded (dbt), relaxed, evaluative test is "the missing
link" between the current dbt camp and the current subjectivist camp as it
helps resolve both the "blind" issue and the "comparative vs. evaluative"
issue.


Big OSAF. Prove that first. Others claim that DBT's don't work because
the snippets are too short, the snippets are too long, the switching is
too quick, the switching is too slow, the system does not have enough
resolution, etc., etc. Your position as stated is not shared by the
majority of DBT opponents. Even if you remove your concerns, others will
have a different set of objections.

snip

CONCLUSION
There would be a fair amount of work needed to get this off the ground, but
it is doable. In particular, I would want broad agreement within RAHE that
it was worthwhile doing, and I would want input from members of appropriate
test SACDS or CDs and tracks for testing, and I would want myself, Tom, and
the other participant to agree on the selections to be used.

Your comments and suggestions and questions are hereby solicited.


Given my comments above, I myself don't find it worth doing. Since you
seem to claim DBT's are not effective for audio, the burden of proof is
on you. In other words, I don't find it worth doing, but go ahead if you
think you can learn from doing this. I simply see no sense for me to
waste effort proving something that has been proven.



Harry Lavo
"it don't mean a thing if it ain't got that swing" - Duke Ellington

#3
A comparative versus evaluative, double-blind vs. sighted control test

There is no reason to establish what has been demonstrated in all areas of
human behavior research, including human hearing. But there is a simple,
direct way to get at the validity of the "evaluation" listening test, more
often called an audition. Using the traditional stereophile
experience of one man in a room with a notepad, a blind test can easily be
done. As notes are said to be taken each time listening is done
over a period of days/weeks, use the notes as the test data. Using the
current well-known wire and the new wire to be "auditioned", simply
randomly insert one wire or the other into the system on each day of the
"audition". If remarkable differences in the perception of the music are
reported on days when the random draw actually left the current wire in
place, or if the same remarkable perceptions are reported whichever wire is
in use; well ... If there is some reality to perception changes caused by
the new wire, it should stand out in the notes like a sore thumb, and the
same goes if the current wire produces them but the new one doesn't.
So, are any of the "traditional audition" mags up for this test, or
individuals for that matter?
#4
Bruce Abrams
A comparative versus evaluative, double-blind vs. sighted control test

Here's the real question, as I see it. Let's say that two CD players, A and
B, are being evaluated and during sighted tests, evaluative or comparative,
a subject states a preference for unit A, yet when blinded is unable to
duplicate the sighted result. What conclusion will be drawn? My suspicion
is that subjectivists and objectivists alike will rub their hands with
glee, shouting, "See...just like I told you."

"Harry Lavo" wrote in message
news:wrDVb.249701$I06.2756526@attbi_s01...
snip

#5
Nousaine
A comparative versus evaluative, double-blind vs. sighted control test

"Harry Lavo" wrote:

Hi RAHE'rs -

I've had many inquires and some interest in my proposal that before
comparative dbt'ng is crowned "the" test for audio evaluation, it needs to
be validated by a control test. While I have sketched such a test in
several different posts/threads, there seems to be enough confusion over
what I have said that it is worth outlining here in a definitive post on the
subject.

In addition, at the end I will respond to Tom's offer to join together in
such a test.

WHAT IS THE ISSUE?

As I have analyzed my own and others arguments here for and against
comparative dbt'ng, it seems to me that the issue has much less to be with
being blind than it does with being comparative. In other words, does a
test "forcing" a choice under uncertainty duplicate the results that would
be obtained by listening and evaluating components at home in a relaxed
atmosphere, whether blind or sighted. I have accordingly proposed that the
only way to validate the comparative dbt as the definitive tool is to remove
this question mark. And it could be done, with enough time and resources
devoted to it.


Others have expressed similar thoughts but it occurs to me that this isn't a
question that needs an answer unless one is concerned about "results" that are
due to sound quality alone and unfettered by other factors. It seems clear from
other research that sound quality is best evaluated with closely spaced short
program segments. So why would we care if we would get the "same" results ....
whatever 'results' would mean.

As such, the control test must separate out and test two variables -

* evaluative (blind) vs. comparative (blind) ,,, a test of evaluative
testing versus comparative testing
* evaluative (blind) vs. evaluative (sighted) ,,, a test of blind vs.
sighted testing


Evaluative vs Comparative is an artificial distinction. All 'results' of any
comparison are evaluative. If they weren't, then no one would ever need to
compare. This seems simply to be a re-statement of the short vs long argument.

With the answers to these two comparisons, it should be able to answer the
following questions?

* Does blinding give better bias control? (presumably yes)
* How close can open-ended, relaxed, sighted evaluative testing (the
traditional home "sighted" tests which are believed worthless by the
objectivists) come to duplicating the results of open-ended, relaxed, but
blinded evaluative testing.


We already know that uncontrolled tests will be clouded by non-sonic factors.
This would simply be a test of the relative effectiveness of marketing and
ergonomics.

Same test technique, but blinded, which
objectivist presumably would support.
* Do traditional comparative dbt tests give identical results to more
relaxed and evaluative dbt tests?


IME, 5 weeks shows identical results with power amplifiers.

(answer simply not known, but postulated
by subjectivists as "no", thinking that the test itself is different enough
to get in the way).


Asked and answered, so to speak. :-) But interestingly this experiment has been
done in reverse a few times. The Sunshine Trials show that even after long-term
relaxed evaluation has been conducted, subjects may be unable to "hear" amps and
wires with bias controls implemented.

You will suggest that implementation of bias controls by itself is enough
to de-sensitize listeners. I think the more rational, and obvious, answer is that
previous "results" were not acoustically caused.

Essentially, the blinded (dbt), relaxed, evaluative test is "the missing
link" between the current dbt camp and the current subjectivist camp as it
helps resolve both the "blind" issue and the "comparative vs. evaluative"
issue. Using components playing music, not artifacts or pink noise.


Yes, noise is overly-sensitive.


GENERAL TEST CONDITIONS
* Participants must take place in all three tests...open end sighted, open
end blind, and comparative blind.
* There has to be enough trials of each type to allow statistical
evaluation.
* Musical selections and media must be agreed to in advance by all parties
as being sufficiently varied to reveal all types of significant audio
reproduction qualities. (Dynamic range, soundstaging, depth,
dimensionality, bass quality, treble quality, midrange quality, etc.)
* Equipment under test must be believed by most participants to sound
different from one another under sighted conditions and to have some degree
of objectivist skepticism about same.


Good; this may limit subject choice to subjectivists if bits/wires/amps/parts
are the "components" being tested.

* Equipment under test, everything else being equal, should make testing
under home/similar to home conditions as simple as possible, including
time-synched switching.


Sounds like a job for ABX.

* Tests must either be done in-home of participants, or at a site accessible
to participants over long periods of time on a sighted basis before test
ratings collected.


This limits it to in-home as I see it.


EVALUATIVE TEST CONDITIONS
* Open-ended home listening must supplant informal note taking with formal
rating of components on evaluative scale, in order to be able to
statistically correlate with blind evaluative testing.


Of course.

* Evaluative scale should draw from and reflect all significant variables
suggested by RAHE participants, reduced to a manageable number by
consolidating very similar qualities.

COMPARATIVE TEST CONDITIONS
* Test should be a-b, rather than a-b-x, in order to better approximate the
evaluative tests
* Test should ask for overall preference and preference on comparative
version of evaluative scales (at least those found significantly different
in the evaluative testing.)


Generally speaking "preference" scores, rather than scoring of performance
variables tends to allow individual errors to cloud results. For example each
program may have specific reference characteristics that individuals may not
"like". A quallity system will return what's on the recording and not be
preferable to a given subject. It would be easy to conduct a test of subjects
and not equipment, even if we are seeking a confluence of 'results' between
alternative testing modes.

BLIND TEST CONDITIONS
* Participants should be allowed substantial "warm up" time on sighted basis
to listen to the test equipment using the musical selections to be used in
the test.
* Participants should be allowed to control switching of test.
* Participants should be left alone in room during test ideally, and should
"turn in" ratings to out of room proctor who also has recorded the actual
a-b assignment for each trial.
* a-b assignments shall be based on random drawings and then adjusted, if
needed, slightly to assure equal positioning and no chance of order bias.

* * * * * * * * * * * * * * * *

With those general conditions established, I would like to discuss actual
test implementation practicalities. This is where it gets complicated.

THE OPEN-ENDED SIGHTED EVALUATIVE TEST
Essentially, as I described in an earlier post, the typical audiophile puts
a new piece of equipment in the system, listens open-ended for awhile,
switches back, does the same, and by doing this a few times over several
selections of music begins to hone in on what characteristics the new
equipment has in his system versus the old. These may be improvements; they
may be deficiencies. He continues to do this until a) he has to return
equipment, or b)reaches a definitive preference for one or the other (a
preference growing organically out of the evaluation and the emergence of
defining audio characteristics).



How to best approximate this test on a slightly more structured basis, so
that results may be compared to later tests?

The first and probably only thing required, seems to me, is to substitute
formal evaluation rating scales for the informal notes done during this
process. My suggestion is that the evaluator would have perhaps
half-a-dozen interim rating sheets that he/she would use, lets say for six
weeks. Then at the end, he/she would review those sheets and put together a
"final" rating for the two pieces of equipment. These would be on an
absolute scale for the two pieces. For example, both might be rated high on
"throw a wide soundstage beyond the outside edge of the speakers". One
would be rated "5" and the other "4" on a "1" to "5" scale. So this score
can be used both as a numeric rating and as a comparative rating, e.g.. both
same, one higher (different, higher) on that characteristic. Their would
also be a similar rating "preference overall" that might be "4" and "3"
(different, better). Or perhaps "4" and "4" (no preference).

However, one can immediately see one problem. With a sighted test, there is
no such thing as doing 16 independent trials, since presumably once the
person "locks in" his future ratings would be very similar since he knows
which equipment is which. Even allowing for differences in moods, climates,
etc. these would not be sixteen independent tests.

The implications of this are that for the "relaxed, evaluative, sighted"
versus "relaxed, evaluative, blind" tests, more than one person must be
tested....probably at least twenty. In the food industry we used to
consider 100 as the smallest test size we considered reasonable. This adds
enormously to the cost, time, and complexity of running such a test if one
is to do it in-home.

It would be a little more manageable doing it out-of-home at a central
facility, and having sixteen audiophiles do it. But this is fraught with
problems...an unfamiliar system probably requiring more time to reach a
final evaluation for each respondent, the need to maintain the setup for
several weeks to allow all respondents to have multiple exposures before
doing so, etc.

Problems, problems.

THE OPEN-ENDED, BLIND EVALUATIVE TEST
This test would be very similar to the open-ended sighted test, but
double-blind. Once a warm up period of perhaps a few hours was over,
however, the respondent would take a trial, rate, turn in, take a break,
start another trial, etc up to four in a row. If repeated four days or
four weeks in a row, this could result in sixteen trials, enough to
determine significance of differences in ratings. The ratings would be the
same used in the sighted testing. The results of this test would be: were
differences between the equipment found, and were they statistically
significant at the 95th percentile. What characteristics, if any, came
through as significantly different.

Once a respondents results were determined (different, same) overall and for
each characteristic, they could be compared to open-ended scores and a
correlation established (or not). Since the open-ended sighted test only
had one score, it would be hard to evaluate significance for an individual
person on these correlations, but if done across 20-100 people, a
statistical correlation could be established. For this to be a true
"scientific" test, it would have to be done across a substantial population
of audiophiles as has already been pointed out.


Generally agreed; but I will point out that we have not established the
reliability of any given subject here. IOW, subjects may not give identical
ratings to the same stimulus on successive presentations. This, of course, is
the reason for a larger sample size.


THE COMPARATIVE, BLIND TEST
The main blind (a-b) test would use the evaluative factors of the sighted
and blind evaluative tests, but on a comparative basis (e.g. which did you
prefer overall, which had the widest soundstage, etc.). The comparative
evaluation test could be directly correlated with the blind evaluative test,
as well as within itself over sixteen trials. Again, these probably should
be done in groups of four since they require a fair number of ratings.

Not essential but of possible interest, would be to do a traditional a-b-x
test as well, to see if it correlated with the overall preference a-b test
(% of respondents noting a difference in each/statistical significance of
same).


You mean that if subjects were unable to identify the components being compared, then
"preference" is due to something other than the performance categories being
rated?

I think it's imperative to do this test for the components we disagree about.
Power amps or wires are the key issues.


* * * * * * * * * * * * * * * *
* * * * * * *

IMPLICATIONS

As noted, to truly be significant, this test has to be done across a sample
of audiophiles, probably at least two dozen in-home evaluations and
subsequent test follows up. This would kept Tom and I busy for a year.

From a practical standpoint, the blind comparative vs. blind evaluative
tests are easier to do, since multiple trials allows for internal
statistical validity.


Not for the sake of argument, but experiments are valid by design and reliable
by statistics. Statistical validity is a useful concept, but multiple trials
enable us to verify the reliability, not the validity, of the experiment. For
example, if I conducted an ABX test with one amplifier 10 dB louder than the
other and got statistically significant results, that's not surprising (it's
pretty sure that subjects correctly identified X reliably)...but if I said
that showed that A had a better soundstage, that would not be a valid conclusion
based on that evidence.

I would be willing to develop and be the initial
testee of such a test along with Tom, whom I would also ask to do the same,
and perhaps a "neutral" third party.
I would also do the sighted test, but the results would be strictly
"anecdotal" until an appropriate database of RAHE participants was built up,
and would request that Tom and the "neutral" do the same.

I would also suggest that a good and most interesting vehicle for this test
would be a SACD player using stereo mix SACD and CD layer, on disks and
tracks judged appropriate and "identical" in mix. The test would be easy to
run...two identical side-by-side SACD players into a preamp input, with
control box switching or manual switching, automatically volume matched, no
impedance problems a la speaker cables, and perhaps some ultimate insight
into "is there a difference in SACD vs CD". I have a SACD player; Tom
would have to buy one or borrow one; same for neutral third party.


I have a SACD player.

If SACD is judged impractical, then I would suggest a CD test between two CD
players judged to be likely audibly different...say an Arcam 27 versus a
Sony $300 job. However, the equipment would have to be on long term loan,
since it would take probably at least six months to complete the testing.


Who would supply same?

We would also need neutral proctors to run the test and record scores.

* * * * * * * * * * * * * * * *
* * * *

CONCLUSION
There would be a fair amount of work needed to get this off the ground, but
it is doable. In particular, I would want broad agreement within RAHE that
it was worthwhile doing, and I would want input from members of appropriate
test SACDS or CDs and tracks for testing, and I would want myself, Tom, and
the other participant to agree on the selections to be used.

Your comments and suggestions and questions are hereby solicited.

Harry Lavo


I like the SACD vs CD comparison.


#6
Steven Sullivan
A comparative versus evaluative, double-blind vs. sighted control test

Nousaine wrote:
I have a SACD player.


snip
I like the SACD vs CD comparison.


Rather than all this business about getting consensus on RAHE about components
and worthwhile-ness and such, why not just have Harry list some components/treatments
he *already hears differences between*, and test *those* claims in a DBT, proctored
by Tom. There's no need for an 'evaluation' step there -- Harry's already done the
'evaluation'.

--

-S.

"They've got God on their side. All we've got is science and reason."
-- Dawn Hulsey, Talent Director

#7
Bob Marcus
A comparative versus evaluative, double-blind vs. sighted control test

Harry has proposed a test that is both impossible to implement (for reasons
I've partly explained elsewhere) and meaningless (for reasons that others
have partly explained elsewhere).

The fundamental problem is that he starts from his conclusion: He assumes
that what he calls "evaluative" listening is both different from and better
than what he calls "comparative" listening. This distinction is purely
semantic. You cannot compare the sound of two components without evaluating
them, and any audiophile who has ever "evaluated" two components has used
his evaluations to compare the two.

Harry's objection to traditional DBTs comes down to the same old complaint:
They don't allow the listener time to notice subtle differences. To which I
can only give the same old answers: 1) All extant research indicates that
time is the enemy of subtle distinctions, because our memory for them is so
short; and 2) DBT protocols do not preclude a subject from taking as long as
he wants to listen and "evaluate" components before using that evaluation to
make a simple determination.

Ultimately, the idea that we need some new test to determine whether
traditional DBTs are valid is absurd. I'm not qualified to say whether they
are, but neither is Harry Lavo. The people who ARE so qualified are the
experts who study human hearing perception for a living, and they use these
tests all the time in all sorts of ways to answer all sorts of questions.

Find me one such expert on the faculty of any accredited university who does
NOT believe that traditional DBTs are a valid means of testing for audible
difference between ANY two sounds, and I will agree that we have something
to talk about. Absent that, we are left with two groups of people: Those who
accept what psychoacoustics researchers have learned about human hearing
perception, and those who do not but can offer no empirical basis for their
objections.

bob

#8
A comparative versus evaluative, double-blind vs. sighted control test

Bob Marcus wrote:


Ultimately, the idea that we need some new test to determine whether
traditional DBTs are valid is absurd. I'm not qualified to say whether they
are, but neither is Harry Lavo. The people who ARE so qualified are the
experts who study human hearing perception for a living, and they use these
tests all the time in all sorts of ways to answer all sorts of questions.


Such a 'verifying' test has not been done for a very simple reason:

The blind protocols have been SHOWN to be sensitive down to the lowest
instantaneous loudness that results in a signal at the auditory nerve.

It is a waste of time to do a 'verifying' test when a test validates itself,
based on already well known research data.
#9
Mkuller
A comparative versus evaluative, double-blind vs. sighted control test

wrote:
Such a 'verifying' test has not been done for a very simple reason:

The blind protocols have been SHOWN to be sensitive down to the lowest
instantaneous loudness that results in a signal at the auditory nerve.

It is a waste of time to do a 'verifying' test when a test validates itself,
based on already well known research data.


This sounds like the old "don't confuse me with facts, I've already made up my
mind" arguement. You guys seem positive you're right in spite of being a small
minority in the audiophile universe. Isn't there a chance you are mistaken?
Until you admit this possibility, I wouldn't expect much help in coming up with
some type of a *verification test* for dbts in audio. In which case we can
just continue the endless debate forever ("perfect DBTs forever" - apologies to
Sony).

Sure dbts have been shown to be sensitive to "the threshold of human hearing"
when the *one-dimensional* artifact being tested for is *known* and
*quantified* and the subjects are *trained* to recognize it. In an audio
component dbt, *none* of these factors is present. It is a very different type
of use for this test than is seen in published clinical research studies.

In audio, the test is *open-ended*, i.e. what the listeners are listening for
(a *multi-dimensional difference*) is *unknown*, *not quantified*, and there is
no training of the subjects, because they can't be trained to hear something
that might not be there. Music is the only meaningful program source, and is
recognized by clinical researchers to be insensitive to audible differences in
dbts.

Until there is a definitive *verification* test for dbts between audio
components using music, there is no proof that a dbt does not mask or obscure
the very audible differences you are using it to detect.
Regards,
Mike
#10
A comparative versus evaluative, double-blind vs. sighted control test

Mkuller wrote:
wrote:
Such a 'verifying' test has not been done for a very simple reason:

The blind protocols have been SHOWN to be sensitive down to the lowest
instantaneous loudness that results in a signal at the auditory nerve.

It is a waste of time to do a 'verifying' test when a test validates itself,
based on already well known research data.


This sounds like the old "don't confuse me with facts, I've already made up my
mind" arguement. You guys seem positive you're right in spite of being a small
minority in the audiophile universe. Isn't there a chance you are mistaken?


Of course the scientists could be wrong. If that possibility wasn't part
and parcel, they wouldn't be scientists. But you are the one who is
suggesting something is wrong, and therefore the onus is upon you to show
what it is. If something is wrong, then it should be possible to show it.
(Or are you suggesting that irrationality be part of scientific studies?)
So far no evidence other than personal opinions has been offered.

Small minority? You need to get out more! It was once a minority that
thought the sun revolved around the earth. Is that the kind of thing that's being
proposed?


Sure dbts have been shown to be sensitive to "the threshold of human hearing"
when the *one-dimensional* artifact being tested for is *known* and
*quantified* and the subjects are *trained* to recognize it. In an audio
component dbt, *none* of these factors is present. It is a very different type
of use for this test than is seen in published clinical research studies.


Instantaneous loudness/partial loudness IS what we hear. It IS multi-dimensional.
This is virtually axiomatic without actually being so. Please try to keep up. ;-)


In audio, the test is *open-ended*, i.e. what the listeners are listening for
(a *multi-dimensional difference*) is *unknown*, *not quantified*, and there is
no training of the subjects, because they can't be trained to hear something
that might not be there. Music is the only meaningful program source, and is
recognized by clinical researchers to be insensitive to audible differences in
dbts.


And Mike, how do you think the thresholds of audible detection of the human ear
have been established? They are pretty comprehensive and have been known for a
LONG time. The field is audiology, and the activities in that field are so far
beyond what goes on in 'high-end audio' in sophistication and comprehensiveness
as to not even deserve a comparison.

Please try to understand that the fact that music can be an insensitive
stimulus in terms of detection at the audible threshold has NOTHING to do
with the validity of DBT's. 'Differences' in MUSICAL terms are LARGE from a
scientific and analytical perspective.


Until there is a difinitive *verification* test for dbts between audio
components using music, there is no proof that a dbt does not mask or obscure
the very audible differences you are using it to detect.


You cannot rationally say that while understanding what 'partial loudness'
and/or 'instantaneous loudness' means. What is the point of irrationality in
the context of comparison of audio components other than personal
preference, which by definition involves personal factors, many of which are
non-sonic?

Frankly, I find the idea of 'verifying' highly personal subjective
impressions with a scientific test bizarre, absurd, and invasive.
I find it sad and amusing that some subjectivists want to indulge
in such an activity, which actually fulfills the definition of
scientism, in what is a hobby.

