Bruce Abrams
 
A comparative versus evaluative,

Here's the real question, as I see it. Let's say that two CD players, A and
B, are being evaluated and, during sighted tests, evaluative or comparative,
a subject states a preference for unit A, yet when blinded is unable to
duplicate the sighted result. What conclusion will be drawn? My suspicion
is that subjectivist and objectivist alike will rub their hands with glee,
shouting, "See...just like I told you."

"Harry Lavo" wrote in message
news:wrDVb.249701$I06.2756526@attbi_s01...
Hi RAHE'rs -

I've had many inquiries and some interest in my proposal that before
comparative dbt'ng is crowned "the" test for audio evaluation, it needs to
be validated by a control test. While I have sketched such a test in
several different posts/threads, there seems to be enough confusion over
what I have said that it is worth outlining here in a definitive post on
the subject.

In addition, at the end I will respond to Tom's offer to join together in
such a test.

WHAT IS THE ISSUE?

As I have analyzed my own and others' arguments here for and against
comparative dbt'ng, it seems to me that the issue has much less to do with
being blind than it does with being comparative. In other words, does a
test "forcing" a choice under uncertainty duplicate the results that would
be obtained by listening to and evaluating components at home in a relaxed
atmosphere, whether blind or sighted? I have accordingly proposed that the
only way to validate the comparative dbt as the definitive tool is to
remove this question mark. And it could be done, with enough time and
resources devoted to it.

As such, the control test must separate out and test two variables:

* evaluative (blind) vs. comparative (blind) ... a test of evaluative
testing versus comparative testing
* evaluative (blind) vs. evaluative (sighted) ... a test of blind vs.
sighted testing

With the answers to these two comparisons, it should be possible to answer
the following questions:

* Does blinding give better bias control? (presumably yes)
* How close can open-ended, relaxed, sighted evaluative testing (the
traditional home "sighted" tests which are believed worthless by the
objectivists) come to duplicating the results of open-ended, relaxed, but
blinded evaluative testing? Same test technique, but blinded, which
objectivists presumably would support.
* Do traditional comparative dbt tests give identical results to more
relaxed and evaluative dbt tests? (Answer simply not known, but postulated
by subjectivists as "no", thinking that the test itself is different enough
to get in the way.)

Essentially, the blinded (dbt), relaxed, evaluative test is "the missing
link" between the current dbt camp and the current subjectivist camp, as it
helps resolve both the "blind" issue and the "comparative vs. evaluative"
issue, using components playing music, not artifacts or pink noise.

GENERAL TEST CONDITIONS
* Participants must take part in all three tests...open-end sighted,
open-end blind, and comparative blind.
* There have to be enough trials of each type to allow statistical
evaluation (a rough sample-size sketch follows this list).
* Musical selections and media must be agreed to in advance by all parties
as being sufficiently varied to reveal all types of significant audio
reproduction qualities. (Dynamic range, soundstaging, depth,
dimensionality, bass quality, treble quality, midrange quality, etc.)
* Equipment under test must be believed by most participants to sound
different from one another under sighted conditions, while attracting some
degree of objectivist skepticism about same.
* Equipment under test, everything else being equal, should make testing
under home/similar to home conditions as simple as possible, including
time-synched switching.
* Tests must either be done in the homes of participants, or at a site
accessible to participants over long periods of time on a sighted basis
before test ratings are collected.
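
As a rough illustration of what "enough trials" means for the comparative
(forced-choice) part of the program, here is a small Python sketch
(assuming scipy is available; the trial counts are illustrative, not part
of the protocol) of how many correct answers out of n trials are needed
before pure guessing can be rejected at the 5% level:

# Sketch only: minimum correct answers for a forced-choice comparative
# test to reach significance under the null hypothesis of guessing.
from scipy.stats import binom

def correct_needed(n_trials, alpha=0.05):
    """Smallest k such that P(X >= k | guessing) < alpha."""
    for k in range(n_trials + 1):
        p_value = binom.sf(k - 1, n_trials, 0.5)   # P(X >= k)
        if p_value < alpha:
            return k, p_value
    return None, None   # too few trials to ever reach significance

for n in (10, 16, 20, 25):
    k, p = correct_needed(n)
    print(f"{n:2d} trials: need {k} correct (p = {p:.3f})")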

EVALUATIVE TEST CONDITIONS
* Open-ended home listening must supplant informal note taking with formal
rating of components on an evaluative scale, in order to be able to
correlate statistically with blind evaluative testing.
* Evaluative scale should draw from and reflect all significant variables
suggested by RAHE participants, reduced to a manageable number by
consolidating very similar qualities.

COMPARATIVE TEST CONDITIONS
* Test should be a-b, rather than a-b-x, in order to better approximate the
evaluative tests.
* Test should ask for overall preference and preference on comparative
versions of the evaluative scales (at least those found significantly
different in the evaluative testing).

BLIND TEST CONDITIONS
* Participants should be allowed substantial "warm up" time on a sighted
basis to listen to the test equipment using the musical selections to be
used in the test.
* Participants should be allowed to control the switching during the test.
* Participants should ideally be left alone in the room during the test,
and should "turn in" ratings to an out-of-room proctor who has also
recorded the actual a-b assignment for each trial.
* a-b assignments shall be based on random drawings and then adjusted
slightly, if needed, to assure equal positioning and no chance of order
bias (a small sketch of this follows the list).
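
For what it's worth, that last point is easy to mechanize. A minimal
Python sketch (my own illustration, not a prescribed procedure) that draws
the a-b assignments at random while keeping A-first and B-first trials
balanced:

# Sketch only: random a-b assignments with equal positioning, so neither
# unit systematically leads off and order bias has no chance to creep in.
import random

def make_assignments(n_trials=16, seed=None):
    rng = random.Random(seed)
    half = n_trials // 2
    orders = ["AB"] * half + ["BA"] * (n_trials - half)   # equal counts
    rng.shuffle(orders)                                    # random sequence
    return orders

# The out-of-room proctor keeps this list; the listener sees only
# "first" and "second".
print(make_assignments(16, seed=1))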

* * * * * * * * * * * * * * * *
With those general conditions established, I would like to discuss actual
test implementation practicalities. This is where it gets complicated.

THE OPEN-ENDED SIGHTED EVALUATIVE TEST
Essentially, as I described in an earlier post, the typical audiophile puts
a new piece of equipment in the system, listens open-ended for a while,
switches back, does the same, and by doing this a few times over several
selections of music begins to home in on what characteristics the new
equipment has in his system versus the old. These may be improvements; they
may be deficiencies. He continues to do this until a) he has to return the
equipment, or b) he reaches a definitive preference for one or the other (a
preference growing organically out of the evaluation and the emergence of
defining audio characteristics).

How best to approximate this test on a slightly more structured basis, so
that results may be compared to later tests?

The first and probably only thing required, it seems to me, is to
substitute formal evaluation rating scales for the informal notes done
during this process. My suggestion is that the evaluator would have perhaps
half-a-dozen interim rating sheets that he/she would use, let's say, for
six weeks. Then at the end, he/she would review those sheets and put
together a "final" rating for the two pieces of equipment. These would be
on an absolute scale for the two pieces. For example, both might be rated
high on "throws a wide soundstage beyond the outside edge of the speakers".
One would be rated "5" and the other "4" on a "1" to "5" scale. So this
score can be used both as a numeric rating and as a comparative rating,
e.g., both same, or one higher (different, higher) on that characteristic.
There would also be a similar rating, "preference overall", that might be
"4" and "3" (different, better). Or perhaps "4" and "4" (no preference).
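
To make the scoring concrete, here is a small Python sketch (the
characteristics and numbers are invented for illustration) of how a pair
of absolute 1-to-5 ratings can be read off as a comparative outcome:

# Sketch only: final 1-5 ratings for units A and B on each characteristic,
# interpreted both as absolute scores and as "same" / "different, higher".
final_ratings = {
    "wide soundstage":    {"A": 5, "B": 4},
    "bass quality":       {"A": 4, "B": 4},
    "preference overall": {"A": 4, "B": 3},
}

def comparative(scores):
    a, b = scores["A"], scores["B"]
    if a == b:
        return "same"
    return "different, A higher" if a > b else "different, B higher"

for characteristic, scores in final_ratings.items():
    print(f"{characteristic}: A={scores['A']} B={scores['B']} -> "
          f"{comparative(scores)}")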

However, one can immediately see one problem. With a sighted test, there is
no such thing as doing 16 independent trials, since presumably once the
person "locks in", his future ratings would be very similar because he
knows which equipment is which. Even allowing for differences in moods,
climates, etc., these would not be sixteen independent tests.

The implications of this are that for the "relaxed, evaluative, sighted"
versus "relaxed, evaluative, blind" tests, more than one person must be
tested...probably at least twenty. In the food industry, 100 was the
smallest test size we considered reasonable. This adds enormously to the
cost, time, and complexity of running such a test if one is to do it
in-home.

It would be a little more manageable doing it out-of-home at a central
facility, and having sixteen audiophiles do it. But this is fraught with
problems...an unfamiliar system probably requiring more time to reach a
final evaluation for each respondent, the need to maintain the setup for
several weeks to allow all respondents to have multiple exposures before
doing so, etc.

Problems, problems.

THE OPEN-ENDED, BLIND EVALUATIVE TEST
This test would be very similar to the open-ended sighted test, but
double-blind. Once a warm-up period of perhaps a few hours was over,
however, the respondent would take a trial, rate it, turn it in, take a
break, start another trial, etc., up to four in a row. If repeated four
days or four weeks in a row, this could result in sixteen trials, enough to
determine the significance of differences in ratings. The ratings would be
the same as those used in the sighted testing. The results of this test
would be: were differences between the equipment found, and were they
statistically significant at the 95% confidence level? What
characteristics, if any, came through as significantly different?
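
One way to run that significance check, sketched in Python with invented
ratings (scipy assumed; other statistical tests could serve equally well),
is a two-sided Mann-Whitney U test on the 1-5 scores a listener gave on a
single characteristic across the A trials versus the B trials:

# Sketch only: do the blind ratings of unit A differ from those of unit B
# at the 95% confidence level?  Ratings below are invented.
from scipy.stats import mannwhitneyu

ratings_a = [5, 4, 5, 4, 5, 5, 4, 5]   # trials in which unit A was playing
ratings_b = [4, 3, 4, 4, 3, 4, 4, 3]   # trials in which unit B was playing

stat, p_value = mannwhitneyu(ratings_a, ratings_b, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.3f}")
print("significant at the 95% level" if p_value < 0.05
      else "no significant difference")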

Once a respondent's results were determined (different, same) overall and
for each characteristic, they could be compared to the open-ended scores
and a correlation established (or not). Since the open-ended sighted test
only had one score, it would be hard to evaluate the significance of these
correlations for an individual person, but if done across 20-100 people, a
statistical correlation could be established. For this to be a true
"scientific" test, it would have to be done across a substantial population
of audiophiles, as has already been pointed out.
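
A sketch of that correlation step in Python (scipy assumed; every number
below is invented, and a real test would need the 20-100 listeners just
mentioned): each person contributes one sighted difference score (rating of
A minus rating of B) and one blind difference score, and a rank correlation
across people asks whether the two agree:

# Sketch only: do sighted and blind results agree across listeners?
from scipy.stats import spearmanr

sighted_diff = [1, 0, 2, 1, -1, 0, 1, 2, 0, 1]   # one value per listener
blind_diff = [0.5, 0.1, 1.2, 0.4, -0.3, 0.2, 0.8, 1.0, -0.1, 0.6]

rho, p_value = spearmanr(sighted_diff, blind_diff)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")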

THE COMPARATIVE, BLIND TEST
The main blind (a-b) test would use the evaluative factors of the sighted
and blind evaluative tests, but on a comparative basis (e.g., which did you
prefer overall, which had the widest soundstage, etc.). The comparative
evaluation test could be directly correlated with the blind evaluative
test, as well as within itself over sixteen trials. Again, these probably
should be done in groups of four, since they require a fair number of
ratings.

Not essential, but of possible interest, would be to do a traditional a-b-x
test as well, to see if it correlated with the overall preference a-b test
(% of respondents noting a difference in each / statistical significance of
same).
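
If the a-b-x leg were added, the usual analysis is a one-sided binomial
test on the number of correct identifications of "x"; here is a Python
sketch with an invented result (scipy assumed):

# Sketch only: can the listener identify "x" more often than the 50%
# expected from pure guessing?
from scipy.stats import binomtest

n_trials = 16
n_correct = 12   # hypothetical outcome

result = binomtest(n_correct, n_trials, p=0.5, alternative="greater")
print(f"{n_correct}/{n_trials} correct, p = {result.pvalue:.3f}")
print("difference detected at the 95% level" if result.pvalue < 0.05
      else "consistent with guessing")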

* * * * * * * * * * * * * * * *

IMPLICATIONS

As noted, to truly be significant, this test has to be done across a sample
of audiophiles, probably at least two dozen in-home evaluations and
subsequent test follow-ups. This would keep Tom and me busy for a year.

From a practical standpoint, the blind comparative vs. blind evaluative
tests are easier to do, since multiple trials allow for internal
statistical validity. I would be willing to develop and be the initial
testee of such a test along with Tom, whom I would also ask to do the same,
and perhaps a "neutral" third party.
I would also do the sighted test, but the results would be strictly
"anecdotal" until an appropriate database of RAHE participants was built
up, and would request that Tom and the "neutral" do the same.

I would also suggest that a good and most interesting vehicle for this test
would be a SACD player using the stereo-mix SACD and CD layers, on disks
and tracks judged appropriate and "identical" in mix. The test would be
easy to run...two identical side-by-side SACD players into a preamp input,
with control-box switching or manual switching, automatically volume
matched, no impedance problems a la speaker cables, and perhaps some
ultimate insight into "is there a difference between SACD and CD". I have a
SACD player; Tom would have to buy one or borrow one; same for the neutral
third party.

If SACD is judged impractical, then I would suggest a CD test between two
CD players judged to be likely audibly different...say an Arcam 27 versus a
Sony $300 job. However, the equipment would have to be on long-term loan,
since it would probably take at least six months to complete the testing.

We would also need neutral proctors to run the test and record scores.

* * * * * * * * * * * * * * * *

CONCLUSION
There would be a fair amount of work needed to get this off the ground, but
it is doable. In particular, I would want broad agreement within RAHE that
it was worthwhile doing, I would want input from members on appropriate
test SACDs or CDs and tracks for testing, and I would want myself, Tom, and
the other participant to agree on the selections to be used.

Your comments and suggestions and questions are hereby solicited.

Harry Lavo
"it don't mean a thing if it ain't got that swing" - Duke Ellington