DIGITAL WATERMARKING OF AUDIO SIGNALS USING A
PSYCHOACOUSTIC AUDITORY MODEL AND SPREAD SPECTRUM
THEORY
RICARDO A. GARCIA*, AES Member
School of Music Engineering Technology, University of Miami
Coral Gables, FL 33146, USA
* Currently with the program in Media Arts & Sciences, Machine Listening Group, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139-4307, USA.
A new algorithm for embedding a digital watermark into an audio
signal is proposed. It uses spread spectrum theory to generate a
watermark resistant to different removal attempts and a psychoacoustic
auditory model to shape and embed the watermark into the audio signal
while retaining the signal's perceptual quality. Recovery is performed
without knowledge of the original audio signal. A software system is
implemented and tested for perceptual transparency and data-recovery
performance.
0 INTRODUCTION
Every day the amount of recorded audio data and the possibilities to distribute it (e.g., over the Internet or with CD recorders) are growing. These factors can lead to an increase in the illicit recording, copying, and distribution of audio material without respect for the copyright or intellectual property of the legal owners. Another concern is the tracking of audio material over broadcast media without the use of human listeners or complicated audio recognition devices. Audio watermarking techniques promise a solution to some of these problems.
The concept of watermarking has been used for years in the fields of still
and moving
images. The basic idea of a watermark is to include a special "code" or
information within the
transmitted signal. This code should be transparent to the user
(non-perceptible) and resistant
against removal attacks of various types.
In audio signals, the desired characteristics can be translated into:
- Not perceptible (the audio information should appear "the same" to the average listener before and after the code is embedded).
- Resistant to degradation caused by analog channel transmission (e.g., TV, radio, and tape recording).
- Resistant to degradation caused by uncompressed digital media (e.g., CD, DAT, and WAV files).
- Resistant to removal through the use of sub-band coders or psychoacoustic models (e.g., MPEG, ATRAC).
The proposed algorithm generates a digital watermark (i.e. a bit stream)
that is spectrally
shaped and embedded into an audio signal. Spread spectrum theory is used in
the generation of
the watermark. The strength of coded direct-sequence/binary-phase-shift keying (DS/BPSK) is
used to create a robust watermark. The concepts are adapted to better deal
with audio signals in a
restricted audio bandwidth. A psychoacoustic auditory model is applied to
shape and embed the
watermark into the audio signal while retaining its perceptual quality for
the average listener.
A complete psychoacoustic auditory model algorithm is explained in detail.
This
information is useful for other applications involving auditory models. The
spread spectrum
encoding and decoding processes are then presented. The algorithm performs
an analysis of the
incoming signal and searches the frequency domain for "holes" in the
spectrum where the spread
spectrum data can be placed without being perceived by the listener. The
psychoacoustic
auditory model is used to find these frequency "holes."
After transmission, the receiver recovers the embedded spread spectrum
information and
decodes it in order to reconstruct the original bit stream (watermark).
There is no need for the
receiver to have access to the original audio signal.
The algorithm was implemented in a software system to create an encoder and
decoder, and
its performance was evaluated for diverse channels and audio signals. The
survival of the
watermark (number of correct bytes/second) was analyzed for different
configurations of the
encoding system. Each of these configurations was tested for transparency using an ABX listening test and over different channels (e.g., AM radio, FM stereo radio, MiniDisc, MPEG Layer 3, and D/A-A/D conversion).
1 PSYCHOACOUSTIC AUDITORY MODEL
An auditory model is an algorithm that tries to imitate the human hearing mechanism. It uses knowledge from several areas, such as biophysics and psychoacoustics.
Of the many phenomena that occur in the hearing process, the most important one for this model is "simultaneous frequency masking." The auditory model processes the audio information to produce the final masking threshold. The final masking threshold information is used to shape the generated audio watermark. This shaped watermark is ideally imperceptible to the average listener. To overcome the potential problem of the audio signal being too long to be processed all at once, and also to extract quasi-periodic sections of the waveform, the signal is segmented into short overlapping segments, processed, and added back together. Each one of these segments is called a "frame."
The steps needed to form a psychoacoustic auditory model are condensed in Figure 1. The first step is to translate the actual audio frame into the frequency domain using the Fast Fourier Transform. In the frequency domain, the power spectrum, the energy per critical band, and the spread energy per critical band are calculated to estimate the masking threshold. This masking threshold is used to shape the "noise or watermark" signal to be imperceptible (below the threshold). Finally, the frequency-domain output is translated back into the time domain, and the next frame is processed.
1.1 Short Time Fourier Transform (STFT)
The cochlea can be considered as a mechanical to electrical transducer, and
its function is to
make a time to frequency transformation of the audio signal. To be more specific, the audio information, in time, is first translated into a frequency-spatial representation inside the basilar membrane. This spatial representation is perceived by the nervous system and translated into a frequency-electrical representation.
This phenomenon is modeled using the short time Fourier Transform (STFT).
The STFT uses
successive, overlapped windows from the time domain input signal.
1.2 Simultaneous Frequency Masking and Bark Scale
Simultaneous masking of sound occurs when two sounds are played at the same
time and one
of them is masked or "hidden" because of the other. The formal definition
says that masking
occurs when a test tone or "maskee" (usually a sinusoidal tone) is barely
audible in the presence
of a second tone or "masker." The difference in sound pressure level between
the masker and
maskee is called the "masking level."[ 1 ]
It is easier to measure the masking level for narrow band noise maskers
(with a defined
center frequency) and sinusoidal tone maskees. Figure 2 (a) and (b) display
some curves that
show the masking threshold for different narrow band noise maskers centered
at 70, 250, 1000
and 4000 Hz. The level of all the maskers is 60 dB. The broken line
represents the "threshold in
quiet." Average listeners will not hear any sound below this threshold. Figure 2 (a) uses a linear frequency scale and (b) a logarithmic one.
The shape of all the masking curves is very different across the frequency
range in both
graphs. There are some similarities in the shape of the curves below 500 Hz
in the linear
frequency scale (a), and some similarities above 500 Hz in the logarithmic
frequency scale (b). A
more useful scale has been introduced that is known as "critical band rate"
or "Bark scale." The
concept of the Bark scale is based on the well-researched assumption [ 1 ]
that the basilar
membrane in the hearing mechanism analyzes the incoming sound through a
spatial-spectral
analysis. This is done in small sectors or regions of the basilar membrane
that are called "critical
bands." If all the critical bands are added together in a way that the upper limit of one is the lower limit of the next one, the critical band rate scale is obtained. Also, a new unit has been introduced, the "Bark," which is by definition one critical band wide.
Figure 3 shows the same masking curves from Figure 2 in a Bark scale. Notice
that the shape
of the masking curves is almost identical across the frequency range.
Various approximations
may be used to translate frequency into a Bark scale [ 2 ]:
z = 13\arctan\left( \frac{0.76\, f}{1000} \right) + 3.5\arctan\left( \left( \frac{f}{7500} \right)^{2} \right)    ( 1 )
and [ 3 ]:

z = 26.81\, \frac{f}{1960 + f} - 0.53    ( 2 )
where f is the frequency in Hertz and z is the mapped frequency in Barks.
Eq. ( 1 ) is more accurate, but Eq. ( 2 ) is easier to compute. Figure 3 shows the excitation level of several narrow band noises with diverse center frequencies on a Bark scale.
1.3 Power Spectra
The first step in the frequency domain (linear, logarithmic, or Bark scale) is to calculate the power spectrum of the incoming signal. This is calculated with:
Sp(j\omega) = |Sw(j\omega)|^2 = \mathrm{Re}\{Sw(j\omega)\}^2 + \mathrm{Im}\{Sw(j\omega)\}^2    ( 3 )
The energy per critical band, Spz(z) , is defined as:
Spz(z) = \sum_{\omega = LBZ}^{HBZ} Sp(j\omega)    ( 4 )
where z = 1, 2, \ldots, Zt (the total number of critical bands), and LBZ and HBZ are the lower and upper frequency limits of critical band z.
The power spectrum Sp(j\omega) and the energy per critical band Spz(z) are the basis of the analysis in the frequency domain. They will be used to compute the spread masking threshold.
1.4 Basilar Membrane Spreading Function
A model that approximates the basilar membrane spreading function, without taking into account the change in the upper slope, is defined as [ 3 ]:

B(z) = 15.91 + 7.5(z + 0.474) - 17.5\sqrt{1 + (z + 0.474)^2}    ( 5 )
where z is the normalized Bark scale. Figure 4 shows B(z) .
The auditory model takes the energy in each critical band given by Eq. ( 4 ) and uses Eq. ( 5 ) to calculate the spread masking across critical bands, Sm(z). This is done using:
Sm(z) = Spz(z) * B(z) ( 6 )
This operation is a convolution between the basilar membrane spreading function and the total energy per critical band. A true spreading calculation should include all the components in each critical band, but for the purposes of this algorithm, the use of the energy per critical band Spz(z) is a close approximation. Sm(z) can be interpreted as the energy per critical band after taking into account the masking caused by neighboring bands.
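The paper does not state whether B(z) is applied in dB or converted to linear units before the convolution of Eq. ( 6 ); the sketch below follows the common convention of converting the dB-valued spreading function to linear power first:

```python
import numpy as np

def spreading_function(k):
    """Eq. ( 5 ): basilar membrane spreading function in dB, k in Barks."""
    return 15.91 + 7.5 * (k + 0.474) - 17.5 * np.sqrt(1.0 + (k + 0.474) ** 2)

def spread_energy(Spz):
    """Eq. ( 6 ): Sm(z) = Spz(z) * B(z) (convolution across bands)."""
    k = np.arange(-4, 5)                    # a 9-point kernel, as in Figure 26
    B_lin = 10.0 ** (spreading_function(k) / 10.0)   # assumed dB-to-linear step
    return np.convolve(Spz, B_lin, mode="same")
```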
1.5 Masking Threshold Estimate
1.5.1 Masking Index
There are two different indexes used to model masking. The first one is used
when a tone is
masking noise (masker = tone, maskee = noise), and it is defined to be 14.5
+ z dB below the
spread masking across critical bands Sm(z) . In this case z is the center
frequency of the masker
tone using a bark scale. The second index is used when noise is masking a
tone (masker = noise,
maskee=tone), and is defined to be 5.5 dB below Sm(z) , regardless of the
center frequency [ 4 ].
1.5.2 Spectral Flatness Measure (SFM) and Tonality Factor α
The spectral flatness measure (SFM) is used to determine whether the actual frame is noise-like or tone-like, and then to select the appropriate masking index. The SFM is defined as the ratio of the geometric to the arithmetic mean of Spz(z), expressed in dB as:
SFM_{dB} = 10\log_{10}\left[ \frac{ \left( \prod_{z=1}^{Zt} Spz(z) \right)^{1/Zt} }{ \frac{1}{Zt}\sum_{z=1}^{Zt} Spz(z) } \right]    ( 7 )
with Zt = total number of critical bands in the signal.
The value of the SFM is used to generate the "tonality factor" that helps select the right masking index for the actual frame. The tonality factor is defined in [ 3 ], [ 4 ] as the minimum of the ratio of the calculated SFM_{dB} to a maximum SFM_{dB}, and 1:
\alpha = \min\left( \frac{SFM_{dB}}{SFM_{dBmax}}, 1 \right)    ( 8 )

with SFM_{dBmax} = -60 dB.
Therefore, if the analyzed frame is tone-like, the tonality factor α will be close to 1, and if the frame is noise-like, α will be close to 0. The tonality factor α is used to calculate the masking energy offset O(z), which is defined as [ 3 ], [ 4 ]:

O(z) = \alpha(14.5 + z) + (1 - \alpha)\, 5.5    ( 9 )

The offset O(z) is subtracted from the spread masking threshold to estimate the raw masking threshold Traw(z):
Traw(z) = 10^{\,\log_{10}(Sm(z)) - O(z)/10}    ( 10 )
1.5.3 Threshold Normalization
The use of the spreading function B(z) increases the energy level in each one of the critical bands of the spectrum Sm(z). This effect has to be undone using a normalization technique, to return Traw(z) to the desired level. The energy per critical band calculated with Eq. ( 4 ) is also affected by the number of components in each critical band. Higher bands have more components than lower bands, affecting the energy levels by different amounts. The normalization used [ 4 ] simply divides each one of the components of Traw(z) by the number of points in the respective band, P_z.
Tnorm(z) = \frac{Traw(z)}{P_z}    ( 11 )

where z = 1, 2, \ldots, Zt and P_z is the number of points in each band z.
=
1.5.4 Final Masking Threshold
After normalization, the last step is to take into account the absolute auditory threshold or "hearing threshold." The hearing threshold varies across the frequency range, as stated in Zwicker and Zwicker [ 1 ]. In the proposed auditory model the hearing threshold will
be simplified to use
the worst case threshold (the lowest). That is defined as a sinusoidal tone
of 4000 Hz with one bit
of dynamic range [ 4 ]. These values are chosen based on the data from
experimental research
that shows that the most sensitive range of the human ear is in the range of
2500 to 4500 Hz [ 1 ].
For a frequency of 4000 Hz, the measured sound intensity is 10^{-12} W/m^2, which equals a loudness of 0 phon at that frequency [ 12 ]. The chosen amplitude (one bit) is the smallest possible amplitude value in a digital sound format. The hearing threshold is then calculated with [ 4 ]:
TH = \max\left( Pp(j\omega) \right)    ( 12 )

where Pp(j\omega) is the power spectrum of the probe signal p(t), and p(t) = \sin(2\pi \cdot 4000\, t).
The final threshold T(z) is:
T(z) = max(Tnorm(z),TH) ( 13 )
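Putting sections 1.5.1 to 1.5.4 together, a minimal sketch of the threshold estimate (the argument layout and the scalar TH input are my assumptions):

```python
import numpy as np

def masking_threshold(Spz, Sm, points_per_band, TH):
    """Estimate the final masking threshold T(z), Eqs. ( 7 )-( 13 ).
    Spz: energy per critical band, Eq. ( 4 )
    Sm:  spread energy per critical band, Eq. ( 6 )
    points_per_band: P_z, number of FFT bins in each band
    TH:  hearing threshold from Eq. ( 12 )"""
    Zt = len(Spz)
    geometric = np.exp(np.mean(np.log(Spz)))         # geometric mean of Spz(z)
    arithmetic = np.mean(Spz)                        # arithmetic mean of Spz(z)
    SFM_dB = 10.0 * np.log10(geometric / arithmetic) # Eq. ( 7 )
    alpha = min(SFM_dB / -60.0, 1.0)                 # Eq. ( 8 )
    z = np.arange(1, Zt + 1)
    O = alpha * (14.5 + z) + (1.0 - alpha) * 5.5     # Eq. ( 9 ), offset in dB
    Traw = 10.0 ** (np.log10(Sm) - O / 10.0)         # Eq. ( 10 )
    Tnorm = Traw / points_per_band                   # Eq. ( 11 )
    return np.maximum(Tnorm, TH)                     # Eq. ( 13 )
```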
1.6 Noise Shaping Using the Masking Threshold
The objective of the auditory model is to find a usable masking threshold.
The final masking
threshold is always compared with the values of the power spectrum of the
signal Sp( jw) . This
can be interpreted as "below this threshold, the information is not relevant
for human hearing."
This means that if the frequency components that fall below the masking threshold are removed, the average listener will notice no difference between the original sound signal and the altered version.
Another very important consequence of this is that if these components are not just discarded but replaced with new components, the new components will be, as before, inaudible to the listener. This assumes that the new components do not considerably change the average energy in their critical band. Let the frame with the new components be called N(j\omega). The objective
band. Let the frame with the new components be called N( jw) . The objective
is to use the final
masking threshold to select which components from Sp( jw) can be replaced
with components
from N( jw) . The components of N( jw) are shaped to stay below the final
masking threshold.
The final signal, which includes components from Sw(j\omega) and N(j\omega), ideally retains the perceptual quality of the original signal for the average listener.
The following steps are used to remove the components from Sw(j\omega), shape the vector N(j\omega), and mix them. First, calculate the "new" version of the sound signal (after removing some components):
Swnew(j\omega_i) = \begin{cases} Sw(j\omega_i), & Sp(j\omega_i) \ge T(z) \\ 0, & Sp(j\omega_i) < T(z) \end{cases}    ( 14 )

i = 1, 2, \ldots number of components; z according to component \omega_i
Remove the unneeded components in the N( jw) vector:
Nnew(j\omega_i) = \begin{cases} 0, & Sp(j\omega_i) \ge T(z) \\ N(j\omega_i), & Sp(j\omega_i) < T(z) \end{cases}    ( 15 )

i = 1, 2, \ldots number of components; z according to component \omega_i
Calculate the power spectrum of Nnew( jw) :
Nnewp(j\omega) = |Nnew(j\omega)|^2    ( 16 )
and then, the energy per critical band:
Nnewpz(z) = \sum_{\omega = LBZ}^{HBZ} Nnewp(j\omega)    ( 17 )

where z = 1, 2, \ldots, Zt, and LBZ and HBZ are the lower and upper frequency limits of critical band z.
The shaping is done by applying a factor F_z to each critical band. These factors are given by:

F_z = A\, \frac{\sqrt{T(z)}}{\max\left( |Nnew(j\omega)| \right)}, \quad \omega = LBZ \text{ to } HBZ \text{ for each band } z, \quad z = 1, 2, \ldots, Zt    ( 18 )
The coefficient A is used as the "gain of the noise signal." It varies from 0 to 1 and weights the embedded noise below the masking threshold. The factors F_z are applied using:
Nfinal(j\omega) = Nnew(j\omega)\, F_z, \quad \omega = LBZ \text{ to } HBZ \text{ for each band } z, \quad z = 1, 2, \ldots, Zt    ( 19 )
The final step is to mix both spectra, the altered Swnew(j\omega) and the shaped Nfinal(j\omega), to form the composite signal OUT(j\omega):
OUT( jw) = Swnew( jw) + Nfinal( jw) ( 20 )
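A per-frame sketch of Eqs. ( 14 )-( 20 ); the helper arrays that expand the per-band threshold to individual bins (T_of_bin, band_of_bin) are my assumptions about bookkeeping the paper leaves implicit:

```python
import numpy as np

def shape_and_mix(Sw, Sp, N, T_of_bin, band_of_bin, n_bands, A=0.5):
    """Replace sub-threshold audio components with shaped noise components.
    Sw, N: complex audio and noise spectra; Sp: power spectrum, Eq. ( 3 );
    T_of_bin: threshold T(z) repeated per bin; band_of_bin: band index per bin."""
    below = Sp < T_of_bin
    Swnew = np.where(below, 0.0, Sw)        # Eq. ( 14 ): drop masked bins
    Nnew = np.where(below, N, 0.0)          # Eq. ( 15 ): keep noise only there
    out = np.array(Swnew, dtype=complex)
    for z in range(n_bands):
        idx = band_of_bin == z
        peak = np.abs(Nnew[idx]).max() if idx.any() else 0.0
        if peak > 0.0:
            Fz = A * np.sqrt(T_of_bin[idx][0]) / peak   # Eq. ( 18 )
            out[idx] += Nnew[idx] * Fz      # Eqs. ( 19 ) and ( 20 )
    return out                              # OUT(jw), Eq. ( 20 )
```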
2 SPREAD SPECTRUM
One of the requirements of a watermarking algorithm is that the watermark
should resist
multiple types of removal attacks. A removal attack is considered as
anything that can degrade or
destroy the embedded watermark. Another factor to be considered is that the
masking threshold
of the actual audio signal determines the embedding of the watermark,
because the watermark is
embedded in the "spare components" found using the psychoacoustic auditory
model. From this
point of view, the watermark has to be the least intrusive to the audio
signal, and therefore, the
actual audio data can be seen as the main obstacle for a good watermarking
algorithm. This is
because the audio will use all the needed bandwidth and the watermark will
use what is left after
the auditory model analysis.
The desired watermarking technique should be resistant to degradation
because of:
- The used transmission channel: analog or digital.
- High-level wide-band noise (in this case, the "noise" is the actual audio signal). This is often referred to as a "low signal-to-noise ratio."
- The use of psychoacoustic algorithms on the final watermarked audio.
A communication theory technique that meets the requirements is the "spread
spectrum
technique", as described thoroughly in Simon et al. [ 5 ] and Pickholtz et
al.
[ 6 ]. "Spread spectrum is a means of transmission in which the signal
occupies a bandwidth in
excess of the minimum necessary to send the information; the band spread is
accomplished by
means of a code which is independent of the data, and a synchronized
reception with the code at
the receiver is used for despreading and subsequent data recovery." [ 6 ]
In the following analysis, the process of generating a watermark that will
be embedded in an
audio signal is expressed in spread spectrum terminology. The original audio
signal will be
called "noise" and the bit stream that conforms the watermark sequence will
be the data signal.
The watermark sequence is transformed in a watermark audio signal and then
the audio signal
(noise) is added to it. This process of adding noise to a channel or signal
is called "jamming."
The objective of a jammer in a communication system is to degrade the
performance of the
transmission, exploiting knowledge of the communication system. In the
watermark algorithm
the audio signal (i.e. music) is considered the jammer, and it has much more
power than the
transmitted bit stream (watermark).
2.1 Basic Concepts
The primary challenge that a receiver must overcome is intentional jamming,
especially if the
jammer has much more power than the transmitted signal. Classical
communications theoretical
investigations about additive white Gaussian noise help to analyze the
problem. White Gaussian
noise is a signal which has infinite power spread uniformly over all
frequencies; but even under
these circumstances communication can be achieved due to the fact that on
each of the "signal
coordinates" the power of the noise component is limited (not infinite).
Therefore, if the noise component in the signal coordinates is not too large, communication is possible. This is usually
applied in a typical narrow-band signal, where just the noise components in
the signal bandwidth
are taken into account as possible factors that can do harm to the
communication. With this
knowledge, the best strategy to combat intentional jamming is to select
signal coordinates where
the jammer to signal ratio is the smallest possible.
Assume a communication link with many signal coordinates available to choose
from, and
only a small subset of these is used at any time. If the jammer cannot determine which subset is
being used, it is forced to jam all the coordinates and therefore, all its
power will be distributed
among all the coordinates, with little power in each of them. If the jammer
chooses to jam only
some of the coordinates, the power over each of them is larger, but the
jammer lacks the
knowledge of which coordinates to jam. The protection against the jammer is
enhanced, as more
signal coordinates are available to choose from.
Having a signal of bandwidth W and duration T, the number of coordinates
available is given
by:
N \cong \begin{cases} WT, & \text{non-coherent signals} \\ 2WT, & \text{coherent signals} \end{cases}    ( 21 )
T is the time used to send a standard symbol. To make N larger when T is
fixed, two techniques
can be applied:
- Direct sequence spreading (DS): this is the approach selected in this algorithm.
- Frequency hopping (FH).
The signals created with these techniques are called "spread spectrum
signals."
2.1.1 Models and Fundamental Parameters
The basic system is shown in Figure 5, with the following parameters:
Wss = Total spread spectrum signal bandwidth available
Rb = Data rate ( bits / second )
S = Signal power (at the input of the receiver)
J = Jammer power (at the input of the receiver)
Wss is defined as the total available spread spectrum bandwidth that could
be used by the
transmitter, but it is not guaranteed that it will be used during the actual
transmission. Neither is
it guaranteed that the spectrum will be continuous. Rb is the uncoded bit
data rate used during
transmission. The signal and jammer powers S and J are the average powers at the receiver. These do not change even if the jammer and/or the signal are pulsed.
2.1.2 Jammer Waveforms
The number of possible jammer waveforms that a jammer can apply to a
communication
system is infinite. The principal types include:
- Broadband Noise Jammer: Spreads Gaussian noise of a total power J evenly
over the total
frequency range of the spread bandwidth Wss.
- Partial Band Noise Jammer: Spreads noise of total power J evenly over a frequency range of bandwidth WJ, which is contained in the total spread bandwidth Wss. ρ is the fraction of the total spread spectrum bandwidth that is being jammed.
- Pulse Jammer: Transmits the jammer waveform during a fraction ρ of the time; the average power is J, but the peak power during transmission is higher.
2.2 Coherent Direct-Sequence Systems
Coherent direct-sequence systems use a pseudorandom sequence and a modulator
signal to
modulate and transmit the data bit stream. The main difference between the
uncoded and coded
versions is that the coded version uses redundancy and "scrambles" the data
bit stream before the
modulation is done, and reverses the process at reception. The watermarking algorithm uses the coded scheme, but the uncoded scheme is studied first because it is easier to understand and is the foundation of the coded scheme.
2.2.1 Uncoded Direct-Sequence Spread Binary Phase-Shift-Keying
Uncoded Direct-Sequence Spread Binary Phase-Shift-Keying is known as Uncoded
DS/BPSK. It may be explained with a simple example. BPSK signals are often
expressed as:
s(t) = \sqrt{2S}\, \sin\left( \omega_0 t + \frac{d_n \pi}{2} \right), \quad nT_b \le t < (n+1)T_b, \; n \text{ integer}    ( 22 )
where T_b is the data bit time \left( \frac{1}{R_b} \right), and \{d_n\} is the sequence of data bits, with the possible values 1 or -1 and equal probability of occurrence.
Eq. ( 22 ) can be expressed as:
s(t) = d_n \sqrt{2S}\, \cos(\omega_0 t), \quad nT_b \le t < (n+1)T_b, \; n \text{ integer}    ( 23 )
BPSK can be seen as phase modulation in Eq.( 22 ) or amplitude modulation in
Eq. ( 23 ). The
spectrum of a BPSK signal is usually of the form shown in Figure 6. This is
a \sin^2(x)/x^2 function, and the first null bandwidth is 1/T_b. This shows the minimum
bandwidth needed to
transmit the signal s(t) and to recover it at the receiver.
Spread spectrum theory requires the signal to be spread over a larger
spectrum than the
minimum needed for transmission. The spreading of the direct sequence is
done using a
pseudorandom (PN) binary sequence {c}. The values of this sequence are 1
or -1 and its speed is
N times faster than the {d} data rate. The time, Tc, of each bit on a PN
sequence is known as a
"chip" and is given by:
T_c = \frac{T_b}{N}    ( 24 )
The direct sequence spread spectrum signal has the form:
x(t) = \sqrt{2S}\, \sin\left( \omega_0 t + \frac{d_n c_{nN+k}\, \pi}{2} \right) = d_n c_{nN+k} \sqrt{2S}\, \cos(\omega_0 t)    ( 25 )

nT_b + kT_c \le t < nT_b + (k+1)T_c, \quad k = 0, 1, 2, \ldots, N-1, \quad n \text{ integer}
The signal is very similar to the common BPSK, except that the bit rate is N
times faster and the
power spectrum is N times wider, as shown in Figure 7. The processing gain
is given by:
PG = \frac{W_{SS}}{R_b} = N    ( 26 )

where W_{SS} is the direct sequence spread spectrum bandwidth, \frac{1}{T_c} = N \frac{1}{T_b}.
If the data function is defined as:
d(t) = d_n, \quad nT_b \le t < (n+1)T_b, \; n \text{ integer}    ( 27 )
and the PN sequence is:
c(t) = c_k, \quad kT_c \le t < (k+1)T_c, \; k \text{ integer}    ( 28 )
Eq. ( 25 ) can be expanded as:
x(t) = \sqrt{2S}\, \sin\left( \omega_0 t + \frac{c(t)\, d(t)\, \pi}{2} \right) = c(t)\, d(t)\, \sqrt{2S}\, \cos(\omega_0 t)    ( 29 )
Figure 8 shows the block diagram for the normal DS/BPSK modulation; and
Figure 9 shows
an equivalent model used in the next step of the analysis. Figure 11 shows
the signals d(t) and
c(t) and Figure 12 shows c(t)d(t) with N=6. From Figure 9, the equivalent
form of x(t) is given
by:
x(t) = c(t)s(t) ( 30 )
where s(t) = d(t)\sqrt{2S}\, \cos(\omega_0 t)    ( 31 )
This is the original BPSK signal. The property:
c^2(t) = 1 \text{ for all } t    ( 32 )
is the key point exploited to "recover" the original BPSK signal:
c(t)x(t) = s(t) ( 33 )
If the receiver possesses a copy of the PN sequence and can synchronize the
local copy with
the received signal x(t), it is able to de-spread the signal and recover the
transmitted data.
2.2.1.1 Constant Power Broadband Noise Jammer
A jammer, J(t), with constant power J is shown in Figure 10. The system is
also assumed to
have no noise from the transmission channel. An ideal BPSK demodulator is
assumed after the
received signal y(t) is multiplied by the PN sequence. The channel output
is:
y(t) = x(t) + J (t) ( 34 )
This is multiplied by the PN sequence c(t):
r(t) = c(t)\, y(t) = c(t)\, x(t) + c(t)\, J(t) = s(t) + c(t)\, J(t)    ( 35 )
This term shows the original BPSK signal plus a noise given by c(t)J(t). The
output of the
conventional BPSK detector is then:
r = d\sqrt{E_b} + n    ( 36 )
where d is the data bit for the actual T_b-second interval, E_b = S\, T_b is the bit energy, and n is the equivalent noise component.
n is further defined as:
n = \sqrt{\frac{2}{T_b}} \int_0^{T_b} c(t)\, J(t)\, \cos(\omega_0 t)\, dt    ( 37 )
The usual decision rule for BPSK is:
\hat{d} = \begin{cases} 1, & \text{if } r > 0 \\ -1, & \text{if } r \le 0 \end{cases}    ( 38 )
2.2.2 Coded Direct Sequence Spread Binary Phase-Shift-Keying
Several types of coding techniques can be used that provide extra gain and
force the worst
case jammer to be a constant power jammer. Coding techniques usually require
the data rate to
be decreased or the bandwidth increased because of the redundancy inherent
to the coding. In
spread spectrum systems, coding does not require an increase of the
bandwidth or decrease of the
bit rate. These properties can be seen in a simple example. If k=2 (constraint length), the rate is R=1/2 bits per coded symbol of a convolutional code. For each data bit of the sequence {d}, the encoder generates two coded bits. For the kth transmission interval, the two coded bits are:
a_k = (a_{k1}, a_{k2})    ( 39 )
where:

a_{k1} = d_k, \qquad a_{k2} = \begin{cases} 1, & d_k = d_{k-1} \\ -1, & d_k \ne d_{k-1} \end{cases}    ( 40 )
If Tb is the data bit time, each coded bit time is given by:
T_s = \frac{T_b}{2}    ( 41 )
Defining:

a(t) = \begin{cases} a_{k1}, & kT_b \le t < (k + 1/2)T_b \\ a_{k2}, & (k + 1/2)T_b \le t < (k+1)T_b \end{cases}, \quad k \text{ integer}    ( 42 )
In Figure 11 the uncoded data signal d(t), the PN sequence c(t), and the coded signal a(t) are shown for N=6. In Figure 12 the multiplied signals d(t)c(t) and a(t)c(t) are shown. With
ordinary BPSK, the coded signal a(t) would have twice the bandwidth of the
uncoded signal; but
after spreading with the PN sequence, the final bandwidth is the same as the
original. One of the
simplest coding schemes is the "repeat code." It sends m bits with the same
value, d, for each
data bit. The rate is then R=1/m bits per coded symbol. In this case, the
resulting coded bits are:
a = (a_1, a_2, \ldots, a_m)    ( 43 )

where a_i = d, \quad i = 1, 2, \ldots, m    ( 44 )
Also, each coded bit ai has a transmission time of:
T_s = \frac{T_b}{m}    ( 45 )
It is very important to note that if m<N, the bandwidth of the spread signal
does not change. The
complete coded DS/BPSK system is shown in Figure 13.
The interleaver scrambles the bits in time at the transmission, and the
deinterleaver
reconstructs the data sequence at the receiver. After the interleaver, the
signal is BPSK
modulated and then multiplied by the PN sequence. At this point the
transmitted DS/BPSK
signal looks like the one in Eq. ( 30 ).
x(t) = c(t)s(t)
where s(t) is the common BPSK (with coding). The input at the receiver is
the same as that in
Eq. ( 34 ):
y(t) = x(t) + J (t)
After multiplication with c(t) (de-spreading), it becomes Eq. ( 35 ):
r(t) = s(t) + c(t)J (t)
The output of the detector after the de-interleaver is given by:
r_i = a_i \sqrt{\frac{E_b}{m}} + Z_i\, n_i, \quad i = 1, 2, \ldots, m    ( 46 )
where n_1, n_2, \ldots, n_m are independent zero-mean Gaussian random variables with variance N_J/(2\rho); \rho is the fraction of time that the pulse jammer is on, and Z_i is the jammer state:
Z_i = \begin{cases} 1, & \text{jammer on during transmission of } a_i \\ 0, & \text{jammer off during transmission of } a_i \end{cases}    ( 47 )
with probabilities:

\Pr\{Z_i = 1\} = \rho, \qquad \Pr\{Z_i = 0\} = 1 - \rho    ( 48 )
2.2.2.1 Interleaver and Deinterleaver
The idea of using an interleaver to scramble the data bits at transmission and a deinterleaver to unscramble them at reception causes the pulse jamming interference on each affected data bit to be independent of the others. In the ideal interleaving and deinterleaving process, the variables Z_1, Z_2, \ldots, Z_m become independent random variables. Assume that there is no interleaver and/or deinterleaver in the system shown in Figure 13. The output of the channel is given by:
r_i = d \sqrt{\frac{E_b}{m}} + Z\, n_i, \quad i = 1, 2, \ldots, m    ( 49 )
and because there is no interleaver/deinterleaver:
a_i = d, \quad Z_i = Z, \quad i = 1, 2, \ldots, m    ( 50 )
Also, it is assumed that the jammer was on during the whole data bit
transmission Tb. Because
there is no interleaver/deinterleaver, the optimum decision rule is:
r = \sum_{i=1}^{m} r_i = d \sqrt{m E_b} + Z \sum_{i=1}^{m} n_i    ( 51 )
Eq. ( 38 ) is used as a decision rule:
\hat{d} = \begin{cases} 1, & \text{if } r > 0 \\ -1, & \text{if } r \le 0 \end{cases}
This bit error probability is the same as for uncoded DS/BPSK; this means that without an interleaver/deinterleaver, there is no difference between uncoded systems and simple repeat code systems. Therefore, the use of an interleaver/deinterleaver is mandatory in order to achieve a good error probability against a pulse jammer.
Selection of the decision technique that determines the value of the coded
bits {r} requires
knowledge about the state of the channel. With an ideal
interleaver/deinterleaver, the output of
the channel is given by Eq. ( 46 ):
r_i = a_i \sqrt{\frac{E_b}{m}} + Z_i\, n_i, \quad i = 1, 2, \ldots, m
where Z_1, Z_2, \ldots, Z_m and n_1, n_2, \ldots, n_m are considered to be independent random variables. The decoder takes r_1, r_2, \ldots, r_m and finds d_1, d_2, \ldots, d_m with possible values of 1 or -1. This analysis is
valid only for the instances where the state of the channel is unknown
(there is no information
regarding the state of the jammer signal).
2.2.2.2 Hard Decision Decoder
The hard decision decoder performs a binary decision on each coded bit
received:
\hat{d}_i = \begin{cases} 1, & r_i > 0 \\ -1, & r_i \le 0 \end{cases}, \quad i = 1, 2, \ldots, m    ( 52 )
The final decision in decoding the transmitted bit is:
\hat{d}_k = \begin{cases} 1, & \sum_{i=1}^{m} \hat{d}_i > 0 \\ -1, & \sum_{i=1}^{m} \hat{d}_i \le 0 \end{cases}    ( 53 )
2.2.2.3 Interleaver Matrix
The interleaving techniques improve the performance in pulse jammer environments because they make the noise components statistically independent variables. A block
interleaver with depth I=5 and interleaver span H=15 is shown in Figure 14.
The coded symbols
are written to the interleaver matrix along columns, while the transmitted
symbols are read out of
the matrix along rows. If the coded symbol sequence is x1, x2, x3, \ldots, the sequence that comes out of the interleaver matrix is x1, x16, x31, x46, x61, \ldots. At the receiver, the
deinterleaver performs the
inverse process, writing symbols into rows and reading them by columns. A
jamming pulse of
duration b symbols, with b £ I will result in these jammed symbols at the
deinterleaver output to
be separated at least by H symbols.
2.3 Synchronization of Spread-Spectrum Systems
Because a pseudorandom sequence PN is used at the transmitter to modulate
the signal, the
first requirement at the receiver is to have a local copy of this PN
sequence. The copy is needed
to de-spread the incoming signal. This is done by multiplying the incoming
signal by the local
PN sequence copy. To accomplish a good de-spreading, the local copy has to
be synchronized
with the incoming signal and the PN sequence that was used in the spreading
process.
The process of synchronization is usually performed in two steps: first, a
coarse alignment of
the PN sequence is done with a precision of less than a "chip." This is
called "PN acquisition."
After this, a fine synchronization takes care of the final alignment and
corrects the small
differences in the clock during transmission. This is called "PN tracking."
Theoretically, acquisition and tracking can be done in the same step with a structure of matched filters or correlators searching the incoming signal with high resolution and comparing it with the local PN sequence.
2.3.1 Fast Fourier Transform (FFT) Scalar Filters
These filters are implemented in the frequency domain, and they use the Fast
Fourier
Transform (forward and backward). They work over a set of N samples (usually
in the frequency
domain) [ 7 ]. The block diagram of an adaptive digital filter is shown in
Figure 15.
Where: s(n) is the input signal
n(n) is the noise (unwanted) signal
r(n) is the input to the filter
R(m) is the frequency representation of the signal r(n)
H(m) is the transfer function of the filter
C(m) is the output (in frequency domain) after the filter is applied
G(m) is the transfer function of the post-processing filter
P(m) is the output after the post-processing filter
p(n) is the output signal in the time domain
The following relationships are given:
r(n) = s(n) + n(n)
R(m) = \mathrm{FFT}(r(n))
C(m) = H(m)\, R(m)
P(m) = G(m)\, C(m)
p(n) = \mathrm{FFT}^{-1}(P(m))    ( 54 )
2.3.1.1 High-resolution Detection FFT Scalar Filter
The high-resolution detection filter outputs a peak when the desired signal
s(n) and noise
n(n) are applied to it. The transfer function is given by:
H(m) = \frac{S^*(m)}{|S(m)|^2 + |N(m)|^2}    ( 55 )
This version of high-resolution detection assumes that the noise and the
signal are uncorrelated
(orthogonal). The output of this filter C(m) must be transformed to the time
domain to detect the
level and the position of the peak on the output vector c(n). This position
can be interpreted as
the exact point where the desired signal starts within the processed set of
samples N.
2.3.1.2 Adaptive Filtering
Adaptive filters require a learning process and use adaptation techniques to form the transfer function of the desired filter H(m). The components of the transfer function are updated periodically with actual values taken from the signal or with estimates made using stored data.
The class 1/3 high-resolution detection filter is given by [ 7 ]:
H(m) = \frac{S^*(m)}{\left| \bar{R}(m) \right|^2}    ( 56 )
where S^*(m) is the conjugate of the spectrum of the desired signal to detect, and \bar{R}(m) is the smoothed magnitude spectrum of the actual input of the system.
The bar over R(m) denotes the "smoothing" process. This process is done to estimate the average spectrum of the signal plus noise from the actual input of the system. The smoothing used is called "inner block averaging" or "frequency domain averaging," and it is defined as:
\bar{R}_b(j\omega) = \frac{1}{2\pi}\, R(j\omega) * B(j\omega), \quad \text{or equivalently} \quad \bar{r}_b(t) = r(t)\, b(t)    ( 57 )
The frequency averaging window B(jw) is convolved with the spectrum of the
input signal. This
is equivalent to a temporal weighting of the input r(t) by b(t) in the time
domain. The window is
usually selected to be a percentage of the input vector length.
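A sketch of the class 1/3 detector, Eqs. ( 56 )-( 57 ), with inner block averaging implemented as a moving Hanning-window average of |R(m)| (the 10% window width follows section 3.2.4.2):

```python
import numpy as np

def detect(r, s):
    """High-resolution detection: returns det(t); a sharp peak marks where
    the template s starts inside the input block r."""
    n = len(r)
    R = np.fft.fft(r)
    S = np.fft.fft(s, n)                    # template, zero-padded to n
    w = np.hanning(max(3, n // 10))         # smoothing window, ~10% of n
    R_bar = np.convolve(np.abs(R), w / w.sum(), mode="same")
    H = np.conj(S) / R_bar ** 2             # Eq. ( 56 )
    return np.real(np.fft.ifft(H * R))      # filter output in the time domain

rng = np.random.default_rng(1)
s = rng.standard_normal(64)
r = 0.1 * rng.standard_normal(1024)
r[300:364] += s                             # bury the template at sample 300
print(np.argmax(detect(r, s)))              # peak at (or very near) 300
```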
3 PROPOSED SYSTEM
Different systems have been applied to watermarking of audio signals. All of them are classified as "steganographic systems" because they deal with the concept of hiding data within the signal. Boney et al. [ 23 ] proposed a system where a PN sequence was filtered using a filter that approximated the masking characteristics of the human auditory system in the frequency and time domains. Some other techniques have been imported from the fields of video and still image watermarking. Cox [ 24 ] proposes a multiplatform system capable of extracting a pseudorandom sequence without the use of the original unwatermarked data.
The watermarking algorithm proposed in this paper mixes the psychoacoustic
auditory model
and the spread spectrum communication technique to achieve its objective. It comprises two main steps: first, the watermark generation and embedding, and second, the watermark recovery. The watermark generation and embedding process is shown in Figure
16. A bit stream
that represents the watermark information is used to generate a noise-like
audio signal using a set
of known parameters to control the spreading. At the same time, the audio
(i.e. music) is
analyzed using a psychoacoustic auditory model. The final masking threshold
information is
used to shape the watermark and embed it into the audio. The output is a
watermarked version of
the original audio that can be stored or transmitted.
The watermark recovery is shown in Figure 17. The input is the watermarked
audio after
transmission (i.e. music + noise, low quality, etc). An auditory psychoacoustic model is used to generate a residual. At the same time, the known parameters are used to generate the header of the watermark. Using an adaptive high-resolution filter, the entire residual is scanned to find all the occurrences of the known header and therefore the initial position of each possible watermark.
After this, the same known parameters used to generate the header are used
to de-spread and
recover the watermark.
3.1 WATERMARK GENERATION AND EMBEDDING
3.1.1 WATERMARK GENERATION
The objective of the watermark generation is to generate a watermark audio
signal x(t) that
contains the watermark bit stream data. This watermark signal can be
transmitted and then
processed for data recovery. The technique used to generate the watermark
signal x(t) is "coded
DS/BPSK spread spectrum." The process is condensed in Figure 18.
where: {w} is the original digital bit stream (watermark)
m is the repetition code factor
{w}_R is the watermark after the coding process (repeat code)
I, H = width and length of the interleaver matrix
{w}_I is the watermark after the interleaver process
{header} is the header sequence
{d} = {header} + {w}_I is the sequence to be spread and transmitted
f0 = frequency used by the BPSK modulator
The process can be explained with a simple example: Let {w} be the watermark
bit stream.
All the bit streams used are bipolar (value 1 or -1). Defining {w} with a length of 16 bits as the sequence:

{w} = { 1 1 -1 1 -1 -1 1 -1 1 1 -1 1 1 1 -1 -1 }
Using Eq. ( 43 ) to generate the repeat code, and choosing m=3, the {w}_R sequence is:

{w}_R = {  1  1  1  1  1  1 -1 -1 -1  1  1  1
          -1 -1 -1 -1 -1 -1  1  1  1 -1 -1 -1
           1  1  1  1  1  1 -1 -1 -1  1  1  1
           1  1  1  1  1  1 -1 -1 -1 -1 -1 -1 }
The next step is to perform interleaving. To do this, the values of the interleaving matrix are chosen; in this case, I=5, H=10 (see Figure 14). The resulting matrix is shown in Figure 19. The last two spaces are padded with 1's. Using the interleaving matrix, the output sequence {w}_I is:

{w}_I = {  1  1  1 -1  1  1  1 -1 -1  1  1 -1
          -1 -1 -1  1 -1 -1  1 -1  1 -1  1  1
          -1  1 -1  1  1 -1 -1 -1  1  1 -1 -1
          -1  1  1 -1 -1  1  1  1  1  1  1  1
           1  1 }
The selected header is a sequence usually composed of 1's:

{header} = { 1 1 1 1 1 1 1 1 1 1 }

The final data sequence {d} is obtained by concatenating the {header} and {w}_I:
{d} = {header} + {w}_I

{d} = {  1  1  1  1  1  1  1  1  1  1  1  1
         1 -1  1  1  1 -1 -1  1  1 -1 -1 -1
        -1  1 -1 -1  1 -1  1 -1  1  1 -1  1
        -1  1  1 -1 -1 -1  1  1 -1 -1 -1  1
         1 -1 -1  1  1  1  1  1  1  1  1  1 }
The PN sequence {c} can be generated by any means. Usually this is done
using a
pseudorandom number generator. In this case, the PN sequence is assumed to
be long enough to
spread a complete bit stream (header and data) without repeating any portion
of it. The important factor is that the transmitter and the receiver must have a copy of the whole PN sequence {c}. This sequence is ideally uncorrelated with the {d} sequence, and has the form:

{c} = { 1 -1 1 1 -1 -1 1 -1 1 1 -1 1 ... }
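The whole construction of {d} in this example can be reproduced with a short sketch (the reshape expresses the write-by-columns/read-by-rows interleaver):

```python
import numpy as np

def build_data_sequence(w, m, I, H, header_len=10):
    """Repeat code (m copies per bit), block interleave (I columns of H
    symbols, padded with 1's), then prepend the all-1's header."""
    w_R = np.repeat(w, m)                                 # repeat code
    padded = np.concatenate([w_R, np.ones(I * H - len(w_R), dtype=int)])
    w_I = padded.reshape(I, H).T.ravel()                  # interleave
    return np.concatenate([np.ones(header_len, dtype=int), w_I])

w = np.array([1, 1, -1, 1, -1, -1, 1, -1, 1, 1, -1, 1, 1, 1, -1, -1])
d = build_data_sequence(w, m=3, I=5, H=10)
print(len(d))    # -> 60 bits: 10 header + 50 interleaved coded bits
```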
3.1.1.1 Spread Spectrum Parameter Selection
Audio signals are usually considered to be baseband signals [ 21 ]. The
described spread
spectrum technique can be applied to passband systems (with f0>0) or
baseband systems (f0=0)
without losing generality. The selection of all the parameters is based on
the considerations of
how the overall watermarked audio signal will be transmitted or stored. The
frequency response
of those systems determines which frequencies are likely to be present at
the receiver. Let a
baseband bandlimited signal, with no modulation (f0=0) have the magnitude
spectrum shown in
Figure 20. With amplitude modulation (f0>0), the spectrum will have the form
shown in Figure
21. FS is the sampling frequency of the system. To avoid aliasing because of
the use of
modulation, the modulation frequency should be:
18
Rc
FS
Rc £ f £ -
2 0 ( 58 )
If a system possesses a lower frequency limit LF and/or an upper frequency limit HF, the modulation frequency f0 has to be selected in a way that the sidebands fall between the lower and upper limits, as shown in Figure 22. If a sideband falls outside of these limits, aliasing or data loss could result. Taking this into account, the selection of parameters should be done using:
LF + R_c \le f_0 \le HF - R_c, \qquad LF \ge 0, \qquad HF \le \frac{FS}{2}    ( 59 )
The parameters selected must satisfy Eq. ( 58 ) and Eq. ( 59 ), along with
the following
relationships:
Rd = data bits per second
m = repetition code factor
N = spreading factor, Eq. ( 26 )
Rb = Rd*m = coded bits per second
Tb = 1/Rb = time of each coded bit
Rc = N*Rb = PN sequence bits per second
Tc = Tb/N = time of each PN bit or "chip"
Assuming a frequency response similar to FM Radio [ 22 ] with LF = 50 Hz and
HF = 15000
Hz, for the actual example, a set of spread spectrum parameters that satisfy
all the requirements
is:
N = 3
m = 3
Rd = 100 bits/sec
Rb = 300 bits/sec
Rc = 900 bits/sec
f0 = 3500 Hz
Note that N and m are selected with small values for this example. The
modulation is done using
Eq. ( 31 ):
s(t) = d(t)\sqrt{2S}\, \cos(\omega_0 t)
The spreading is done using Eq. ( 30 ):
x(t) = c(t)s(t)
The output of the system is the watermarked audio waveform x(t) shown in
Figure 23.
3.1.2 FRAME SEGMENTATION
To overcome the potential problem of the audio signal or the watermark signal being too long to be processed using a single FFT, the signal is segmented into short overlapping segments, processed, and added back together [ 8 ]. Another consideration for the
watermark algorithm is
that the audio signal has to be longer than the watermark signal. Therefore,
the watermark can be
repeated several times during the duration of the audio signal. This
redundancy is one of the
important features in the watermarking algorithm. Figure 24 shows audio and
watermark signals
that will be segmented. The watermark is repeated several times. If the
total length of the audio
signal is LENGTH samples, the desired length of the analysis frame is BLOCK
samples, and the
overlap between consecutive frames is OVERLAP samples, the total number of FRAMES is given by:

FRAMES = \frac{LENGTH - OVERLAP}{BLOCK - OVERLAP}    ( 60 )
In Figure 24, two equal-length frames were selected to be processed: one from the audio signal and the other from the respective point in the watermark signal. The last frame is zero-padded if it is shorter than BLOCK samples. These padded samples are discarded in post-processing. From this point on, all processes described are applied to the audio or watermark signal frames, not the entire signal.
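A sketch of the segmentation, reading Eq. ( 60 ) with a ceiling so that a final short frame is kept and zero-padded:

```python
import numpy as np

def frame_count(length, block, overlap):
    """Eq. ( 60 ): number of analysis frames."""
    return int(np.ceil((length - overlap) / (block - overlap)))

def frames(signal, block, overlap):
    """Yield BLOCK-sample frames with OVERLAP-sample overlap;
    the last frame is zero-padded (padding is discarded later)."""
    hop = block - overlap
    for i in range(frame_count(len(signal), block, overlap)):
        frame = signal[i * hop : i * hop + block]
        yield np.pad(frame, (0, block - len(frame)))
```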
3.1.3 FREQUENCY REPRESENTATION
The Short Time Fourier Transform (STFT) is used to acquire a frequency
representation of
the actual frames. Before doing the STFT, a Hamming window is applied to
both signals [ 7 ], [ 8
]. This improves the representation of the signal in the frequency domain by reducing leakage. If
s(t) is the actual audio signal frame and x(t) the actual watermark signal
frame, then the
windowing is done using:
sw(t) = s(t)w(t) ( 61 )
xw(t) = x(t)w(t) ( 62 )
The Hamming window is defined as:
w(n) = 0.54 + 0.46\cos\left( \frac{2\pi n}{BLOCK} \right), \quad n = 1, 2, \ldots, BLOCK    ( 63 )

w(t) = w(nT), \quad T = \text{sampling period}
The frequency representation of the audio frame is:
Sw( jw) = FT{sw(t)} ( 64 )
and the watermark frame:
Xw( jw) = FT{xw(t)} ( 65 )
The power spectrum is found using Eq. ( 3 ):

Sp(j\omega) = |Sw(j\omega)|^2    ( 66 )
The indices of the actual frequency representations have to be mapped to the
Bark scale.
Once this index mapping is done, the representation in the critical band
scale is formed by
mapping the components to the respective position on the critical band axis.
The relationship
between each component index, i, and the corresponding frequency, fi, that
it represents is given
by:
f_i = \frac{(i-1)\, FS}{BLOCK}, \quad i = 1, 2, \ldots, \frac{BLOCK}{2}, \quad FS = \text{sampling frequency}    ( 67 )
The relationship between each frequency f_i and the Bark scale or critical band scale z_i is found using Eq. ( 1 ):
z_i = 13\arctan\left( \frac{0.76\, f_i}{1000} \right) + 3.5\arctan\left( \left( \frac{f_i}{7500} \right)^{2} \right)
This relationship between each component index i and the frequency fi or
critical band zi that it
represents can be calculated at the beginning of the algorithm and stored in
a table. The energy
per critical band is calculated using Eq. ( 4 ):
Spz(z) = \sum_{\omega = LBZ}^{HBZ} Sp(j\omega)
where z = 1, 2, \ldots, Zt (the total number of critical bands), and LBZ and HBZ are the lower and upper frequency limits of critical band z. Figure 25 (a) shows the original audio frame s(t) in
the time domain and the
shape of the Hamming window w(t); (b) shows the sw(t) frame after the
windowing process; (c)
shows the magnitude of Sw( jw) , and (d) shows the power spectrum Sp( jw)
and the energy per
critical band Spz(z) .
3.1.4 BASILAR MEMBRANE SPREADING FUNCTION
The basilar membrane spreading function determines how much of the energy of
each critical
band is contributed to the neighboring bands. The spreading function B(z) is
calculated using Eq.
( 5 ):
B(k) = 15.91 + 7.5(k + 0.474) - 17.5\sqrt{1 + (k + 0.474)^2}, \quad k = \ldots, -2, -1, 0, 1, 2, \ldots
The spreading across bands is computed by the convolution of the spreading
function B(z) and
the energy per critical band Spz(z) ,using Eq. ( 6 ):
Sm(z) = Spz(z) * B(z)
Figure 26 (a) shows the energy per critical band Spz(z) , (b) shows the
spreading function B(z)
for 9 points, and (c) shows the spread energy per critical band Sm(z).
3.1.5 MASKING THRESHOLD ESTIMATE
The Spectral Flatness Measure (SFM) of the actual audio frame Sw( jw) is
taken using
Eq. ( 7 ):
SFM_{dB} = 10\log_{10}\left[ \frac{ \left( \prod_{z=1}^{Zt} Spz(z) \right)^{1/Zt} }{ \frac{1}{Zt}\sum_{z=1}^{Zt} Spz(z) } \right]

with Zt = total number of critical bands in each frame.
The energy per critical band Spz(z) is used rather than the spread energy per critical band Sm(z) to avoid false results due to smoothing of the signal. The tonality factor α is then calculated using Eq. ( 8 ):
\alpha = \min\left( \frac{SFM_{dB}}{SFM_{dBmax}}, 1 \right)

with SFM_{dBmax} = -60 dB.
The masking energy offset O(z) is then calculated using Eq. ( 9 ):
O(z) = \alpha(14.5 + z) + (1 - \alpha)\, 5.5
The raw masking threshold, Traw(z), is calculated with Eq. ( 10 ):
Traw(z) = 10^{\,\log_{10}(Sm(z)) - O(z)/10}
The raw masking threshold is normalized using Eq. ( 11 ):
Tnorm(z) = \frac{Traw(z)}{P_z}

where z = 1, 2, \ldots, Zt and P_z is the number of points in each band z.
To calculate the final masking threshold T it is necessary to first
calculate the hearing
threshold (or threshold in quiet) TH. It is defined as a sinusoidal tone of
4000 Hz with one bit of
dynamic range. Using Eq. ( 12 ):
TH = \max\left( Pp(j\omega) \right)

where Pp(j\omega) is the power spectrum of the probe signal p(t), and p(t) = \sin(2\pi \cdot 4000\, t).
Then the final masking threshold T is calculated using Eq.( 13 ):
T(z) = max(Tnorm(z),TH)
with z = 1, 2, \ldots, Zt
Figure 27 (a) shows the raw masking threshold Traw(z) and (b) shows the
normalized
threshold Tnorm(z).
3.1.6 WATERMARK SPECTRAL SHAPING
The final masking threshold T is used to determine which components of the
audio signal
Sw( jw) can be removed without affecting the perceptual quality of the
signal. The power
spectrum Sp( jw) is compared against the final masking threshold T. The
components that fall
below it are removed in Sw( jw) . The new frame with only the components
above the threshold
is called Swnew( jw) . Eq. ( 14 ) is used:
Swnew(j\omega_i) = \begin{cases} Sw(j\omega_i), & Sp(j\omega_i) \ge T(z) \\ 0, & Sp(j\omega_i) < T(z) \end{cases}

i = 1, 2, \ldots number of components; z according to component \omega_i
Then the unneeded components of the watermark signal Xw( jw) are removed.
These
components correspond to the non-removed components in Sw( jw) . Eq. ( 15 )
is used:
Xwnew(j\omega_i) = \begin{cases} 0, & Sp(j\omega_i) \ge T(z) \\ Xw(j\omega_i), & Sp(j\omega_i) < T(z) \end{cases}

i = 1, 2, \ldots number of components; z according to component \omega_i
The factors that will shape the new watermark Xwnew( jw) are found using Eq.
( 18 ):
F_z = A\, \frac{\sqrt{T(z)}}{\max\left( |Xwnew(j\omega)| \right)}, \quad \omega = LBZ \text{ to } HBZ \text{ for each band } z, \quad z = 1, 2, \ldots, Zt
The square root of the final threshold is divided by the maximum magnitude component of the new watermark in each critical band. Each one of these factors is scaled using the gain A, which varies from 0 to 1 and controls the overall magnitude of the watermark signal in relation to the audio signal.
Each one of the components in each critical band z is scaled by the corresponding factor using Eq. ( 19 ):
Xfinal(j\omega) = Xwnew(j\omega)\, F_z, \quad \omega = LBZ \text{ to } HBZ \text{ for each band } z, \quad z = 1, 2, \ldots, Zt
Figure 28 shows the final masking threshold and the watermark signal before
shaping (a) and
after shaping (b). Note that the watermark falls below the masking threshold. The factor A controls how much gain the watermark has relative to the masking threshold (A is a value from 0 to 1).
3.1.7 AUDIO AND WATERMARK SIGNAL COMBINATION
The final output OUT( jw) is the sum of the new audio, Swnew( jw) , and the
final
watermark Xfinal( jw) . This is given by the Eq. ( 20 ):
OUT( jw) = Swnew( jw) + Xfinal( jw)
Figure 29 shows the final masking threshold Tfinal(z), and the power
spectrum of (a) Swnew(jw),
(b) Xfinal(jw), and (c) OUT(jw).
3.1.8 TRANSFORMATION TO THE TIME DOMAIN
The Inverse Fourier Transform is used to convert the frequency domain
information back to
the time domain.
out(t) = IFT{OUT( jw)}
This output frame out(t) is added at the corresponding point to the total time-domain output output(t). The next frames of audio and watermark signals are taken, and the process is repeated.
3.2 DATA RECOVERY
The watermarked audio signal is intended to be transmitted through a diverse
number of
channels. In some cases, the channel will introduce noise, convert several
times from digital to
analog and analog to digital, or even use a psychoacoustic auditory model to
process the audio
signal. The watermark bit stream should survive the transmission and be
recoverable.
A very important characteristic is that the developed system does not require access to the original audio signal (before watermarking) to extract the watermark at the receiver. The process of recovery uses the psychoacoustic auditory model, but in this case the goal is to remove all the audio components that have a lower probability of belonging to the watermark signal. This means
that the masking threshold is calculated and the components above it are removed. The final signal is the "residual." This residual is then analyzed to find the possible points where the watermark is present. If some criterion is applied, the majority of the falsely detected points can be eliminated (e.g., rejecting points spaced too closely to fit a watermark).
Synchronization and recovery of
the watermark bit stream are then performed.
3.2.1 MASKING THRESHOLD AND RESIDUAL SIGNAL
The watermarked audio signal after transmission is symbolized as s2(t). The process described in sections 3.1.2 to 3.1.5 is used to calculate the frames sw2(t), the frequency representation Sw2(j\omega), and the masking threshold T2, respectively. The residual signal R(j\omega) is defined as the signal composed of the components below the masking threshold. Eq. ( 14 ) can be changed to:
R(j\omega_i) = \begin{cases} Sw2(j\omega_i), & Sp2(j\omega_i) \le T2(z) \\ 0, & Sp2(j\omega_i) > T2(z) \end{cases}    ( 68 )

i = 1, 2, \ldots number of components; z according to component \omega_i
3.2.2 RESIDUAL EQUALIZATION
The spectrum of the residual R( jw) is then shaped to be flat. Eq. ( 18 )
can be modified to
shape all the maximum components of each band to be at equal levels. The
factors are found
using:
F_z = \frac{1}{\max\left( |R(j\omega)| \right)}, \quad \omega = LBZ \text{ to } HBZ \text{ for each band } z, \quad z = 1, 2, \ldots, Zt    ( 69 )
Each one of the components in each critical band z is scaled by the
corresponding factor Fz using
Eq. ( 19 ):
Rfinal(j\omega) = R(j\omega)\, F_z, \quad \omega = LBZ \text{ to } HBZ \text{ for each band } z, \quad z = 1, 2, \ldots, Zt
3.2.3 TIME DOMAIN RESIDUAL
The residual is taken back to the time domain using the Inverse Fourier
Transform IFT.
r(t) = IFT{Rfinal( jw)}
The time domain r(t) frame is added to the total time domain residual signal
residual(t) at the
point specified by the frame segmentation step. The next frame is then
processed.
3.2.4 SYNCHRONIZATION WITH WATERMARK HEADER
To be able to synchronize and to have a good de-spreading of the watermark
signal, it is
necessary to have knowledge of the parameters used at the generation of the
watermark signal,
such as f0, Tb, m, H, I, N, {header},{c}, etc.
3.2.4.1 header(t) Signal Generation
The first step is to generate a header(t) waveform signal using the process
of section 3.1.1,
except that only the {header} sequence is used as the input sequence. This
audio signal will be
used to locate the exact positions of the watermark signals in the
residual(t) signal. Frame
segmentation as explained in section 3.1.2 is also required in order to
analyze the whole
residual(t) signal. The parameters for the frame segmentation are chosen to
have up to two
header(t) signals in each frame. Therefore, BLOCK is equal to twice the
number of samples in
header(t), and OVERLAP is equal to one half the number of samples in
header(t). The resulting
frame taken from residual(t) with BLOCK length is called r(t).
3.2.4.2 header(t) Position Detection
Eq. ( 56 ) describes an adaptive high-resolution filter that can be used to
detect the presence
of header(t) in the r(t) frame and therefore, all the occurrences of
header(t) in the residual(t)
audio signal.
H(j\omega) = \frac{HEADER^*(j\omega)}{\left| \bar{R}(j\omega) \right|^2}

where R(j\omega) = \mathrm{FFT}(r(t)) and HEADER(j\omega) = \mathrm{FFT}(header(t)).
The denominator of the filter is the smoothed version of |R(j\omega)|^2. Smoothing is done using Eq. ( 57 ), where the smoothing window is a Hanning window with a width of 10% of the vector length. The output of the filter applied to R(j\omega) is:
DET(j\omega) = R(j\omega)\, \frac{HEADER^*(j\omega)}{\left| \bar{R}(j\omega) \right|^2}
This result is transformed to the time domain to be analyzed.
det(t) = real(IFFT(DET( jw)))
A typical output of the filter, det(t), is shown in Figure 30. The peak
shows the position in
samples where the header(t) signal starts in the frame r(t). This detection
is done for all the
frames in the residual(t) signal, and all the positions of the peaks are
stored for further analysis.
A proposed criterion of analysis is to determine the minimum distance between peaks, to decide which ones are most likely to represent the start of a watermark signal.
3.2.5 WATERMARK DE-SPREADING
For each peak position found in the residual(t), a selected frame y(t) with
the same length as
the watermark signal is processed. This process is shown in Figure 31. Using
Eq. ( 35 ):
r(t) = c(t) y(t)
Demodulation is performed using Eq. ( 31 ):

g(t) = r(t)\, \sqrt{\frac{2}{T_b}}\, \cos(2\pi f_0 t)
To estimate the bit stream:

r_i = \int_{(i-1)T_s}^{\,i T_s} g(t)\, dt, \quad i = 1, 2, \ldots \text{ total bits in bit stream}    ( 70 )
The decision rule, to form a recovered bit stream \{\hat{d}\}, is given by Eq. ( 38 ):

\hat{d}_i = \begin{cases} 1, & \text{if } r_i > 0 \\ -1, & \text{if } r_i \le 0 \end{cases}, \quad i = 1, 2, \ldots \text{ total bits in bit stream}
After this decision, the {header} sequence is discarded from the {d^} bit stream. This produces the
bit stream {w^I}.
3.2.6 WATERMARK DE-INTERLEAVING AND DECODING
The de-interleaving process is done using the same matrix used in the
watermark generation
in section 3.1.1 and shown in Figure 14. The bits are written into rows and
read by columns to
accomplish the de-interleaving process. The de-interleaved sequence is called {w^R}. The
decoding of the repeat code of value m is done using Eq. ( 53 ):
w^_k = 1,   if Sum[i=1..m] w^_Ri > 0
w^_k = -1,  if Sum[i=1..m] w^_Ri <= 0
k = 1,2,... total bits in data sequence
The final recovered sequence {w^} is the recovered watermark.
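Taken together, sections 3.2.5 and 3.2.6 form a de-spread, demodulate,
integrate-and-dump, de-interleave and majority-vote chain. The sketch below
expresses that chain under stated assumptions: spb (samples per coded bit)
and the other argument names are illustrative, c is the PN sequence already
aligned with the frame, and the 2/Tb factor of Eq. ( 31 ) is dropped because
only the signs of the integrals matter.

import numpy as np

def recover_watermark(y, c, n_header, f0, fs, spb, I, H, m):
    """Recover the watermark bits from a frame y(t) found at a detected
    header position. c is the PN sequence sampled at the audio rate;
    spb is the number of samples per coded bit (assumed name)."""
    t = np.arange(len(y)) / fs
    r = c * y                                        # Eq. ( 35 ): de-spreading
    g = r * np.cos(2 * np.pi * f0 * t)               # Eq. ( 31 ), scale dropped
    n_bits = len(g) // spb
    r_i = g[:n_bits * spb].reshape(n_bits, spb).sum(axis=1)   # Eq. ( 70 )
    d_hat = np.where(r_i > 0, 1, -1)                 # decision rule, Eq. ( 38 )
    w_I = d_hat[n_header:]                           # discard the {header}
    # De-interleave (Figure 14): write into rows, read out by columns
    w_R = w_I[:I * H].reshape(H, I).flatten(order='F')
    # Undo the repeat code by majority vote over each group of m, Eq. ( 53 )
    k = len(w_R) // m
    return np.where(w_R[:k * m].reshape(k, m).sum(axis=1) > 0, 1, -1)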
4 SYSTEM PERFORMANCE
4.1 SURVIVAL OVER DIFFERENT CHANNELS
A watermarking system was implemented using a well-known mathematical software
package. The system was composed of two modules: watermark generation and
embedding, and
watermark recovery. The watermark was first generated and embedded in an
audio signal. The
watermarked signal was then tested for recovery of the watermark after
transmission by different
channels, such as sub-band encoding, digital-to-analog/analog-to-digital
conversion, and radio transmission.
The music used was a 26 second excerpt of the song "In the Midnight Hour"
(W. Pickett &
S. Cropper) performed by The Commitments. A sampling frequency of 44.1 kHz
was used. Each
of the watermarked audio signals was labeled to reflect the level of the
watermark below the
masking threshold (the A value), i.e. W2, W4, W6 and W8. With these
parameters, a total of 35
watermarks were embedded during the duration of each signal. The four
watermarked music
signals and the original signal were recorded digitally on a compact disc.
The computer was also
equipped with a full-duplex sound card with D/A-A/D converters. All the
radio systems were simulated using a multiplex stereo modulator, an FM/AM
signal generator, and an ordinary consumer CD player and FM/AM radio
receiver. The percentage of correct bits recovered
per watermark
was measured before and after transmission. Two examples are shown in Figure
32 and Figure
33. The percentage of correct bits before transmission is the continuous
line, and the percentage
of correct bits after transmission is the dotted line. Also, the offset from
the expected starting
point of each watermark after transmission is measured (in samples), as well
as the total of
watermarks recovered and the average recovery percentage.
4.2 LISTENING TEST
One of the requirements of the watermarking system is to retain the
perceptual quality of the
signal. This is often referred to as "transparency." The transparency of the
watermarking
algorithm was tested using three of the four watermarked audio signals (W2,
W4 and W6) used
in section 4.1. An ABX listening test was used as the testing mechanism. In
an ABX test the
listener can hear selection A (in this case the non-watermarked audio),
selection B (the
watermarked audio) and X (either the watermarked or non-watermarked audio).
The listener is
then asked to decide if selection X is equal to A or B. The number of
correct answers is the basis
to decide if the watermarked audio is perceptually different than the
original audio and would,
therefore, declare the watermarking algorithm as "non-transparent." In the
other case, if the
watermarked audio is perceptually equal to the original audio, the
watermarking algorithm will
be declared as "transparent."
Using the theory explained in Burstein [ 19 ], [ 20 ], different parameters
were selected to
find an appropriate sample size. A criterion of significance a'=0.1 is selected
(the Type 1 error risk). The Type 2 error risk is set at b'=0.1. The probability p1
that a listener finds
that a listener finds
the right answer by chance is 0.5 in an ABX system. The effect size is
selected as p2=0.7. With
these parameters, the approximated required sample size that meets the
specifications is 37.61
samples. The sample size is selected as n=40. (40 listeners per ABX set).
The critical c (c') is the
minimum number of correct samples which, together with n and p1, can produce
a significance
level a equal to or less than the specified criterion of significance a'.
The calculated c' is 24.55
and can be rounded off to 25. This is the minimum number of correct answers
to accept the
hypothesis that the listener perceives differences between audio A and B.
With c'=25, the
criterion of significance becomes a'=0.78, which is below the required
level. The type 2 error
risk b'=1.11 and does not exceed desired level. The results and their
approximate significance
level are shown in Table 1.
Signal   Sample Size   Correct Identifications   a
W2       40            24                        0.14
W4       40            19                        0.50
W6       40            19                        0.50

Table 1. Listening test results
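These design figures can be checked directly against the binomial
distribution. The short computation below uses exact tail sums rather than
the approximation formulas of [ 19 ], [ 20 ], so it reproduces the quoted
values only to within rounding:

from math import comb

n = 40                 # trials per ABX set
p2 = 0.7               # assumed effect size

# Significance of c' = 25 correct answers under pure guessing (p1 = 0.5):
a = sum(comb(n, k) for k in range(25, n + 1)) / 2 ** n   # ~0.08 (vs. 0.078)

# Type 2 error risk: a listener who truly hears the difference (p2 = 0.7)
# nevertheless scores below c' = 25:
b = sum(comb(n, k) * p2 ** k * (1 - p2) ** (n - k) for k in range(25))  # ~0.11

print(a, b)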
4.3 DISCUSSION
The survival over different channels showed that after encoding, not all the
watermarks could
be recovered with 100% accuracy. This occurs because of the multiple factors
that affect the
quality of the embedded watermark, such as: the number of audio components
replaced, the gain
of the watermark, and the masking threshold. It is important to note that in
some frames the
watermark information can be very weak, even null. The spread spectrum
technique employed
can partially solve these problems, but if many consecutive frames have no
watermark
information, that specific watermark cannot be recovered.
The theoretical position of each watermark plus the measured offset gives the
actual starting position of that watermark's {header}. This offset will not
affect the recovery of the watermark, because each watermark is embedded
independently of the others. In the actual tests three different cases are
seen: almost no offset, linearly increasing offset and varying offset.
When no offset is seen, the original signal and the recorded signal after
transmission were played at the same speed. In the cases where the offset
increases linearly, it is assumed that the speed of the playback device (in
this case an ordinary consumer CD player) was different (slightly slower)
from that of the recording device. The last case shows the unstable speed
variations of the tape device. If
the speed of the playback device is close enough to the original speed, the
de-spreading can be
successful because the difference in alignment between the watermarked audio
and the de-spreading signals (PN sequence, demodulator and {header}) will not
greatly affect the final result.
Finally, the percentage of correct bits recovered measures the quality of the
recovery for each watermark. Notice that not all the watermarks are recovered
(%bits = 0.0), and not all are recovered in their totality, but many of them
were recovered with more than 80% of the bits. A good bit error
detection/correction algorithm or averaging
substantially improve the recovery of the watermark. A very strong point in
the watermarking
system is the redundancy of watermarks embedded into the audio stream. In
this case, each
watermark lasts approximately 600 ms. Even if just a few watermarks are
recovered, the goal of
transmitting the watermark information within the audio signal and
recovering it afterwards is
accomplished.
The listening test showed that the watermark at 2 dB below the masking
threshold (W2) is the most likely to be heard, but it cannot be ensured that
listeners actually noticed the difference.
For all the other watermarked signals, the results show that the process is
"transparent."
5 CONCLUSIONS
The proposed digital watermarking method for audio signals is based on a
psychoacoustic
auditory model to shape an audio watermark signal that is generated using
spread spectrum
techniques. The method retains the perceptual quality of the audio signal,
while being resistant to
diverse removal attacks, either intentional or unintentional. The recovery
of the watermark is
accomplished without knowledge of the original audio signal. The only
information used
includes the watermarked audio signal, and the parameters used for the
watermark generation.
The psychoacoustic auditory model retrieves the necessary information about
the masking
threshold of the input audio signal. This model is a good approach that can
be used for several applications, such as perceptual coding, masking
analysis, or watermark embedding. The spread
spectrum theory describes two important Direct Sequence techniques, but the
employed
technique is Coded Direct-Sequence Spread Binary Phase-Shift-Keying (coded
DS/BPSK).
Because the literature on this topic is oriented toward communication
theory, some assumptions were made to apply it in an audio-bandwidth environment.
Specifically in this
case, the audio information was considered the "noise" or "jammer" signal
that interferes with
the watermark.
Future research could address different aspects of the proposed algorithm,
such as:
- System performance with different types of music.
- Experimenting with different spread spectrum encoding parameters.
- Changes in the playback speed of the signal.
- Crosstalk interference.
- Multiple watermark embedding.
- Use of techniques to enhance recovery of the watermark (i.e., bit error
detection/ correction,
averaging, etc).
- Real-time implementation.
- Different signal schemes for the generation of the PN sequence.
6 ACKNOWLEDGMENT
The author wishes to thank Professors Ken Pohlmann and Will Pirkle from the
Music Engineering program at the University of Miami for their valuable
advice and feedback. Thanks also to music engineer Alex Souppa for his help
as technical editor and English-language corrector of the author's master's
thesis.
7 REFERENCES
[ 1 ] E. Zwicker and U. T. Zwicker, "Audio Engineering and Psychoacoustics:
Matching Signals
to the Final Receiver, the Human Auditory System," J. Audio Eng. Soc., vol.
39, pp. 115 -126
(1991 March)
[ 2 ] T. Sporer and K. Brandenburg, "Constraints of Filter Banks Used for
Perceptual
Measurement," J. Audio Eng. Soc., vol. 43, pp. 107 - 115 (1995 March)
[ 3 ] J. Mourjopoulos and D. Tsoukalas, "Neural Network Mapping to
Subjective Spectra of
Music Sounds," J. Audio Eng. Soc., vol. 40, pp. 253 - 259 (1992 April)
[ 4 ] J. D. Johnston, "Transform Coding of Audio Signals Using Perceptual
Noise Criteria,"
IEEE Journal on Selected Areas in Communications, vol. 6, pp. 314 - 323
(1988 Feb.)
[ 5 ] M. K. Simon, J. K. Omura, R. A. Scholtz and B. K. Levitt, Spread
Spectrum Communications Handbook (McGraw-Hill, New York, 1994)
[ 6 ] R. L. Pickholtz, D. L. Schilling, and L. B. Milstein, "Theory of
Spread-Spectrum
Communications - A Tutorial," IEEE Transactions on Communications, vol.
COM-30, pp. 855
- 884 (1982 May)
[ 7 ] C. S. Lindquist, Adaptive & Digital Signal Processing with Digital
Filtering Applications
(Steward & Sons, Miami, 1989)
[ 8 ] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals
(Prentice Hall, New
Jersey, 1978)
[ 9 ] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models
(Springer-Verlag, Berlin, 1990)
[ 10 ] D. L. Nicholson, Spread Spectrum Signal Design: LPE & AJ Systems
(Computer Science
Press, Rockville, Maryland, 1988)
[ 11 ] C. Neubauer and J. Herre, "Digital Watermarking and Its Influence on
Audio Quality,"
presented at the 105th Convention of the Audio Engineering Society, J. Audio
Eng. Soc.
(Abstracts), vol. 46, p. 1041 (1998 November), preprint 4823.
[ 12 ] J. G. Roederer, The Physics and Psychophysics of Music
(Springer-Verlag, New York,
1995)
[ 13 ] J. G. Beerends and J. A. Stemerdink, "A Perceptual Speech-Quality
Measure Based on a
Psychoacoustic Sound Representation," J. Audio Eng. Soc., vol. 42, pp. 115 -
123 (1994 March)
[ 14 ] J. G. Beerends and J. A. Stemerdink, "A Perceptual Audio Quality
Measure Based on a
Psychoacoustic Sound Representation," J. Audio Eng. Soc., vol. 40, pp. 963 -
978 (1992
December)
[ 15 ] C. Colomes, M. Lever, J. B. Rault, Y. F. Dehery and G. Faucon, "A
Perceptual Model
Applied to Audio Bit-Rate Reduction," J. Audio Eng. Soc., vol. 43, pp. 233 -
239 (1995 April)
[ 16 ] T. Sporer, G. Gbur, J. Herre and R. Kapust, "Evaluating a Measurement
System," J. Audio
Eng. Soc., vol. 43, pp. 353 - 362 (1995 May)
[ 17 ] M. R. Schroeder, B. S. Atal and J. L. Hall, "Optimizing Digital
Speech Coders by
Exploiting Masking Properties of the Human Ear," J. Acoust. Soc. Am., vol.
66, pp. 1647 - 1652
(1979 Dec.)
[ 18 ] B. Paillard, P. Mabilleau, S. Morissette and J. Soumagne, "PERCEVAL:
Perceptual
Evaluation of the Quality of Audio Signals," J. Audio Eng. Soc., vol. 40,
pp. 21 - 31 (1992
Jan./Feb.)
[ 19 ] H. Burstein, "By the Numbers," Audio, vol. 74, pp. 43 - 48 (1990
Feb.)
[ 20 ] H. Burstein, "Approximation Formulas for Error Risk and Sample Size
in ABX Testing,"
J. Audio Eng. Soc., vol. 36, pp. 879 - 883 (1988 Nov.)
[ 21 ] S. Haykin, Communication Systems 3rd ed. (Wiley, New York, 1994)
[ 22 ] R. L. Shrader, Electronic Communication 5th ed. (McGraw-Hill, New
York, 1985)
[ 23 ] L. Boney, A. H. Tewfik and K. N. Hamdy, "Digital Watermarks for Audio
Signals," IEEE
Int. Conf. on Multimedia Computing and Systems, Hiroshima, Japan (June 1996)
[ 24 ] I. J. Cox, "Spread Spectrum Watermark for Embedded Signalling",
United States Patent
5,848,155 (1998 Dec)
[Figures 1-33: captions only; the graphics are not reproducible in this text version.]

Figure 1. Psychoacoustic auditory model
Figure 2. Masking curves in (a) linear and (b) logarithmic frequency scale [ 1 ]
Figure 3. Excitation level versus critical band rate for narrow band noises with various center frequencies [ 1 ]
Figure 4. Model of the spreading function, B(z), using Eq. ( 5 )
Figure 5. Basic spread spectrum communications system
Figure 6. Spectrum of signal BPSK
Figure 7. Spectrum of signal BPSK after spreading
Figure 8. DS/BPSK modulation
Figure 9. DS/BPSK modified
Figure 10. Uncoded DS/BPSK
Figure 11. Coded and uncoded signals before spreading
Figure 12. Coded and uncoded signals after spreading
Figure 13. Repeat code DS/BPSK system
Figure 14. Interleaver matrix with I=5 and H=15
Figure 15. FFT filter assuming additive signal and noise
Figure 16. Proposed system (watermark generation and embedding)
Figure 17. Proposed system (data recovery)
Figure 18. Watermark generation system
Figure 19. Interleaver matrix
Figure 20. Baseband system parameters
Figure 21. Passband system parameters (anti-aliasing)
Figure 22. Passband system with frequency limits LF and HF
Figure 23. Time domain signals: data bit stream, d(t); PN sequence, c(t); BPSK modulator, sin(t); and watermark audio signal, x(t)
Figure 24. Frame segmentation and watermark redundancy
Figure 25. (a) Audio signal s(t) and window signal w(t), (b) windowed signal sw(t), (c) magnitude of frequency representation Sw(jw), and (d) power spectrum Sp(jw) and energy per critical band Spz(z)
Figure 26. (a) Energy per critical band Spz(z), (b) spreading function B(z), and (c) spread energy per critical band Sm(z)
Figure 27. (a) Raw masking threshold Traw(z), and (b) normalized masking threshold Tnorm(z)
Figure 28. (a) Xwnew(z) before shaping, (b) after shaping with A = 0.4
Figure 29. Final masking threshold Tfinal(z), and the power spectrum of (a) Swnew(jw), (b) Xfinal(jw), and (c) OUT(jw)
Figure 30. Detection peak in det(t)
Figure 31. Watermark recovery system
Figure 32. MPEG layer 3 system performance
Figure 33. FM stereo (left channel) system performance
receiver to have access to the original audio signal.
The algorithm was implemented in a software system to create an encoder and
decoder, and
its performance was evaluated for diverse channels and audio signals. The
survival of the
watermark (number of correct bytes/second) was analyzed for different
configurations of the
encoding system. Each one of these configurations was tested for
transparency using an ABX
listening test and for different channels (i.e. AM Radio, FM stereo radio,
Mini Disc, MPEG layer
3, D/A - A/D conversion, etc).
1 PSYCHOACOUSTIC AUDITORY MODEL
An auditory model is an algorithm that tries to imitate the human hearing
mechanism. It uses
knowledge from several areas such biophysics and psychoacoustics.
From the many phenomena that occur in the hearing process, the one that is
the most important
for this model is "simultaneous frequency masking." The auditory model
processes the audio
information to produce information about the final masking threshold. The
final masking
threshold information is used to shape the generated audio watermark. This
shaped watermark is
ideally imperceptible for the average listener. To overcome the potential
problem of the audio
signal being too long to be processed all at the same time, and also extract
quasi-periodic
sections of the waveform, the signal is segmented in short overlapping
segments, processed and
added back together. Each one of these segments is called a "frame."
The steps needed to form a psychoacoustic auditory model are condensed in
Figure 1. The
first step is to translate the actual audio frame signal into the frequency
domain using the Fast
Fourier Transform. In the frequency domain the power spectrum, energy per
critical band and the
spread energy per critical band are calculated to estimate the masking
threshold. This masking
threshold is used to shape the "noise or watermark" signal to be
imperceptible (below the
threshold). Finally frequency domain output is translated into the time
domain and the next
frame is processed.
1.1 Short Time Fourier Transform (STFT)
The cochlea can be considered as a mechanical to electrical transducer, and
its function is to
make a time to frequency transformation of the audio signal. To be more
specific, the audio
information, in time, is translated in first instance into a
frequency-spatial representation inside
the basilar membrane. This spatial representation is perceived by the
nervous system and
translated into a frequency-electrical representation.
3
This phenomenon is modeled using the short time Fourier Transform (STFT).
The STFT uses
successive, overlapped windows from the time domain input signal.
1.2 Simultaneous Frequency Masking and Bark Scale
Simultaneous masking of sound occurs when two sounds are played at the same
time and one
of them is masked or "hidden" because of the other. The formal definition
says that masking
occurs when a test tone or "maskee" (usually a sinusoidal tone) is barely
audible in the presence
of a second tone or "masker." The difference in sound pressure level between
the masker and
maskee is called the "masking level."[ 1 ]
It is easier to measure the masking level for narrow band noise maskers
(with a defined
center frequency) and sinusoidal tone maskees. Figure 2 (a) and (b) display
some curves that
show the masking threshold for different narrow band noise maskers centered
at 70, 250, 1000
and 4000 Hz. The level of all the maskers is 60 dB. The broken line
represents the "threshold in
quiet." Average listeners will not hear any sound below this threshold.
Figure 2 (a) uses a linear
and (b) uses a logarithmic frequency scales.
The shape of all the masking curves is very different across the frequency
range in both
graphs. There are some similarities in the shape of the curves below 500 Hz
in the linear
frequency scale (a), and some similarities above 500 Hz in the logarithmic
frequency scale (b). A
more useful scale has been introduced that is known as "critical band rate"
or "Bark scale." The
concept of the Bark scale is based on the well-researched assumption [ 1 ]
that the basilar
membrane in the hearing mechanism analyzes the incoming sound through a
spatial-spectral
analysis. This is done in small sectors or regions of the basilar membrane
that are called "critical
bands." If all the critical bands are added together in a way that the upper
limit of one is the
lower limit of the next one, the critical band rate scale is obtained. Also
a new unit has been
introduced, the "Bark" that is by definition one critical band wide.
Figure 3 shows the same masking curves from Figure 2 in a Bark scale. Notice
that the shape
of the masking curves is almost identical across the frequency range.
Various approximations
may be used to translate frequency into a Bark scale [ 2 ]:
÷ ÷
ø
ö
ç ç
è
æ
÷ø
ö
çèæ
+
÷ø
ö
çè
= - æ -
2
1 1
7500
3.5tan
1000
0.76 *
13tan
f f
z ( 1 )
and [ 3 ]:
0.53
1960
26.81* -
+
=
f
f
z ( 2 )
where f is the frequency in Hertz and z is the mapped frequency in Barks.
Eq. ( 1 ) is more accurate, but the Eq. ( 2 ) is easier to compute. Figure 3
shows the excitation
level of several narrow band noises with diverse center frequencies in a
Bark scale.
1.3 Power Spectra
The first step in the frequency domain (linear, logarithmic or bark scales)
is to calculate the
power spectra of the incoming signal. This is calculated with:
{ } { }
2
2 2
( )
( ) Re ( ) Im ( )
w
w w w
Sw j
Sp j Sw j Sw j
=
= +
( 3 )
The energy per critical band, Spz(z) , is defined as:
4
å=
=
HBZ
LBZ
Spz z Sp j
w
( ) ( w) ( 4 )
Where: z = 1,2,.,total of critical bands Zt; LBZ and HBZ the lower and
higher frequencies in
the critical band z.
The power spectrum Sp(jw) and the energy per critical band Spz(z) are the
base of the
analysis in the frequency domain. They will be used to compute the spread
masking threshold.
1.4 Basilar Membrane Spreading Function
A model that approximates the basilar membrane spreading function, without
taking in
account the change in the upper slope is defined [ 3 ]:
B(z) =15.91+ 7.5(z + 0.474) -17.5 1+ (z + 0.474)2 ( 5 )
where z is the normalized Bark scale. Figure 4 shows B(z) .
The auditory model uses the information about the energy in each critical
band given by Eq. (
4 ) and uses Eq. ( 5 ) to calculate the spread masking across critical bands
Sm(z) . This is done
using:
Sm(z) = Spz(z) * B(z) ( 6 )
This operation is a convolution between the basilar membrane spreading
function and the total
energy per critical band. A true spreading calculation should include all
the components in each
critical band, but for the purposes of this algorithm, the use of the energy
per critical band
Spz(z) is a close approximation. Sm(z) can be interpreted as the energy per
critical band after
taking in account the masking occasioned by neighboring bands.
1.5 Masking Threshold Estimate
1.5.1 Masking Index
There are two different indexes used to model masking. The first one is used
when a tone is
masking noise (masker = tone, maskee = noise), and it is defined to be 14.5
+ z dB below the
spread masking across critical bands Sm(z) . In this case z is the center
frequency of the masker
tone using a bark scale. The second index is used when noise is masking a
tone (masker = noise,
maskee=tone), and is defined to be 5.5 dB below Sm(z) , regardless of the
center frequency [ 4 ].
1.5.2 Spectral Flatness Measure (SFM) and Tonality Factor a
The spectral flatness measure (SFM) is used to determine if the actual frame
is noise-like or
tone-like and then to select the appropriate masking index. The SFM is
defined as the ratio of the
geometric to the arithmetic mean of Spz(z) , expressed in dB as:
Zt
Zt
z
Zt
z
dB
Spz z
Zt
Spz z
SFM
1
1
1
( )
1
( )
10log10
ï ïþ
ï ïý
ü
ï ïî
ï ïí
ì
=
å
Õ
=
= ( 7 )
with Zt = total number of critical bands on the signal
5
The value of the SFM is used to generate the "tonality factor" that will
help to select the right
masking index for the actual frame. The tonality factor is defined in [ 3 ],
[ 4 ] as the minimum
of the ratio of the calculated SFM over a SMF maxima and 1:
÷ ÷ø
ö
ç çè
æ
= min , 1
dBmax
dB
SFM
SFM a ( 8 )
with SFM dB dB 60 max = - .
Therefore, if the analyzed frame is tone-like, the tonality factor a will be
close to 1, and if the
frame is noise-like, a will be close to 0. The tonality factor a is used to
calculate the masking
energy offset O(z), is defined as [ 3 ], [ 4 ]:
O (z) =a(14.5 + z) + (1-a)5.5 ( 9 )
The offset O(z)is subtracted from the spread masking threshold to estimate
the raw masking
threshold Traw(z) .
( ) ÷
ø
ö
çè
æ -
= 10
( )
log10 ( ) ( ) 10
O z
Sm z
Traw z ( 10 )
1.5.3 Threshold Normalization
The use of the spreading function B(z) increases the energy level in each
one of the critical
bands of the spectrum Sm(z) . This effect has to be undone using a
normalization technique, to
return Traw(z) to the desired level. The energy per critical band calculated
with Eq. ( 4 ) is also
affected by the number of components in each critical band. Higher bands
have more
components than lower bands, affecting the energy levels by a different
amount. The
normalization used [ 4 ] simply divides each one of the components of
Traw(z) by the number
of points in the respective band z P .
z P
Traw z
Tnorm z
( )
( ) = ( 11 )
Where:
z 1,2, .Zt
P number of points in each band z z
= ¼
=
1.5.4 Final Masking Threshold
After normalization, the last step is to take in to account the absolute
auditory threshold or
"hearing threshold." The hearing threshold varies across the frequency range
as stated in Zwicker
and Zwicker [ 1 ]. In the proposed auditory model the hearing threshold will
be simplified to use
the worst case threshold (the lowest). That is defined as a sinusoidal tone
of 4000 Hz with one bit
of dynamic range [ 4 ]. These values are chosen based on the data from
experimental research
that shows that the most sensitive range of the human ear is in the range of
2500 to 4500 Hz [ 1 ].
For a frequency of 4000 Hz, the measured sound intensity is 10-12 Watt/m2,
that equals a loudnes
of 0 phons at that frequency [ 12 ].The chosen amplitude (one bit) is the
smallest possible
amplitude value in a digital sound format. The hearing threshold is then
calculated with [ 4 ]:
TH = max(Pp( jw) ) ( 12 )
where:
( ) sin(2 4000 )
( ) power spectrum of the probe signal ( )
p t t
Pp j p t
p
w
=
=
6
The final threshold T(z) is:
T(z) = max(Tnorm(z),TH) ( 13 )
1.6 Noise Shaping Using the Masking Threshold
The objective of the auditory model is to find a usable masking threshold.
The final masking
threshold is always compared with the values of the power spectrum of the
signal Sp( jw) . This
can be interpreted as "below this threshold, the information is not relevant
for human hearing."
This means that if the frequency components that fall below the masking
threshold are removed;
the average listener will notice no difference between the original sound
signal and the altered
version.
Another very important consequence of this is that if these components are
not just discarded
but replaced with new components they will be, as before, inaudible for the
listener. This
assumes that the new components do not change the average energy
considerably in their critical
band. Let the frame with the new components be called N( jw) . The objective
is to use the final
masking threshold to select which components from Sp( jw) can be replaced
with components
from N( jw) . The components of N( jw) are shaped to stay below the final
masking threshold.
The final signal, that includes components from Sw( jw)and N( jw) , ideally
retains the
perceptual quality of the original signal for the average listener.
The following steps are used to remove the components from Sw( jw) , shape
the vector
N( jw) and mix them:
Calculate the "new" version of the sound signal (after removing some
components):
i
i
Sp j T z
Sw j Sp j T z
Swnew j
i
i i
i
z, according to component
1,2 number of components
0 ( ) ( )
( ) ( ) ( )
( )
w
w
w w
w
= K
î í ì
<
³
=
( 14 )
Remove the unneeded components in the N( jw) vector:
i
i
N j Sp j T z
Sp j T z
Nnew j
i i
i
i
z, according to component
1,2 number of components
( ) ( ) ( )
0 ( ) ( )
( )
w
w w
w
w
= K
î í ì
<
³
=
( 15 )
Calculate the power spectrum of Nnew( jw) :
2 Nnewp( jw) = Nnew( jw) ( 16 )
and then, the energy per critical band:
å=
=
HBZ
LBZ
Nnewpz z Nnewp j
w
( ) ( w) ( 17 )
Where: z = 1,2,.,total of critical bands Zt; LBZ and HBZ the lower and
higher frequencies in
the critical band z.
7
The shaping is done applying a factor z F to each critical band. These
factors are given by:
( )
LBZ HBZ z
z Zt
Nnew j
T j
F A z
to for each band
1,2
max ( )
( )
=
=
=
w
w
w
K
( 18 )
The coefficient A is used as the "gain of the noise signal". Varies from 0
to 1 and weights the
embedded noise below the threshold of masking. The factors z F are applied
using:
LBZ HBZ z
z Zt
Nfinal j Nnew j Fz
to for each band
1,2
( ) ( )
=
=
=
w
w w
K ( 19 )
The final step is to mix both spectrums, the altered Swnew( jw) and the
shaped Nfinal( jw) to
form the composite signal OUT( jw) :
OUT( jw) = Swnew( jw) + Nfinal( jw) ( 20 )
2 SPREAD SPECTRUM
One of the requirements of a watermarking algorithm is that the watermark
should resist
multiple types of removal attacks. A removal attack is considered as
anything that can degrade or
destroy the embedded watermark. Another factor to be considered is that the
masking threshold
of the actual audio signal determines the embedding of the watermark,
because the watermark is
embedded in the "spare components" found using the psychoacoustic auditory
model. From this
point of view, the watermark has to be the least intrusive to the audio
signal, and therefore, the
actual audio data can be seen as the main obstacle for a good watermarking
algorithm. This is
because the audio will use all the needed bandwidth and the watermark will
use what is left after
the auditory model analysis.
The desired watermarking technique should be resistant to degradation
because of:
- The used transmission channel: analog or digital.
- High-level wide-band noise (in this case, the "noise" is the actual audio
signal). This is
often related as "low signal to noise ratio".
- The use of psychoacoustic algorithms on the final watermarked audio.
A communication theory technique that meets the requirements is the "spread
spectrum
technique", as described thoroughly in Simon et al. [ 5 ] and Pickholtz et
al.
[ 6 ]. "Spread spectrum is a means of transmission in which the signal
occupies a bandwidth in
excess of the minimum necessary to send the information; the band spread is
accomplished by
means of a code which is independent of the data, and a synchronized
reception with the code at
the receiver is used for despreading and subsequent data recovery." [ 6 ]
In the following analysis, the process of generating a watermark that will
be embedded in an
audio signal is expressed in spread spectrum terminology. The original audio
signal will be
called "noise" and the bit stream that conforms the watermark sequence will
be the data signal.
The watermark sequence is transformed in a watermark audio signal and then
the audio signal
(noise) is added to it. This process of adding noise to a channel or signal
is called "jamming."
The objective of a jammer in a communication system is to degrade the
performance of the
transmission, exploiting knowledge of the communication system. In the
watermark algorithm
8
the audio signal (i.e. music) is considered the jammer, and it has much more
power than the
transmitted bit stream (watermark).
2.1 Basic Concepts
The primary challenge that a receiver must overcome is intentional jamming,
especially if the
jammer has much more power than the transmitted signal. Classical
communications theoretical
investigations about additive white Gaussian noise help to analyze the
problem. White Gaussian
noise is a signal which has infinite power spread uniformly over all
frequencies; but even under
these circumstances communication can be achieved due to the fact that on
each of the "signal
coordinates" the power of the noise component is limited (not infinite).
Therefore, if the noise
component in the signal coordinates is not too large, communication can be
made. This is usually
applied in a typical narrow-band signal, where just the noise components in
the signal bandwidth
are taken into account as possible factors that can do harm to the
communication. With this
knowledge, the best strategy to combat intentional jamming is to select
signal coordinates where
the jammer to signal ratio is the smallest possible.
Assume a communication link with many signal coordinates available to choose
from, and
only a small subset of these is used at any time. If the jammer can not
determine which subset is
being used, it is forced to jam all the coordinates and therefore, all its
power will be distributed
among all the coordinates, with little power in each of them. If the jammer
chooses to jam only
some of the coordinates, the power over each of them is larger, but the
jammer lacks the
knowledge of which coordinates to jam. The protection against the jammer is
enhanced, as more
signal coordinates are available to choose from.
Having a signal of bandwidth W and duration T, the number of coordinates
available is given
by:
î í ì
@
non - coherent signals
2 coherent signals
WT
WT
N ( 21 )
T is the time used to send a standard symbol. To make N larger when T is
fixed, two techniques
can be applied:
§ Direct sequence spreading (DS): this is the selected approach in this
algorithm.
§ Frequency hopping (FH)
The signals created with these techniques are called "spread spectrum
signals."
2.1.1 Models and Fundamental Parameters
The basic system is shown in Figure 5, with the following parameters:
Wss = Total spread spectrum signal bandwidth available
Rb = Data rate ( bits / second )
S = Signal power (at the input of the receiver)
J = Jammer power (at the input of the receiver)
Wss is defined as the total available spread spectrum bandwidth that could
be used by the
transmitter, but it is not guaranteed that it will be used during the actual
transmission. Neither is
it guaranteed that the spectrum will be continuous. Rb is the uncoded bit
data rate used during
transmission. The signal and the jammer powers S and J are the averaged
power at the receiver.
This does not change even if the jammer and/or the signal are pulsating.
9
2.1.2 Jammer Waveforms
The number of possible jammer waveforms that a jammer can apply to a
communication
system is infinite. The principal types include:
- Broadband Noise Jammer: Spreads Gaussian noise of a total power J evenly
over the total
frequency range of the spread bandwidth Wss.
- Partial Band Noise Jammer: Spreads noise of total power J evenly over a
frequency range of
bandwidth WJ, which is contained in the total spread bandwidth Wss. r is the
fraction of the
total spread spectrum bandwidth that is being jammed.
- Pulse Jammer: Transmits the jammer waveform during a fraction r of the
time, the average
power is J, but the peak power during transmission is higher.
2.2 Coherent Direct-Sequence Systems
Coherent direct-sequence systems use a pseudorandom sequence and a modulator
signal to
modulate and transmit the data bit stream. The main difference between the
uncoded and coded
versions is that the coded version uses redundancy and "scrambles" the data
bit stream before the
modulation is done and reverses the process at the reception. The
watermarking algorithm uses
the coded scheme, but the uncoded is studied because is easier to understand
and is the
foundation of the coded scheme.
2.2.1 Uncoded Direct-Sequence Spread Binary Phase-Shift-Keying
Uncoded Direct-Sequence Spread Binary Phase-Shift-Keying is known as Uncoded
DS/BPSK. It may be explained with a simple example. BPSK signals are often
expressed as:
( 1) , integer
2
( ) 2 sin 0
£ < + =
úû
ù
êë
= é +
nT t n T n
d
s t S t
b b
np
w
( 22 )
where
Tb is the data bit time ÷ ÷
ø
ö
ç çè
æ
b R
1
{ } n d is the sequence of data bits, with the possible values of 1 or -1;
and equal probability of occurrence.
Eq. ( 22 ) can be expressed as:
( 1) , integer
( ) 2 cos( ) 0
£ < + =
=
nT t n T n
s t d S t
b b
n w
( 23 )
BPSK can be seen as phase modulation in Eq.( 22 ) or amplitude modulation in
Eq. ( 23 ). The
spectrum of a BPSK signal is usually of the form shown in Figure 6. This is
a (sin 2 x) x2
function, and the first null bandwidth is b 1 T . This shows the minimum
bandwidth needed to
transmit the signal s(t) and to recover it at the receiver.
Spread spectrum theory requires the signal to be spread over a larger
spectrum than the
minimum needed for transmission. The spreading of the direct sequence is
done using a
pseudorandom (PN) binary sequence {c}. The values of this sequence are 1
or -1 and its speed is
N times faster than the{d} data rate. The time, Tc, of each bit on a PN
sequence is known as a
"chip" and is given by:
10
N
T
T b
c = ( 24 )
The direct sequence spread spectrum signal has the form:
[ ]
integer
0,1,2 1
( 1)
2 cos( )
( ) 2 sin / 2
0
0
=
= -
+ £ < + +
=
= -
+
+
n
k N
nT kT t nT k T
d c S t
x t S t d c
b c b c
n nN k
n nN k
K
w
w p
( 25 )
The signal is very similar to the common BPSK, except that the bit rate is N
times faster and the
power spectrum is N times wider, as shown in Figure 7. The processing gain
is given by:
N
R
W
PG
b
= SS = ( 26 )
WSS is the direct sequence spread spectrum bandwidth
c b T
N
T
1 1 = .
If the data function is defined as:
integer
( ) , ( 1)
=
= £ < +
n
d t d nT t n Tn b b ( 27 )
and the PN sequence is:
integer
( ) , ( 1)
=
= £ < +
k
c t c kT t k Tk c c ( 28 )
Eq. ( 25 ) can be expanded as:
[ ]
( ) ( ) 2 cos( )
( ) 2 sin ( ) ( ) / 2
0
0
c t d t S t
x t S t c t d t
w
w p
=
= +
( 29 )
Figure 8 shows the block diagram for the normal DS/BPSK modulation; and
Figure 9 shows
an equivalent model used in the next step of the analysis. Figure 11 shows
the signals d(t) and
c(t) and Figure 12 shows c(t)d(t) with N=6. From Figure 9, the equivalent
form of x(t) is given
by:
x(t) = c(t)s(t) ( 30 )
Where ( ) ( ) 2 cos( ) 0s t = d t S w t ( 31 )
This is the original BPSK signal. The property:
c2 (t) = 1 for all t ( 32 )
is the key point exploited to "recover" the original BPSK signal:
c(t)x(t) = s(t) ( 33 )
If the receiver possesses a copy of the PN sequence and can synchronize the
local copy with
the received signal x(t), it is able to de-spread the signal and recover the
transmitted data.
11
2.2.1.1 Constant Power Broadband Noise Jammer
A jammer, J(t), with constant power J is shown in Figure 10. The system is
also assumed to
have no noise from the transmission channel. An ideal BPSK demodulator is
assumed after the
received signal y(t) is multiplied by the PN sequence. The channel output
is:
y(t) = x(t) + J (t) ( 34 )
This is multiplied by the PN sequence c(t):
( ) ( ) ( )
( ) ( ) ( ) ( )
( ) ( ) ( )
s t c t J t
c t x t c t J t
r t c t y t
= +
= +
=
( 35 )
This term shows the original BPSK signal plus a noise given by c(t)J(t). The
output of the
conventional BPSK detector is then:
r d E n b = + ( 36 )
where: d is the data bit for the actual Tb second interval.
Eb=STb is the bit energy.
n is the equivalent noise component.
n is further defined as:
= ò b
b
T
c t J t t dt
T
n
0
( ) ( )cos( 0 )
2 w ( 37 )
The usual decision rule for BPSK is:
î í ì
- £
>
=
1, if 0
^ 1, if 0
r
r
d ( 38 )
2.2.2 Coded Direct Sequence Spread Binary Phase-Shift-Keying
Several types of coding techniques can be used that provide extra gain and
force the worst
case jammer to be a constant power jammer. Coding techniques usually require
the data rate to
be decreased or the bandwidth increased because of the redundancy inherent
to the coding. In
spread spectrum systems, coding does not require an increase of the
bandwidth or decrease of the
bit rate. These properties can be seen in a simple example. If k=2 (constant
length) the rate is
R=1/2 bits per coded symbol of convolutional code. For each data bit of the
sequence {d}, the
encoder generates two coded bits. For the kth transmission interval, the two
data bits are:
ak = (ak1,ak 2) ( 39 )
where:
î í ì -
¹
=
=
=
-
-
1
1
2
1
1
1
k k
k k
k
k k
d d
d d
a
a d
( 40 )
If Tb is the data bit time, each coded bit time is given by:
2
b
s
T
T = ( 41 )
Defining:
integer
( 1/ 2) ( 1)
( 1/ 2)
( )
2
1
=
î í ì
+ £ < +
£ < +
=
k
a k T t k T
a kT t k T
a t
k b b
k b b
( 42 )
12
In Figure 11 the uncoded data signal d(t), the PN sequence c(t) and the
coded signal a(t) are
shown for N=6. In Figure 12 the multiplicated signals d(t)c(t) and a(t)c(t)
are shown. With
ordinary BPSK, the coded signal a(t) would have twice the bandwidth of the
uncoded signal; but
after spreading with the PN sequence, the final bandwidth is the same as the
original. One of the
simplest coding schemes is the "repeat code." It sends m bits with the same
value, d, for each
data bit. The rate is then R=1/m bits per coded symbol. In this case, the
resulting coded bits are:
( , , , ) 1 2 m a = a a K a ( 43 )
Where: a d i m i = =1,2,K, ( 44 )
Also, each coded bit ai has a transmission time of:
m
T
T b
s = ( 45 )
It is very important to note that if m<N, the bandwidth of the spread signal
does not change. The
complete coded DS/BPSK system is shown in Figure 13.
The interleaver scrambles the bits in time at the transmission, and the
deinterleaver
reconstructs the data sequence at the receiver. After the interleaver, the
signal is BPSK
modulated and then multiplied by the PN sequence. At this point the
transmitted DS/BPSK
signal looks like the one in Eq. ( 30 ).
x(t) = c(t)s(t)
where s(t) is the common BPSK (with coding). The input at the receiver is
the same as that in
Eq. ( 34 ):
y(t) = x(t) + J (t)
After multiplication with c(t) (de-spreading), it becomes Eq. ( 35 ):
r(t) = s(t) + c(t)J (t)
The output of the detector after the de-interleaver is given by:
i m
Z n
m
E
r a i i
b
i i
= 1,2,K,
= +
( 46 )
where n1,n2,.nm are independent zero mean Gaussian random variables with
variance NJ/(2r). r
is the fraction of time that the pulse jammer is on, and Zi is the jammer
state:
î í ì
=
0 jammer off during transmission
1 jammer on during transmission
i
i
i a
a
Z ( 47 )
With probability equal to:
{ }
{ } r
r
= = -
= =
Pr 0 1
Pr 1
i
i
Z
Z
( 48 )
2.2.2.1 Interleaver and Deinterleaver
The idea of using an interleaver to scramble the data bits at transmission
and a deinterleaver
to unscramble the bits at reception causes the pulse jamming interference on
each affected data
bit to be independent from each other. In the ideal interleaving and
deinterleaving process, the
13
variables Z1,Z2,.,Zm become independent random variables. Assume that there
is no interleaver
and/or deinterleaver in the system shown in Figure 13. The output of the
channel is given by:
i m
Zn
m
E
r d i
b
i
= 1,2,K,
= +
( 49 )
and because there is no interleaver/deinterleaver:
i m
Z Z
a d
i
i
=1,2,K,
=
=
( 50 )
Also, it is assumed that the jammer was on during the whole data bit
transmission Tb. Because
there is no interleaver/deinterleaver, the optimum decision rule is:
å
å
=
=
= +
=
m
i
b i
m
i
i
d mE Z n
r r
1
1 ( 51 )
Eq. ( 38 ) is used as a decision rule:
î í ì
- £
>
=
1, if 0
^ 1, if 0
r
r
d
This bit error probability is the same for uncoded DS/BPSK; this means that
without a
interleaver/deinterleaver, there is no difference between uncoded systems
and simple repeat code
systems. Therefore, the use of a interleaver/deinterleaver is mandatory in
order to achieve a good
error probability measure against a pulse jammer.
Selection of the decision technique that determines the value of the coded
bits {r} requires
knowledge about the state of the channel. With an ideal
interleaver/deinterleaver, the output of
the channel is given by Eq. ( 46 ):
i m
Z n
m
E
r a i i
b
i i
= 1,2,K,
= +
where Z1,Z2,.,Zm and n1,n2,.,nm are considered to be independent random
variables. The
decoder takes r1,r2,.,rm and finds d1,d2,.,dm with possible values of 1
or -1. This analysis is
valid only for the instances where the state of the channel is unknown
(there is no information
regarding the state of the jammer signal).
2.2.2.2 Hard Decision Decoder
The hard decision decoder performs a binary decision on each coded bit
received:
i m
r
r
d
i
i
i
1,2, ,
1 0
^ 1 0
= K
î í ì
- £
>
=
( 52 )
The final decision in decoding the transmitted bit is:
14
ï ïî
ï ïí
ì
- £
>
=
å
å
=
=m
i
i
m
i
i
k
d
d
d
1
1
1 ^ 0
1 ^ 0
^ ( 53 )
2.2.2.3 Interleaver Matrix
The interleaving techniques will improve the performance in pulse jammer
environments
because it makes the noise components become statistically independent
variables. A block
interleaver with depth I=5 and interleaver span H=15 is shown in Figure 14.
The coded symbols
are written to the interleaver matrix along columns, while the transmitted
symbols are read out of
the matrix along rows. If the coded symbol sequence is x1,x2,x3. the
sequence that comes out of
the interleaver matrix is x1,x16,x31,x46,x61,. . At the receiver, the
deinterleaver performs the
inverse process, writing symbols into rows and reading them by columns. A
jamming pulse of
duration b symbols, with b £ I will result in these jammed symbols at the
deinterleaver output to
be separated at least by H symbols.
2.3 Synchronization of Spread-Spectrum Systems
Because a pseudorandom sequence PN is used at the transmitter to modulate
the signal, the
first requirement at the receiver is to have a local copy of this PN
sequence. The copy is needed
to de-spread the incoming signal. This is done by multiplying the incoming
signal by the local
PN sequence copy. To accomplish a good de-spreading, the local copy has to
be synchronized
with the incoming signal and the PN sequence that was used in the spreading
process.
The process of synchronization is usually performed in two steps: first, a
coarse alignment of
the PN sequence is done with a precision of less than a "chip." This is
called "PN acquisition."
After this, a fine synchronization takes care of the final alignment and
corrects the small
differences in the clock during transmission. This is called "PN tracking."
Theoretically,
acquisition and tracking can be done in the same step with a structure of
matched filters or
correlators searching with high resolution the incoming signal and comparing
it with the local
PN sequence.
2.3.1 Fast Fourier Transform (FFT) Scalar Filters
These filters are implemented in the frequency domain, and they use the Fast
Fourier
Transform (forward and backward). They work over a set of N samples (usually
in the frequency
domain) [ 7 ]. The block diagram of an adaptive digital filter is shown in
Figure 15.
Where: s(n) is the input signal
n(n) is the noise (unwanted) signal
r(n) is the input to the filter
R(m) is the frequency representation of the signal (n)
H(m) is the transfer function of the filter
C(m) is the output (in frequency domain) after the filter is applied
G(m) is the transfer function of the post-processing filter
P(m) is the output after the post-processing filter
p(n) is the output signal in the time domain
The following relationships are given:
15
( ) FFT ( ( ))
( ) ( ) ( )
( ) ( ) ( )
( ) FFT( ( ))
( ) ( ) ( )
p n 1 P m
P m G m C m
C m H m R m
R m r n
r n s n n n
= -
=
=
=
= +
( 54 )
2.3.1.1 High-resolution Detection FFT Scalar Filter
The high-resolution detection filter outputs a peak when the desired signal
s(n) and noise
n(n) are applied to it. The transfer function is given by:
2 2 ( ) ( )
*( )
( )
S m N m
S m
H m
+
= ( 55 )
This version of high-resolution detection assumes that the noise and the
signal are uncorrelated
(orthogonal). The output of this filter C(m) must be transformed to the time
domain to detect the
level and the position of the peak on the output vector c(n). This position
can be interpreted as
the exact point where the desired signal starts within the processed set of
samples N.
2.3.1.2 Adaptive Filtering
Adaptive filters require a learning process and use adaption techniques to
form the transfer
function of the desired filter H(m). The components of the transfer function
are updated
periodically with actual values taken from the signal or with estimates made
using stored data.
The class 1/3 high-resolution detection filter is given by [ 7 ]:
2 ( )
*( )
( )
R m
S m
H m = ( 56 )
where S*(m) is the conjugate of the spectrum of the desired signal to detect
and R(m) is the
magnitude of the spectrum of the actual input of the system.
The expression R(m) is used to denote the "smoothing" process. This process
is done to
estimate the average spectrum of the signal plus noise from the actual input
of the system. The
smoothing used is called "inner block averaging" or "frequency domain
averaging" and it is
defined as:
or ( ) ( )
( ) ( )
2
1
( )
r r t b t
R j R j B j
b =
= w * w
p
w
( 57 )
The frequency averaging window B(jw) is convolved with the spectrum of the
input signal. This
is equivalent to a temporal weighting of the input r(t) by b(t) in the time
domain. The window is
usually selected to be a percentage of the input vector length.
3 PROPOSED SYSTEM
Different systems have been applied to watermarking of audio signals. All of
them are
classified as "steganographic systems" because they deal with the concept of
hiding data within
the signal. Boney et al. [ 23 ] proposed a system where a PN sequence was
filtered using a filter
that approached the masking characteristics of the human auditory system in
the frequency and
16
time domains. Some other techniques have been imported from the fields of
video and still
image watermarking. Cox [ 24 ] proposes a multiplatform system capable of
extract a
pseudorandom sequence without the use of the original unwatermarked data.
The watermarking algorithm proposed in this paper mixes the psychoacoustic
auditory model
and the spread spectrum communication technique to achieve its objective. It
is comprised of
two main steps: first, the watermark generation and embedding and second,
the watermark
recovery. The watermark generation and embedding process is shown in Figure
16. A bit stream
that represents the watermark information is used to generate a noise-like
audio signal using a set
of known parameters to control the spreading. At the same time, the audio
(i.e. music) is
analyzed using a psychoacoustic auditory model. The final masking threshold
information is
used to shape the watermark and embed it into the audio. The output is a
watermarked version of
the original audio that can be stored or transmitted.
The watermark recovery is shown in Figure 17. The input is the watermarked
audio after
transmission (i.e. music + noise, low quality, etc). An auditory
psychoacoustic model is used to
generate a residual. At the same time as the known parameters are used to
generate the header of
the watermark. Using an adaptive high-resolution filter, all the residual is
scanned to find all the
occurrences of the known header and therefore the initial position of each
possible watermark.
After this, the same known parameters used to generate the header are used
to de-spread and
recover the watermark.
3.1 WATERMARK GENERATION AND EMBEDDING
3.1.1 WATERMARK GENERATION
The objective of the watermark generation is to generate a watermark audio
signal x(t) that
contains the watermark bit stream data. This watermark signal can be
transmitted and then
processed for data recovery. The technique used to generate the watermark
signal x(t) is "coded
DS/BPSK spread spectrum." The process is condensed in Figure 18.
Where:
{w} = the original digital bit stream (watermark)
m = the repetition code factor
{wR} = the watermark after the coding process (repeat code)
I, H = width and length of the interleaver matrix
{wI} = the watermark after the interleaver process
{header} = the header sequence
{d} = {header} + {wI} = sequence to be spread and transmitted
f0 = frequency used by the BPSK modulator
The process can be explained with a simple example: Let {w} be the watermark
bit stream.
All the bit streams used are bipolar (value 1 or -1). Defining {w} with a length of 16 bits as the sequence:

{w} = { 1 1 -1 1 -1 -1 1 -1 1 1 -1 1 1 1 -1 -1 }

Using Eq. ( 43 ) to generate the repeat code, and choosing m=3, the {wR} sequence is:

{wR} = {  1  1  1    1  1  1   -1 -1 -1    1  1  1   -1 -1 -1   -1 -1 -1
          1  1  1   -1 -1 -1    1  1  1    1  1  1   -1 -1 -1    1  1  1
          1  1  1    1  1  1   -1 -1 -1   -1 -1 -1 }
The next step is to perform interleaving. To do this, the values of the interleaving matrix are chosen; in this case, I=5, H=10 (see Figure 14). The resulting matrix is shown in Figure 19. The last two spaces are padded with 1's. Using the interleaving matrix, the output sequence {wI} is:

{wI} = {  1  1  1 -1  1    1  1 -1 -1  1    1 -1 -1 -1 -1    1 -1 -1  1 -1
          1 -1  1  1 -1    1 -1  1  1 -1   -1 -1  1  1 -1   -1 -1  1  1 -1
         -1  1  1  1  1    1  1  1  1  1 }
The selected header is a sequence usually composed of 1's.
{header} = { 1 1 1 1 1 1 1 1 1 1 }
The final data sequence {d} is obtained by concatenating the {header} and the {wI}:
{d} = {header} + {wI}

{d} = {  1  1  1  1  1    1  1  1  1  1
         1  1  1 -1  1    1  1 -1 -1  1    1 -1 -1 -1 -1    1 -1 -1  1 -1
         1 -1  1  1 -1    1 -1  1  1 -1   -1 -1  1  1 -1   -1 -1  1  1 -1
        -1  1  1  1  1    1  1  1  1  1 }
The PN sequence {c} can be generated by any means. Usually this is done
using a
pseudorandom number generator. In this case, the PN sequence is assumed to
be long enough to
spread a complete bit stream (header and data) without repeating any portion
of it. The important
factor is that the transmitter and the receiver must have a copy of the
whole PN sequence{c}.
This sequence is ideally uncorrelated with the {d} sequence, and has the
form:
{c} = { 1 -1 1 1 -1 -1 1 -1 1 1 -1 1 ... }
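As a rough sketch, the example above can be reproduced in a few lines of NumPy (the helper names and the use of numpy.repeat/reshape are illustrative, not the implementation used in the paper):

    import numpy as np

    w = np.array([1, 1, -1, 1, -1, -1, 1, -1, 1, 1, -1, 1, 1, 1, -1, -1])

    m = 3                                  # repetition code factor
    wR = np.repeat(w, m)                   # repeat code, Eq. (43)

    I, H = 5, 10                           # interleaver width and length
    pad = I * H - len(wR)                  # last two cells padded with 1's
    matrix = np.concatenate([wR, np.ones(pad, dtype=int)]).reshape(I, H).T
    wI = matrix.flatten()                  # written into columns, read by rows

    header = np.ones(10, dtype=int)        # header: a sequence of 1's
    d = np.concatenate([header, wI])       # {d} = {header} + {wI}, 60 bits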
3.1.1.1 Spread Spectrum Parameter Selection
Audio signals are usually considered to be baseband signals [ 21 ]. The
described spread
spectrum technique can be applied to passband systems (with f0>0) or
baseband systems (f0=0)
without losing generality. The selection of all the parameters is based on
the considerations of
how the overall watermarked audio signal will be transmitted or stored. The
frequency response
of those systems determines which frequencies are likely to be present at
the receiver. Let a
baseband bandlimited signal, with no modulation (f0=0) have the magnitude
spectrum shown in
Figure 20. With amplitude modulation (f0>0), the spectrum will have the form
shown in Figure
21. FS is the sampling frequency of the system. To avoid aliasing because of the use of modulation, the modulation frequency should be:

Rc ≤ f0 ≤ FS/2 - Rc    ( 58 )
If a system possesses a lower frequency limit LF and/or an upper frequency limit HF, the modulation frequency f0 has to be selected in a way that the sidebands fall between the lower and upper limits, as shown in Figure 22. If a sideband falls outside of these limits, aliasing or data loss could result. Taking this into account, the selection of parameters should be done using:

LF + Rc ≤ f0 ≤ HF - Rc,  with LF ≥ 0 and HF ≤ FS/2    ( 59 )
The parameters selected must satisfy Eq. ( 58 ) and Eq. ( 59 ), along with the following relationships:

Rd = data bits per second
m = repetition code factor
N = spreading factor, Eq. ( 26 )
Rb = Rd*m = coded bits per second
Tb = 1/Rb = time of each coded bit
Rc = N*Rb = PN sequence bits per second
Tc = Tb/N = time of each PN bit or "chip"
Assuming a frequency response similar to FM Radio [ 22 ] with LF = 50 Hz and
HF = 15000
Hz, for the actual example, a set of spread spectrum parameters that satisfy
all the requirements
is:
N = 3
m = 3
Rd = 100 bits/sec
Rb = 300 bits/sec
Rc = 900 bits/sec
f0 = 3500 Hz
Note that N and m are selected with small values for this example. The modulation is done using Eq. ( 31 ):

s(t) = sqrt(2S)·d(t)·cos(w0·t),  w0 = 2π·f0

The spreading is done using Eq. ( 30 ):

x(t) = c(t)s(t)
The output of the system is the watermarked audio waveform x(t) shown in
Figure 23.
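A minimal sketch of these two steps with the example parameters is given below (fs, the signal power S, and the pseudorandom generator are illustrative assumptions, and {d} is a stand-in bipolar sequence):

    import numpy as np

    fs, Rc, f0, N, S = 44100, 900, 3500.0, 3, 1.0
    sps = fs // Rc                               # samples per PN chip

    rng = np.random.default_rng(0)
    d = rng.choice([-1, 1], size=60)             # stand-in for {d} = {header}+{wI}
    c = rng.choice([-1, 1], size=len(d) * N)     # PN sequence {c} at the chip rate

    d_t = np.repeat(d, N * sps).astype(float)    # d(t): each coded bit spans N chips
    c_t = np.repeat(c, sps).astype(float)        # c(t): pseudo-noise waveform

    t = np.arange(len(d_t)) / fs
    s_t = np.sqrt(2 * S) * d_t * np.cos(2 * np.pi * f0 * t)  # BPSK, Eq. (31)
    x_t = c_t * s_t                                          # spreading, Eq. (30)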
3.1.2 FRAME SEGMENTATION
To overcome the potential problem of the audio signal or the watermark signal being too long to be processed using a single FFT, the signal is segmented into short overlapping segments, processed, and added back together [ 8 ]. Another consideration for the watermark algorithm is that the audio signal has to be longer than the watermark signal. Therefore, the watermark can be repeated several times during the duration of the audio signal. This redundancy is one of the important features of the watermarking algorithm. Figure 24 shows audio and watermark signals that will be segmented. The watermark is repeated several times. If the total length of the audio signal is LENGTH samples, the desired length of the analysis frame is BLOCK samples, and the overlap between consecutive frames is OVERLAP samples, the total number of frames FRAMES is given by:

FRAMES = (LENGTH - OVERLAP) / (BLOCK - OVERLAP)    ( 60 )
In Figure 24 two equal-length frames were selected to be processed: one from the audio signal and the other from the respective point in the watermark signal. The last frame is zero-padded if it is shorter than BLOCK samples. These padded samples are discarded in post-processing. From this point on, all processes described are applied to the audio or watermark signal frames, not the entire signal.
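Eq. ( 60 ) can be checked with a small helper (the 4096-sample frame and 50% overlap below are illustrative values, not the ones used in the tests):

    def num_frames(length, block, overlap):
        # Total number of analysis frames, Eq. (60); integer division assumed
        return (length - overlap) // (block - overlap)

    # e.g. a 26-second signal at 44.1 kHz, 4096-sample frames, 50% overlap
    print(num_frames(26 * 44100, 4096, 2048))    # -> 558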
3.1.3 FREQUENCY REPRESENTATION
The Short Time Fourier Transform (STFT) is used to acquire a frequency representation of the actual frames. Before computing the STFT, a Hamming window is applied to both signals [ 7 ], [ 8 ]. This improves the representation of the signal in the frequency domain by reducing spectral leakage. If s(t) is the actual audio signal frame and x(t) the actual watermark signal frame, then the windowing is done using:
sw(t) = s(t)w(t) ( 61 )
xw(t) = x(t)w(t) ( 62 )
The Hamming window is defined as:

w(n) = 0.54 - 0.46·cos(2πn / BLOCK),  n = 1, 2, ..., BLOCK
w(t) = w(nT),  T = sampling period    ( 63 )
The frequency representation of the audio frame is:
Sw( jw) = FT{sw(t)} ( 64 )
and the watermark frame:
Xw( jw) = FT{xw(t)} ( 65 )
The power spectrum is found using Eq. ( 3 ):

Sp(jw) = |Sw(jw)|^2    ( 66 )
The indices of the actual frequency representations have to be mapped to the
Bark scale.
Once this index mapping is done, the representation in the critical band
scale is formed by
mapping the components to the respective position on the critical band axis.
The relationship
between each component index, i, and the corresponding frequency, fi, that it represents is given by:

fi = (i - 1)·FS / BLOCK,  i = 1, 2, ..., BLOCK/2,  FS = sampling frequency    ( 67 )
The relationship between each frequency fi and the bark scale or critical band scale zi is found using Eq. ( 1 ):

zi = 13·arctan(0.76·fi / 1000) + 3.5·arctan( (fi / 7500)^2 )
This relationship between each component index i and the frequency fi or
critical band zi that it
represents can be calculated at the beginning of the algorithm and stored in
a table. The energy
per critical band is calculated using Eq. ( 4 ):

Spz(z) = Σ Sp(jw),  summed over w = LBZ to HBZ

where z = 1, 2, ..., Zt (the total number of critical bands), and LBZ and HBZ are the lower and upper frequency limits of critical band z. Figure 25 (a) shows the original audio frame s(t) in
the time domain and the
shape of the Hamming window w(t); (b) shows the sw(t) frame after the
windowing process; (c)
shows the magnitude of Sw( jw) , and (d) shows the power spectrum Sp( jw)
and the energy per
critical band Spz(z) .
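The chain of Eqs. ( 61 )-( 67 ) and ( 4 ) can be sketched as follows (a minimal NumPy sketch; the helper names are ours, band indices start at 0 here rather than 1, and only the positive-frequency half of the spectrum is mapped):

    import numpy as np

    def bark(f):
        # Critical-band (Bark) mapping, Eq. (1)
        return 13 * np.arctan(0.76 * f / 1000.0) + 3.5 * np.arctan((f / 7500.0) ** 2)

    def band_energies(s, fs):
        block = len(s)
        sw = s * np.hamming(block)               # windowing, Eqs. (61), (63)
        Sw = np.fft.fft(sw)                      # frequency representation, Eq. (64)
        Sp = np.abs(Sw) ** 2                     # power spectrum, Eq. (66)
        fi = np.arange(block // 2) * fs / block  # index-to-frequency map, Eq. (67)
        zi = np.floor(bark(fi)).astype(int)      # component -> critical band index
        Spz = np.zeros(zi.max() + 1)
        for z in range(len(Spz)):                # energy per critical band, Eq. (4)
            Spz[z] = Sp[:block // 2][zi == z].sum()
        return Sw, Sp, Spz, zi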
3.1.4 BASILAR MEMBRANE SPREADING FUNCTION
The basilar membrane spreading function determines how much of the energy of
each critical
band is contributed to the neighboring bands. The spreading function B(z) is
calculated using Eq.
( 5 ):

B(k) = 15.91 + 7.5(k + 0.474) - 17.5·sqrt(1 + (k + 0.474)^2),  k = ..., -2, -1, 0, 1, 2, ...
The spreading across bands is computed by the convolution of the spreading
function B(z) and
the energy per critical band Spz(z), using Eq. ( 6 ):
Sm(z) = Spz(z) * B(z)
Figure 26 (a) shows the energy per critical band Spz(z) , (b) shows the
spreading function B(z)
for 9 points, and (c) shows the spread energy per critical band Sm(z).
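A sketch of Eqs. ( 5 )-( 6 ) follows; treating B(k) as a level in dB and converting it to the energy domain before the convolution is our assumption, and Spz is a stand-in for the energy per critical band:

    import numpy as np

    k = np.arange(-4, 5)                         # a 9-point spreading function
    B_dB = 15.91 + 7.5 * (k + 0.474) - 17.5 * np.sqrt(1 + (k + 0.474) ** 2)
    B = 10.0 ** (B_dB / 10.0)                    # dB -> energy domain (assumed)

    Spz = np.random.default_rng(0).random(25)    # stand-in band energies, Zt = 25
    Sm = np.convolve(Spz, B, mode='same')        # Sm(z) = Spz(z) * B(z), Eq. (6)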
3.1.5 MASKING THRESHOLD ESTIMATE
The Spectral Flatness Measure (SFM) of the actual audio frame Sw(jw) is taken using Eq. ( 7 ):

SFM_dB = 10·log10( [ Π(z=1..Zt) Spz(z) ]^(1/Zt) / [ (1/Zt)·Σ(z=1..Zt) Spz(z) ] )

with Zt = total number of critical bands in each frame. In other words, the SFM is the ratio of the geometric mean to the arithmetic mean of the energy per critical band.
The energy per critical band Spz(z) is used rather than the spread energy per critical band Sm(z) to avoid false results due to the smoothing of the signal. The tonality factor a is then calculated using Eq. ( 8 ):

a = min( SFM_dB / SFM_dBmax , 1 )

with SFM_dBmax = -60 dB.
The masking energy offset O(z) is then calculated using Eq. ( 9 ):

O(z) = a(14.5 + z) + (1 - a)·5.5

The raw masking threshold, Traw(z), is calculated with Eq. ( 10 ):

Traw(z) = 10^( log10( Sm(z) ) - O(z)/10 )
The raw masking threshold is normalized using Eq. ( 11 ):

Tnorm(z) = Traw(z) / Pz

where z = 1, 2, ..., Zt and Pz is the number of points in band z.
To calculate the final masking threshold T it is necessary to first calculate the hearing threshold (or threshold in quiet) TH. It is defined using a sinusoidal probe tone of 4000 Hz with one bit of dynamic range. Using Eq. ( 12 ):

TH = max( Pp(jw) )

where Pp(jw) is the power spectrum of the probe signal p(t) = sin(2π·4000·t). Then the final masking threshold T is calculated using Eq. ( 13 ):

T(z) = max( Tnorm(z), TH ),  z = 1, 2, ..., Zt
Figure 27 (a) shows the raw masking threshold Traw(z) and (b) shows the
normalized
threshold Tnorm(z).
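The whole threshold estimate, Eqs. ( 7 )-( 13 ), can be condensed into one sketch (function and argument names are ours; TH is the hearing threshold obtained from the 4 kHz probe tone):

    import numpy as np

    def masking_threshold(Spz, Sm, points_per_band, TH):
        Zt = len(Spz)
        geo = np.exp(np.mean(np.log(Spz + 1e-12)))       # geometric mean
        ari = np.mean(Spz)                               # arithmetic mean
        SFM_dB = 10 * np.log10(geo / ari)                # Eq. (7)
        alpha = min(SFM_dB / -60.0, 1.0)                 # tonality, Eq. (8)
        z = np.arange(1, Zt + 1)
        O = alpha * (14.5 + z) + (1 - alpha) * 5.5       # offset, Eq. (9)
        Traw = 10.0 ** (np.log10(Sm + 1e-12) - O / 10)   # Eq. (10)
        Tnorm = Traw / points_per_band                   # Eq. (11)
        return np.maximum(Tnorm, TH)                     # Eqs. (12)-(13)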
3.1.6 WATERMARK SPECTRAL SHAPING
The final masking threshold T is used to determine which components of the
audio signal
Sw( jw) can be removed without affecting the perceptual quality of the
signal. The power
spectrum Sp( jw) is compared against the final masking threshold T. The
components that fall
below it are removed in Sw( jw) . The new frame with only the components
above the threshold
is called Swnew(jw). Eq. ( 14 ) is used:

Swnew(jwi) = Sw(jwi)  if Sp(jwi) ≥ T(z)
Swnew(jwi) = 0        if Sp(jwi) < T(z)

for i = 1, 2, ..., number of components, with z the critical band corresponding to component i.
Then the unneeded components of the watermark signal Xw( jw) are removed.
These
components correspond to the non-removed components in Sw(jw). Eq. ( 15 ) is used:

Xwnew(jwi) = 0         if Sp(jwi) ≥ T(z)
Xwnew(jwi) = Xw(jwi)   if Sp(jwi) < T(z)

for i = 1, 2, ..., number of components, with z the critical band corresponding to component i.
The factors that will shape the new watermark Xwnew(jw) are found using Eq. ( 18 ):

Fz = A·sqrt( T(z) ) / max( |Xwnew(jw)| ),  w = LBZ to HBZ for each band z,  z = 1, 2, ..., Zt
The square root of the final threshold is divided by the maximum magnitude component found in the energy of the new watermark in each critical band. Each one of these factors is scaled by the gain A, which varies from 0 to 1 and controls the overall magnitude of the watermark signal relative to the audio signal.
Each one of the components in each critical band z is scaled by the corresponding factor using Eq. ( 19 ):

Xfinal(jw) = Xwnew(jw)·Fz,  w = LBZ to HBZ for each band z,  z = 1, 2, ..., Zt
Figure 28 shows the final masking threshold and the watermark signal before
shaping (a) and
after shaping (b). Note that the watermark falls below the masking threshold. The factor A controls how much gain the watermark has relative to the masking threshold (A is a value from 0 to 1).
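The shaping of Eqs. ( 14 ), ( 15 ), ( 18 ) and ( 19 ) can be sketched per frame as follows (illustrative names; T_per_bin is the final threshold mapped onto each spectral component, and zi maps each component to its critical band):

    import numpy as np

    def shape_watermark(Sw, Xw, Sp, T_per_bin, zi, A):
        keep = Sp >= T_per_bin
        Swnew = np.where(keep, Sw, 0)            # Eq. (14): audio keeps maskers
        Xwnew = np.where(keep, 0, Xw)            # Eq. (15): watermark fills holes
        Xfinal = np.zeros_like(Xw)
        for z in np.unique(zi):                  # per critical band
            band = zi == z
            peak = np.abs(Xwnew[band]).max()
            if peak > 0:                         # Fz = A*sqrt(T)/max|Xwnew|, Eq. (18)
                Fz = A * np.sqrt(T_per_bin[band].max()) / peak
                Xfinal[band] = Xwnew[band] * Fz  # Eq. (19)
        return Swnew, Xfinal                     # combined later via Eq. (20)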
3.1.7 AUDIO AND WATERMARK SIGNAL COMBINATION
The final output OUT( jw) is the sum of the new audio, Swnew( jw) , and the
final
watermark Xfinal(jw). This is given by Eq. ( 20 ):
OUT( jw) = Swnew( jw) + Xfinal( jw)
Figure 29 shows the final masking threshold Tfinal(z), and the power
spectrum of (a) Swnew(jw),
(b) Xfinal(jw), and (c) OUT(jw).
3.1.8 TRANSFORMATION TO THE TIME DOMAIN
The Inverse Fourier Transform is used to convert the frequency domain
information back to
the time domain.
out(t) = IFT{OUT( jw)}
This output frame out(t) is added at the corresponding point to the total time-domain output output(t). The next frames of audio and watermark signals are taken, and the process is repeated.
3.2 DATA RECOVERY
The watermarked audio signal is intended to be transmitted through a diverse
number of
channels. In some cases, the channel will introduce noise, convert several
times from digital to
analog and analog to digital, or even use a psychoacoustic auditory model to
process the audio
signal. The watermark bit stream should survive the transmission and be
recoverable.
A very important characteristic is that the developed system does not require access to the original audio signal (before watermarking) to extract the watermark at the receiving end. The process of recovery uses the psychoacoustic auditory model, but in this case the goal is to remove all the audio components that have a low probability of belonging to the watermark signal. This means
that the masking threshold is calculated and the components above it are
removed. The final
signal is the "residual." This residual is then analyzed to find the
possible points where the
watermark is present. If some criterion is applied, the majority of the
false points detected can be
eliminated (i.e. rejecting points too close to fit a watermark).
Synchronization and recovery of
the watermark bit stream are then performed.
3.2.1 MASKING THRESHOLD AND RESIDUAL SIGNAL
The watermarked audio signal after the transmission is symbolized as s2(t).
The process
described in sections 3.1.2 to 3.1.5 is used to calculate the frames sw2(t), the frequency representation Sw2(jw), and the masking threshold T2, respectively. The residual signal R(jw) is defined as the signal composed of the components below the masking threshold. Eq. ( 14 ) can be changed to:

R(jwi) = Sw2(jwi)  if Sp2(jwi) ≤ T2(z)
R(jwi) = 0         if Sp2(jwi) > T2(z)

for i = 1, 2, ..., number of components, with z the critical band corresponding to component i.    ( 68 )
3.2.2 RESIDUAL EQUALIZATION
The spectrum of the residual R(jw) is then shaped to be flat. Eq. ( 18 ) can be modified to shape all the maximum components of each band to equal levels. The factors are found using:

Fz = 1 / max( |R(jw)| ),  w = LBZ to HBZ for each band z,  z = 1, 2, ..., Zt    ( 69 )
Each one of the components in each critical band z is scaled by the corresponding factor Fz using Eq. ( 19 ):

Rfinal(jw) = R(jw)·Fz,  w = LBZ to HBZ for each band z,  z = 1, 2, ..., Zt
3.2.3 TIME DOMAIN RESIDUAL
The residual is taken back to the time domain using the Inverse Fourier
Transform IFT.
r(t) = IFT{Rfinal( jw)}
The time domain r(t) frame is added to the total time domain residual signal
residual(t) at the
point specified by the frame segmentation step. The next frame is then
processed.
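Residual generation and equalization, Eqs. ( 68 ), ( 69 ) and ( 19 ), can be sketched as one per-frame step (names are illustrative; the inputs are full FFT frames of the received audio):

    import numpy as np

    def residual_frame(Sw2, Sp2, T2_per_bin, zi):
        R = np.where(Sp2 <= T2_per_bin, Sw2, 0)  # keep sub-threshold bins, Eq. (68)
        Rfinal = np.zeros_like(R)
        for z in np.unique(zi):                  # flatten each band, Eq. (69)
            band = zi == z
            peak = np.abs(R[band]).max()
            if peak > 0:
                Rfinal[band] = R[band] / peak    # Fz = 1/max|R|, then Eq. (19)
        return np.real(np.fft.ifft(Rfinal))      # time-domain residual r(t)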
3.2.4 SYNCHRONIZATION WITH WATERMARK HEADER
To be able to synchronize and to have a good de-spreading of the watermark
signal, it is
necessary to have knowledge of the parameters used at the generation of the
watermark signal,
such as f0, Tb, m, H, I, N, {header},{c}, etc.
3.2.4.1 header(t) Signal Generation
The first step is to generate a header(t) waveform signal using the process
of section 3.1.1,
except that only the {header} sequence is used as the input sequence. This
audio signal will be
used to locate the exact positions of the watermark signals in the
residual(t) signal. Frame
segmentation as explained in section 3.1.2 is also required in order to
analyze the whole
residual(t) signal. The parameters for the frame segmentation are chosen to
have up to two
header(t) signals in each frame. Therefore, BLOCK is equal to twice the
number of samples in
header(t), and OVERLAP is equal to one half the number of samples in
header(t). The resulting
frame taken from residual(t) with BLOCK length is called r(t).
3.2.4.2 header(t) Position Detection
Eq. ( 56 ) describes an adaptive high-resolution filter that can be used to
detect the presence
of header(t) in the r(t) frame and therefore, all the occurrences of
header(t) in the residual(t)
audio signal.
H(jw) = HEADER*(jw) / |R(jw)|^2

where R(jw) = FFT( r(t) ) and HEADER(jw) = FFT( header(t) ).
The denominator of the filter is the smoothed version of |R(jw)|^2. Smoothing is done using Eq. ( 57 ), where b(t) is a Hanning window of width 10% of the frame length. The output of the filter applied to R(jw) is:

DET(jw) = R(jw)·HEADER*(jw) / |R(jw)|^2
This result is transformed to the time domain to be analyzed.
det(t) = real(IFFT(DET( jw)))
A typical output of the filter, det(t), is shown in Figure 30. The peak
shows the position in
samples where the header(t) signal starts in the frame r(t). This detection
is done for all the
frames in the residual(t) signal, and all the positions of the peaks are
stored for further analysis.
A proposed criterion of analysis is to determine the minimum distance between peaks to decide which ones are most likely to represent the start of a watermark signal.
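The detection step can be sketched as follows (a minimal sketch; the 10% smoothing width follows the text, while the function names and the peak criterion are ours):

    import numpy as np

    def detect_header(frame, header_t, avg_frac=0.1):
        n = len(frame)
        R = np.fft.fft(frame)                    # R(jw) = FFT(r(t))
        HEADER = np.fft.fft(header_t, n)         # HEADER(jw) = FFT(header(t))
        b = np.hanning(max(1, int(avg_frac * n)))          # smoothing, Eq. (57)
        Rsmooth = np.convolve(np.abs(R) ** 2, b / b.sum(), mode='same')
        DET = R * np.conj(HEADER) / (Rsmooth + 1e-12)      # filter output
        det = np.real(np.fft.ifft(DET))          # det(t)
        return int(np.argmax(det))   # sample where header(t) starts in the frame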
3.2.5 WATERMARK DE-SPREADING
For each peak position found in the residual(t), a selected frame y(t) with
the same length as
the watermark signal is processed. This process is shown in Figure 31. Using
Eq. ( 35 ):
r(t) = c(t) y(t)
Demodulation is performed using Eq. ( 31 ):

g(t) = r(t)·sqrt(2/Tb)·cos(2π·f0·t)
To estimate the bit stream:

ri = ∫ g(t) dt  over (i-1)Ts to iTs,  i = 1, 2, ..., total bits in the bit stream    ( 70 )
The decision rule, to form a recovered bit stream {d^}, is given by Eq. ( 38 ):

d^i = 1   if ri > 0
d^i = -1  if ri ≤ 0

for i = 1, 2, ..., total bits in the bit stream. After this decision, the {header} sequence is discarded from the {d^} bit stream. This produces the bit stream {w^I}.
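A compact sketch of this de-spreading back end (Eqs. ( 35 ), ( 31 ), ( 70 ) and ( 38 )) might look like the following; bit_samples, the number of samples per coded bit, and the other names are our assumptions:

    import numpy as np

    def despread(y, c_t, f0, fs, bit_samples):
        r = c_t * y                              # de-spreading, Eq. (35)
        t = np.arange(len(r)) / fs
        Tb = bit_samples / fs
        g = r * np.sqrt(2 / Tb) * np.cos(2 * np.pi * f0 * t)   # Eq. (31)
        nbits = len(g) // bit_samples
        ri = g[:nbits * bit_samples].reshape(nbits, bit_samples).sum(axis=1) / fs
        return np.where(ri > 0, 1, -1)           # integrate, Eq. (70); decide, Eq. (38)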
3.2.6 WATERMARK DE-INTERLEAVING AND DECODING
The de-interleaving process is done using the same matrix used in the
watermark generation
in section 3.1.1 and shown in Figure 14. The bits are written into rows and
read by columns to
accomplish the de-interleaving process. The de-interleaved sequence is called {w^R}. The decoding of the repeat code of value m is done using Eq. ( 53 ):

w^k = 1   if the sum of the m received copies w^Ri of bit k is > 0
w^k = -1  if the sum of the m received copies w^Ri of bit k is ≤ 0

for k = 1, 2, ..., total bits in the data sequence. The final recovered sequence {w^} is the recovered watermark.
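De-interleaving and repeat decoding can be sketched as follows (illustrative names; nbits is the number of original watermark bits, and the row/column convention mirrors the one used at the transmitter):

    import numpy as np

    def deinterleave_and_decode(wI_hat, I, H, m, nbits):
        matrix = np.asarray(wI_hat).reshape(H, I)    # write the bits into rows
        wR_hat = matrix.T.flatten()[:nbits * m]      # read by columns, drop padding
        sums = wR_hat.reshape(nbits, m).sum(axis=1)  # sum each group of m copies
        return np.where(sums > 0, 1, -1)             # majority vote, Eq. (53)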
4 SYSTEM PERFORMANCE
4.1 SURVIVAL OVER DIFFERENT CHANNELS
A watermarking system was implemented using a well known mathematical
software
package. The system was composed of two modules: watermark generation and
embedding, and
watermark recovery. The watermark was first generated and embedded in an
audio signal. The
watermarked signal was then tested for recovery of the watermark after
transmission by different
channels, such as sub-band encoding, digital to analog - analog to digital
conversions and radio
transmission.
The music used was a 26 second excerpt of the song "In the Midnight Hour"
(W. Pickett &
S. Cropper) performed by The Commitments. A sampling frequency of 44.1 kHz was used. Each
of the watermarked audio signals was labeled to reflect the level of the
watermark below the
masking threshold (the A value), i.e. W2, W4, W6 and W8. With these
parameters, a total of 35
watermarks were embedded during the duration of each signal. The four
watermarked music
signals and the original signal were recorded digitally on a compact disc.
The computer was also
equipped with a full-duplex sound card with D/A and A/D converters. All the
radio systems were
simulated using a multiplex stereo modulator, FM/AM signal generator, and
ordinary consumer
CD player and FM/AM radio receiver. The percentage of correct bits recovered
per watermark
was measured before and after transmission. Two examples are shown in Figure
32 and Figure
33. The percentage of correct bits before transmission is the continuous
line, and the percentage
26
of correct bit after transmission is the dotted line. Also, the offset from
the expected starting
point of each watermark after transmission is measured (in samples), as well
as the total of
watermarks recovered and the average recovery percentage.
4.2 LISTENING TEST
One of the requirements of the watermarking system is to retain the
perceptual quality of the
signal. This is often referred to as "transparency." The transparency of the
watermarking
algorithm was tested using three of the four watermarked audio signals (W2,
W4 and W6) used
in section 4.1. An ABX listening test was used as the testing mechanism. In
an ABX test the
listener can hear selection A (in this case the non-watermarked audio),
selection B (the
watermarked audio) and X (either the watermarked or non-watermarked audio).
The listener is
then asked to decide if selection X is equal to A or B. The number of
correct answers is the basis
to decide if the watermarked audio is perceptually different than the
original audio and would,
therefore, declare the watermarking algorithm as "non-transparent." In the
other case, if the
watermarked audio is perceptually equal to the original audio, the
watermarking algorithm will
be declared as "transparent."
Using the theory explained in Burstein [ 19 ], [ 20 ], different parameters
were selected to
find an appropriate sample size. A criterion of significance a'=0.1 is
selected (also known as
Error Type 1). The Type 2 error risk is assumed b'=0.1. The probability p1
that a listener finds
the right answer by chance is 0.5 in an ABX system. The effect size is
selected as p2=0.7. With
these parameters, the approximated required sample size that meets the
specifications is 37.61
samples. The sample size is selected as n=40. (40 listeners per ABX set).
The critical c (c') is the
minimum number of correct samples which, together with n and p1, can produce
a significance
level a equal to or less than the specified criterion of significance a'.
The calculated c' is 24.55
and can be rounded off to 25. This is the minimum number of correct answers
to accept the
hypothesis that the listener perceives differences between audio A and B. With c'=25, the achieved significance level becomes a=0.078, which is below the required criterion. The Type 2 error risk becomes b=0.11, which approximately meets the desired level. The results and their approximate significance levels are shown in Table 1.
Signal   Sample Size   Correct Identifications   a
W2       40            24                        0.14
W4       40            19                        0.50
W6       40            19                        0.50

Table 1. Listening test results
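The sample-size arithmetic of this section can be double-checked against the exact binomial distribution (a quick sketch using scipy; the printed values are approximate):

    from scipy.stats import binom

    n, p1, p2, c = 40, 0.5, 0.7, 25
    alpha = 1 - binom.cdf(c - 1, n, p1)   # P(X >= 25 | guessing): about 0.077
    beta = binom.cdf(c - 1, n, p2)        # P(X < 25 | p2 = 0.7): about 0.11
    print(alpha, beta)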
4.3 DISCUSSION
The survival over different channels showed that after encoding, not all the
watermarks could
be recovered with 100% accuracy. This occurs because of the multiple factors
that affect the
quality of the embedded watermark, such as: the number of audio components
replaced, the gain
of the watermark, and the masking threshold. It is important to note that in
some frames the
watermark information can be very weak, even null. The spread spectrum
technique employed
can partially solve these problems, but if many consecutive frames have no
watermark
information, that specific watermark cannot be recovered.
The theoretical position of the watermark and the offset of the actual
watermark represent the
starting position of the {header} of each watermark. This position will not
affect the recovery of
the watermark because each watermark is embedded independently of the others. In the actual tests three different cases are seen: almost no offset, linearly increasing offset, and varying offset. When no offset is seen, the original signal and the recorded signal after transmission were played at the same speed. In the cases where the offset increases linearly, it is assumed that the speed of the playback device (in this case an ordinary consumer CD player) was different (slightly slower) than that of the recording device. The last case shows the unstable speed variations of the tape device. If the speed of the playback device is close enough to the original speed, the de-spreading can be successful because the difference in alignment between the watermarked audio and the de-spreading signals (PN sequence, demodulator and {header}) will not greatly affect the final result.
Finally, the percentage of correct bits recovered measures the quality of the recovery for each watermark. Notice that not all the watermarks are recovered (%bits = 0.0), and not all the watermarks are recovered in their totality, but many of them were recovered with more than 80%
of the bits. A good bit error detection/correction algorithm or averaging
technique could
substantially improve the recovery of the watermark. A very strong point in
the watermarking
system is the redundancy of watermarks embedded into the audio stream. In
this case, each
watermark lasts approximately 600 ms. Even if just a few watermarks are
recovered, the goal of
transmitting the watermark information within the audio signal and
recovering it afterwards is
accomplished.
The listening test showed that the watermark at -2 dB below the masking threshold (W2) is the most likely to be heard, but it cannot be ensured that listeners actually noticed the difference.
For all the other watermarked signals, the results show that the process is
"transparent."
5 CONCLUSIONS
The proposed digital watermarking method for audio signals is based on a
psychoacoustic
auditory model to shape an audio watermark signal that is generated using
spread spectrum
techniques. The method retains the perceptual quality of the audio signal,
while being resistant to
diverse removal attacks, either intentional or unintentional. The recovery
of the watermark is
accomplished without knowledge of the original audio signal. The only
information used
includes the watermarked audio signal and the parameters used for the
watermark generation.
The psychoacoustic auditory model retrieves the necessary information about
the masking
threshold of the input audio signal. This model is a good approach that can be used for several applications such as perceptual coding, masking analysis, or watermark embedding. The spread
spectrum theory describes two important Direct Sequence techniques, but the
employed
technique is Coded Direct-Sequence Spread Binary Phase-Shift-Keying (coded
DS/BPSK).
Because the literature on this topic is oriented toward communication theory, some assumptions were made to apply the theory in an audio-bandwidth environment. Specifically in this
Specifically in this
case, the audio information was considered the "noise" or "jammer" signal
that interferes with
the watermark.
Future research could be performed in different aspects of this proposed
algorithm such as:
- System performance with different types of music.
- Experimenting with different spread spectrum encoding parameters.
- Changes in the playback speed of the signal.
- Crosstalk interference.
- Multiple watermark embedding.
- Use of techniques to enhance recovery of the watermark (i.e., bit error detection/correction, averaging, etc).
- Real-time implementation.
- Different signal schemes for the generation of the PN sequence.
6 ACKNOWLEDGMENT
The author wishes to thank Professors Ken Pohlmann and Will Pirkle from the Music Engineering program at the University of Miami for their valuable advice and feedback, and Music Engineer Alex Souppa for his help as technical editor and English corrector of the author's master's thesis.
7 REFERENCES
[ 1 ] E. Zwicker and U. T. Zwicker, "Audio Engineering and Psychoacoustics:
Matching Signals
to the Final Receiver, the Human Auditory System," J. Audio Eng. Soc., vol.
39, pp. 115 -126
(1991 March)
[ 2 ] T. Sporer and K. Brandenburg, "Constraints of Filter Banks Used for
Perceptual
Measurement," J. Audio Eng. Soc., vol. 43, pp. 107 - 115 (1995 March)
[ 3 ] J. Mourjopoulos and D. Tsoukalas, "Neural Network Mapping to
Subjective Spectra of
Music Sounds," J. Audio Eng. Soc., vol. 40, pp. 253 - 259 (1992 April)
[ 4 ] J. D. Johnston, "Transform Coding of Audio Signals Using Perceptual
Noise Criteria,"
IEEE Journal on Selected Areas in Communications, vol. 6, pp. 314 - 323
(1988 Feb.)
[ 5 ] M. K. Simon, J. K. Omura, R. A. Scholtz and B. K. Levitt, Spread Spectrum Communications Handbook (McGraw-Hill, New York, 1994)
[ 6 ] R. L. Pickholtz, D. L. Schilling, and L. B. Milstein, "Theory of
Spread-Spectrum
Communications - A Tutorial," IEEE Transactions on Communications, vol.
COM-30, pp. 855
- 884 (1982 May)
[ 7 ] C. S. Lindquist, Adaptive & Digital Signal Processing with Digital
Filtering Applications
(Steward & Sons, Miami, 1989)
[ 8 ] L. R. Rabiner, and R. W. Schafer, Digital Processing of Speech Signals
(Prentice Hall, New
Jersey, 1978)
[ 9 ] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models (Springer-Verlag, Berlin, 1990)
[ 10 ] D. L. Nicholson, Spread Spectrum Signal Design. LPE & AJ Systems
(Computer Science
Press, Rockville, Maryland, 1988)
[ 11 ] C. Neubauer and J. Herre, "Digital Watermarking and Its Influence on Audio Quality," presented at the 105th Convention of the Audio Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. 46, p. 1041 (1998 November), preprint 4823.
[ 12 ] J. G. Roederer, The Physics and Psychophysics of Music
(Springer-Verlag, New York,
1995)
[ 13 ] J. G. Beerends and J. A. Stemerdink, "A Perceptual Speech-Quality
Measure Based on a
Psychoacoustic Sound Representation," J. Audio Eng. Soc., vol. 42, pp. 115 -
123 (1994 March)
[ 14 ] J. G. Beerends and J. A. Stemerdink, "A Perceptual Audio Quality
Measure Based on a
Psychoacoustic Sound Representation," J. Audio Eng. Soc., vol. 40, pp. 963 -
978 (1992
December)
[ 15 ] C. Colomes, M. Lever, J. B. Rault, Y. F. Dehery and G. Faucon, "A
Perceptual Model
Applied to Audio Bit-Rate Reduction," J. Audio Eng. Soc., vol. 43, pp. 233 -
239 (1995 April)
[ 16 ] T. Sporer, G. Gbur, J. Herre and R. Kapust, "Evaluating a Measurement
System," J. Audio
Eng. Soc., vol. 43, pp. 353 - 362 (1995 May)
[ 17 ] M. R. Schroeder, B. S. Atal and J. L. Hall, "Optimizing Digital
Speech Coders by
Exploiting Masking Properties of the Human Ear," J. Acoust. Soc. Am., vol.
66, pp. 1647 - 1652
(1979 Dec.)
[ 18 ] B. Paillard, P. Mabilleau, S. Morissette and J. Soumagne, "PERCEVAL:
Perceptual
Evaluation of the Quality of Audio Signals," J. Audio Eng. Soc., vol. 40,
pp. 21 - 31 (1992
Jan./Feb.)
[ 19 ] H. Burstein, "By the Numbers," Audio, vol. 74, pp. 43 - 48 (1990
Feb.)
[ 20 ] H. Burstein, "Approximation Formulas for Error Risk and Sample Size
in ABX Testing,"
J. Audio Eng. Soc., vol. 36, pp. 879 - 883 (1988 Nov.)
[ 21 ] S. Haykin, Communication Systems 3rd ed. (Wiley, New York, 1994)
[ 22 ] R. L. Shrader, Electronic Communication 5th ed. (McGraw Hill, New
York, 1985)
[ 23 ] L. Boney, A. H. Tewfik and K. N. Hamdy, "Digital Watermarks for Audio Signals," IEEE Int. Conf. on Multimedia Computing and Systems, Hiroshima, Japan (1996 June)
[ 24 ] I. J. Cox, "Spread Spectrum Watermark for Embedded Signalling",
United States Patent
5,848,155 (1998 Dec)
Figure 1. Psychoacoustic auditory model (FFT → power spectrum Sp(jw) → energy per critical band Spz(z) → spreading across critical bands Sm(z) → masking threshold estimate T(z) → noise shaping → IFFT)
Figure 2. Masking curves in (a) linear and (b) logarithmic frequency scale [ 1 ] (excitation level in dB versus frequency in kHz)
Figure 3. Excitation level versus critical band rate for narrow band noises with various center frequencies [ 1 ]
Figure 4. Model of the spreading function, B(z), using Eq. ( 5 )
Figure 5. Basic spread spectrum communications system (PG = Wss/Rb; Eb/Nj = PG/(J/S), with J/S the jammer-to-signal power ratio; J = jammer power, S = signal power, Wss = spread bandwidth, Rb = bit rate in bits/sec)
Figure 6. Spectrum of BPSK signal (main lobe width 2/Tb)
Figure 7. Spectrum of BPSK signal after spreading (main lobe width 2/Tc = N·2/Tb)
Figure 8. DS/BPSK modulation
Figure 9. DS/BPSK modified
Figure 10. Uncoded DS/BPSK
Figure 11. Coded and uncoded signals before spreading
Figure 12. Coded and uncoded signals after spreading
Figure 13. Repeat code DS/BPSK system
Figure 14. Interleaver matrix with I=5 and H=15 (elements X1-X75 are written into columns and read out by rows)
Figure 15. FFT filter assuming additive signal and noise
Figure 16. Proposed system (watermark generation and embedding)
Figure 17. Proposed system (data recovery)
Figure 18. Watermark generation system
1 1 1 -1 1
1 1 -1 -1 1
1 -1 -1 -1 -1
1 -1 -1 1 -1
1 -1 1 1 -1
1 -1 1 1 -1
-1 -1 1 1 -1
-1 -1 1 1 -1
-1 1 1 1 1
1 1 1 1 1
Figure 19. Interleaver matrix
Figure 20. Baseband system parameters
Figure 21. Passband system parameters (anti-aliasing)
Figure 22. Passband system with frequency limits LF and HF
Figure 23. Time domain signals: data bit stream, d(t); PN sequence, c(t); BPSK modulator, sin(t); and watermark audio signal, x(t)
Figure 24. Frame segmentation and watermark redundancy
Figure 25. (a) Audio signal s(t) and window signal w(t), (b) windowed signal
sw(t), (c) magnitude of
frequency representation Sw(jw), and (d) power spectrum Sp(jw) and energy
per critical band Spz(z)
Figure 26. (a) Energy per critical band Spz(z), (b) spreading function B(z),
and (c) Spread energy per
critical band Sm(z)
Figure 27. (a) raw masking threshold Traw(z), and (b) normalized masking
threshold Tnorm(z)
Figure 28. (a) Xwnew(z) before shaping, (b) after shaping with A = 0.4
Figure 29. Final masking threshold Tfinal(z), and the power spectrum of (a)
Swnew(jw), (b) Xfinal(jw),
and (c) OUT(jw).
Figure 30. Detection peak in det(t)
Figure 31. Watermark recovery system
Figure 32. MPEG layer 3 system performance
Figure 33. FM stereo (left channel) system performance