AVOZES

The Audio-Video Australian English Speech Data Corpus


Recording Equipment

A clip-on microphone was attached to the speaker's clothes on the chest, about 20cm below the mouth. The microphone was an omni-directional Sennheiser MKE 10-3 with a frequency response of 50Hz-20kHz. The microphone system was connected directly to the DV recorder, where the microphone's output was recorded as mono sound on DV tape with a 48kHz sampling frequency. In the 48kHz sampling mode, two channels are recorded for stereo audio, but in the case of a mono audio input, both channels contain the same signal. The DV recorder was a JVC HR-DVS1U miniDV/S-VHS video recorder, which also featured an IEEE-1394 (FireWire, iLink) DV in/out connector.
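Since both audio channels carry the same microphone signal, one of them can be dropped when working with the extracted audio. The following is a minimal sketch only. It assumes the samples have already been read into a Python list in interleaved order (left, right, left, right, ...); the function name and the interleaved layout are illustrative assumptions, not part of the AVOZES tools.

```python
def stereo_to_mono(interleaved):
    """Collapse interleaved two-channel samples into a single channel.

    Assumes the layout [L0, R0, L1, R1, ...] (an assumption of this
    sketch) and checks that both channels really are identical, as
    expected for a mono input recorded in two-channel mode.
    """
    if len(interleaved) % 2 != 0:
        raise ValueError("expected an even number of interleaved samples")
    left = interleaved[0::2]   # samples at even positions
    right = interleaved[1::2]  # samples at odd positions
    if left != right:
        raise ValueError("channels differ; input is not duplicated mono")
    return left
```

The equality check is a deliberate safeguard: if the channels ever differ, silently discarding one would lose information.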

The two video cameras were standard colour analogue NTSC cameras mounted side by side on a rig, with a slight vergence of approximately 5 degrees towards the centre. The output of the stereo cameras was multiplexed into one video signal using field multiplexing and then sent to a Hitachi IP5005 video card. In field multiplexing, a device containing a video switching integrated circuit selects the signal from one video stream as the odd field of the video output, while the signal from the other video stream becomes the even field. This requires first de-interleaving the odd and even fields of the video frames from each camera. Multiplexing video signals in the analogue phase has the advantage that it can be applied to virtually any video hardware system: images from two cameras can be stored in a single video frame, and stereo image processing can be performed within the computer's memory using only one image processing board. Single video stream processing is thus transformed into stereo vision processing.
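The effect of field multiplexing can be illustrated in software, although in the AVOZES setup it was done by analogue hardware. The sketch below treats a frame as a list of scan lines and builds one output frame from two input frames. Which camera supplies which field is not stated above, so the assignment used here (first camera on rows 0, 2, 4, ...) is an assumption, as are the function and parameter names.

```python
def field_multiplex(frame_a, frame_b):
    """Combine two frames into one interlaced frame.

    Each frame is a list of scan lines. Rows with an even index are
    taken from frame_a and rows with an odd index from frame_b, so
    each camera contributes one field of the output frame.
    """
    if len(frame_a) != len(frame_b):
        raise ValueError("both frames must have the same number of lines")
    multiplexed = []
    for row in range(len(frame_a)):
        source = frame_a if row % 2 == 0 else frame_b
        multiplexed.append(source[row])
    return multiplexed
```

Every second line of each input frame is discarded, so each camera keeps only half of its vertical resolution in the combined frame.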

A weakness of the field multiplexing technique is that only half the vertical resolution of the original video frame from each camera is available, as two video streams are compressed into a single frame. Another weakness is the delay of about 16.7ms between the images from the two video streams, which is inherent in the NTSC standard and, with a different value, in any other interlaced video/TV standard. That is, all the lines of one field, say the odd lines, are processed first, followed by all the lines of the other field. The nominal field frequency in the NTSC standard is 60Hz, corresponding to a frame frequency of 30Hz, and hence there is a delay of 1/60s, or about 16.7ms, between fields.
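The two figures discussed above follow from simple arithmetic, sketched here for reference. The exact colour NTSC field rate of 60000/1001Hz (about 59.94Hz) and the 480-line frame height used in the usage note are assumptions added for illustration; the function names are hypothetical.

```python
from fractions import Fraction

# Nominal field rate as used in the text, and the exact colour NTSC
# field rate (an assumption added here for comparison).
NOMINAL_FIELD_RATE_HZ = 60
EXACT_FIELD_RATE_HZ = Fraction(60000, 1001)


def field_delay_ms(field_rate_hz):
    """Time between two successive fields, in milliseconds."""
    return float(1000 / field_rate_hz)


def lines_per_camera(frame_lines):
    """Vertical resolution left for each camera after field multiplexing."""
    return frame_lines // 2
```

With the nominal rate, `field_delay_ms(NOMINAL_FIELD_RATE_HZ)` gives roughly 16.67ms, and with the exact rate roughly 16.68ms. For a frame of 480 lines, `lines_per_camera(480)` gives 240 lines per camera.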

Stereo video in a single frame
In the Hitachi IP5005 video card, the video signal was unscrambled, so that the video sequences on tape show the output from the left camera in the top half and the output of the right camera in the bottom half of each video frame (see example on the right). The video signal was then sent from the video card to the DV recorder, where it was recorded as an NTSC YUV 4:1:1 signal at 29.97Hz frame rate.
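Because each recorded frame holds the left camera image in its top half and the right camera image in its bottom half, a user would typically split every frame before any stereo processing. A minimal sketch, assuming a decoded frame is available as a list of scan lines (the function name is hypothetical):

```python
def split_stereo_frame(frame):
    """Split a recorded frame into (left_image, right_image).

    The top half of the frame is the left camera view and the bottom
    half is the right camera view, as described above.
    """
    if len(frame) % 2 != 0:
        raise ValueError("expected a frame with an even number of lines")
    half = len(frame) // 2
    return frame[:half], frame[half:]
```

Each returned image has half the line count of the recorded frame, so the aspect ratio should be taken into account in any subsequent geometric analysis.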

Synchronisation between the audio and video signals is implicit in the locked mode of the DV standard, which was used for the AVOZES recordings. Because the audio signal was transmitted directly to the DV recorder, while the video signal underwent the above-mentioned processing, the video stream lags the audio stream by one video frame. This delay has deliberately not been corrected during editing of the AVOZES sequences, so any user of AVOZES must be aware of it and take it into account in their analyses.
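One way to take the one-frame lag into account is to shift the audio window that is paired with each video frame. The sketch below is an illustration under stated assumptions, not a prescribed procedure: it takes the 29.97Hz frame rate to be exactly 30000/1001Hz, reads "the video lags the audio by one frame" as meaning that the image in video frame n belongs with the audio of frame n - 1, and uses hypothetical names throughout.

```python
from fractions import Fraction

AUDIO_RATE_HZ = 48000                   # audio sampling frequency
FRAME_RATE_HZ = Fraction(30000, 1001)   # 29.97Hz, assumed exact value
VIDEO_LAG_FRAMES = 1                    # video lags audio by one frame


def audio_span_for_video_frame(frame_index, lag_frames=VIDEO_LAG_FRAMES):
    """Return (start, end) audio sample indices matching a video frame.

    The end index is exclusive. The span is shifted earlier by
    lag_frames to compensate for the video delay.
    """
    samples_per_frame = AUDIO_RATE_HZ / FRAME_RATE_HZ  # 1601.6 samples
    audio_frame = frame_index - lag_frames
    if audio_frame < 0:
        raise ValueError("no matching audio before the start of the tape")
    start = round(audio_frame * samples_per_frame)
    end = round((audio_frame + 1) * samples_per_frame)
    return start, end
```

Exact fractions are used because one video frame spans 1601.6 audio samples, so integer sample boundaries have to be rounded per frame rather than accumulated.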



© Roland Göcke
Last modified: Tue Mar 22 15:59:03 AUS Eastern Daylight Time 2005