De-reverberation evaluation of VocalFusion Stereo Dev Kit (XK-VF3500-L33)

Post by **oplatek** » Wed Oct 03, 2018 7:19 pm

Hi all!

I am in the process of evaluating the VocalFusion microphone array XK-VF3500-L33.
My ultimate goal is to improve Automatic Speech Recognition (ASR) in terms of Word Error Rate (WER) for far-field speech.

I recorded a small dataset and played it back from a studio monitor (Yamaha HS7).
For recordings taken from three meters, I expected slightly better performance.
The XMOS ASR-tuned second channel gave me only a 5% absolute improvement over the raw signal from a single microphone (third channel).
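
For readers unfamiliar with the metric, here is a minimal sketch of how WER and the difference between an absolute and a relative improvement can be computed. The function names and the example numbers (30% and 25%) are hypothetical illustrations, not the values measured in this evaluation.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def absolute_improvement(baseline_wer, new_wer):
    """Difference in WER, in the same units as the inputs."""
    return baseline_wer - new_wer


def relative_improvement(baseline_wer, new_wer):
    """Improvement as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    print(word_error_rate("turn on the kitchen light", "turn on kitchen lights"))
    # Hypothetical numbers: a baseline WER of 30% improved to 25%.
    print(absolute_improvement(0.30, 0.25), relative_improvement(0.30, 0.25))
```

With these made-up numbers, a 5% absolute improvement is roughly a 17% relative improvement, which is why it matters which of the two is being quoted.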

So I started investigating the de-reverberation capabilities by RT60 evaluation with pink noise and clapping.
Surprisingly, the single raw microphone showed the shorter RT60:
For the raw input (3rd channel): It took approx. 220 ms for the signal to drop by 30 dB. The measured RT60 of the room is around 440 ms.
For the processed input (2nd channel): It took 330 ms to drop by 30 dB => RT60 is 660 ms.
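
The RT60 figures above appear to come from doubling the 30 dB decay time, which assumes the level falls linearly in dB. As a rough sketch of how such a measurement could be automated, the code below uses Schroeder backward integration on a recorded decay (for example a clap) and then extrapolates to 60 dB. The function names, the -5 dB starting point, and the synthetic test signal are assumptions for illustration, not the procedure actually used for the numbers in this post.

```python
import numpy as np


def decay_time_ms(signal, sample_rate, drop_db=30.0, start_db=-5.0):
    """Time in ms for the Schroeder decay curve to fall by drop_db,
    measured from the point where it first reaches start_db."""
    energy = np.asarray(signal, dtype=float) ** 2
    # Schroeder backward integration: energy remaining from each sample on.
    remaining = np.cumsum(energy[::-1])[::-1]
    curve_db = 10.0 * np.log10(remaining / remaining[0] + 1e-30)
    below_start = curve_db <= start_db
    below_end = curve_db <= start_db - drop_db
    if not below_end.any():
        raise ValueError("signal does not decay far enough")
    # np.argmax returns the index of the first True value.
    i_start = int(np.argmax(below_start))
    i_end = int(np.argmax(below_end))
    return 1000.0 * (i_end - i_start) / sample_rate


def rt60_from_partial_decay(decay_ms, drop_db=30.0):
    """Extrapolate a partial decay time to 60 dB, assuming linear decay in dB."""
    return decay_ms * 60.0 / drop_db


def synthetic_decay(rt60_s, sample_rate, duration_s=2.0):
    """Idealised exponential decay whose energy drops 60 dB in rt60_s seconds."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return 10.0 ** (-3.0 * t / rt60_s)


if __name__ == "__main__":
    fs = 16000
    t30 = decay_time_ms(synthetic_decay(0.44, fs), fs)
    print(t30, rt60_from_partial_decay(t30))
```

On real recordings the background noise floor limits how far the curve can be followed, so the result depends on where the integration is truncated and on the chosen start level.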

I have the following questions:
1. Is this expected behavior?
2. Can you recommend other testing scenarios, ideally with known benchmark values?
- I am looking for benchmarks that do not depend on ASR but correlate well with ASR WER.

Thank you for your help!

Ondra