I am in the process of evaluating the VocalFusion microphone array XK-VF3500-L33.
My ultimate goal is to improve Automatic Speech Recognition (ASR) in terms of Word Error Rate (WER) for far-field speech.
I recorded a small dataset and played it back from a studio monitor (Yamaha HS7).
For recordings taken from three meters, I expected somewhat better performance.
The XMOS ASR-tuned second channel gave me only a 5% absolute improvement over the raw signal from a single microphone (third channel).
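For reference, here is a minimal sketch of how WER can be computed, so that "absolute improvement" is unambiguous: it is the plain difference between two WER values, not a relative reduction. The function name `wer` and the example sentences are illustrative assumptions, not taken from my dataset, and the reference transcript is assumed to be non-empty.

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: one deletion and one substitution in 5 words.
example_wer = wer("turn on the kitchen light", "turn on kitchen lights")
print(example_wer)
```

An absolute improvement is then `wer_raw - wer_processed`, while a relative improvement would divide that difference by `wer_raw`.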
So I started investigating the de-reverberation capabilities by measuring RT60 with pink noise and clapping.
Surprisingly, the single raw microphone showed a shorter RT60:
For Raw input (3rd channel): It took approx. 220 ms for the signal to drop by 30 dB, so the measured RT60 of the room is around 440 ms.
For Processed input (2nd channel): It took approx. 330 ms to drop by 30 dB, so the RT60 is around 660 ms.
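The RT60 values above are extrapolated from the time it takes to decay by 30 dB (doubling it gives the 60 dB decay time). A minimal sketch of that kind of estimate is below, assuming the recording contains only the decay tail (for example after a clap) and using Schroeder backward integration with a line fit between -5 dB and -35 dB. The function name `rt60_from_decay` and the synthetic test signal are my own assumptions for illustration, not part of any XMOS tooling.

```python
import math
import random


def rt60_from_decay(signal, fs, start_db=-5.0, end_db=-35.0):
    """Estimate RT60 by extrapolating the 30 dB slope of the energy decay curve."""
    energy = [x * x for x in signal]
    # Schroeder backward integration: energy remaining from each sample onward.
    edc = [0.0] * len(energy)
    total = 0.0
    for i in range(len(energy) - 1, -1, -1):
        total += energy[i]
        edc[i] = total
    # Collect the points of the decay curve that lie inside the fit range.
    times, levels = [], []
    for i, e in enumerate(edc):
        if e <= 0.0:
            continue
        level = 10.0 * math.log10(e / edc[0])
        if end_db <= level <= start_db:
            times.append(i / fs)
            levels.append(level)
    # Least-squares slope in dB per second, then extrapolate to 60 dB.
    mean_t = sum(times) / len(times)
    mean_l = sum(levels) / len(levels)
    num = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, levels))
    den = sum((t - mean_t) ** 2 for t in times)
    return -60.0 / (num / den)


# Synthetic check: noise whose amplitude falls by 60 dB in 0.44 s.
fs = 8000
true_rt60 = 0.44
random.seed(0)
tail = [random.gauss(0.0, 1.0) * 10 ** (-3.0 * (n / fs) / true_rt60)
        for n in range(fs)]
estimate = rt60_from_decay(tail, fs)
print(round(estimate, 2))
```

Running the same function on the raw and processed channels of one recording would make the two numbers directly comparable, as long as both are trimmed to start at the same instant.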
I have the following questions:
1. Is this expected behavior?
2. Can you recommend other test scenarios, ideally with known benchmark values?
- I am looking for benchmarks that do not depend on ASR but correlate well with ASR WER.
Thank you for your help!