General framing method for neural-network-based speech recognition?

I am confused about how voice recordings are preprocessed before being fed to a neural net for speech recognition.

In particular, I am unsure how a voice sample is fed to the network. As I understand it, if a voice clip contains only one word, its spectrogram is fed to the network and the network picks the most probable word.
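To make "feeding its spectrogram" concrete, here is a minimal sketch of a magnitude spectrogram computed with a short-time DFT. The function name `magnitude_spectrogram`, the frame length, the hop size, and the pure-tone "clip" are all assumptions for illustration; a real pipeline would more likely use an FFT library and log-mel features.

```python
import cmath
import math

def magnitude_spectrogram(samples, frame_len=64, hop=32):
    """Return a list of frames, each a list of DFT magnitudes (bins 0..frame_len//2)."""
    # Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        # Naive DFT, kept for clarity; it is O(frame_len^2) per frame.
        bins = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(abs(acc))
        frames.append(bins)
    return frames

# Hypothetical stand-in for a one-word clip: a 1000 Hz tone sampled at 8000 Hz.
# With frame_len=64 the tone falls on DFT bin 1000 * 64 / 8000 = 8.
sample_rate = 8000
tone_hz = 1000.0
clip = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(256)]
spec = magnitude_spectrogram(clip)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

The result is a time-by-frequency grid (here 7 frames of 33 bins), which is the kind of 2-D array a network would take as input for a single short clip.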

If a voice clip contains a whole sentence, does the same idea apply? That is, is the sample framed into chunks that each contain a single word, which are then fed to the neural net?

At this point, more problems come to mind:

1. A fixed-length sliding window could split a word into two parts, which would probably reduce prediction accuracy.
2. The network's input length cannot be infinite. Even if a fairly short voice clip can be padded to fit a network (possibly an RNN) with a long input length, how can the network be trained to handle a very long voice recording?
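Both concerns can be illustrated with a small sketch. The word boundaries below are made up, and the helper names (`fixed_windows`, `pad_to_length`, `words_cut_by`) are hypothetical; the code only shows that a fixed window can cut through a word and how a short clip would be padded, not how any particular recognizer works.

```python
def fixed_windows(num_samples, window_len, hop):
    """Return (start, end) sample indices for fixed-length sliding windows."""
    return [(s, s + window_len)
            for s in range(0, num_samples - window_len + 1, hop)]

def pad_to_length(samples, target_len, pad_value=0.0):
    """Right-pad a short clip so it fits a fixed network input length."""
    return list(samples) + [pad_value] * max(0, target_len - len(samples))

def words_cut_by(window, word_spans):
    """Words that overlap the window only partially, i.e. are split by its edges."""
    start, end = window
    cut = []
    for word, (w_start, w_end) in word_spans.items():
        overlaps = w_start < end and w_end > start
        fully_inside = w_start >= start and w_end <= end
        if overlaps and not fully_inside:
            cut.append(word)
    return cut

# Hypothetical word boundaries (in samples) for a 3-word clip of 3000 samples.
word_spans = {"turn": (0, 900), "lights": (1000, 2100), "off": (2200, 3000)}
windows = fixed_windows(3000, window_len=1000, hop=1000)
cuts = [words_cut_by(w, word_spans) for w in windows]
padded = pad_to_length([1.0, 2.0], 5)
```

Here the word "lights" runs past the second window and spills into the third, so neither window sees it whole, which is exactly the first concern above.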

Going further, in a live speech recognition or translation system, the voice is streamed to the compute server continuously, in time order. What does the input look like in this case? Is a voice clip of a certain length cached and then fed to the neural net? What does the network structure look like in this setting? Is there a general word detection algorithm that marks the start of a person's speech and frames the audio accordingly, instead of a framing algorithm with a fixed window size?
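The "cache a clip of a certain length, then feed it" idea could look like the sketch below. This is only an illustration of the buffering scheme described in the question, with a hypothetical `stream_chunks` helper and integers standing in for audio samples; it is not a claim about how production streaming recognizers are built.

```python
from collections import deque

def stream_chunks(sample_stream, chunk_len, overlap):
    """Yield fixed-length chunks from a (possibly endless) sample iterator.

    Consecutive chunks share `overlap` samples.
    """
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    buffer = deque()
    for sample in sample_stream:
        buffer.append(sample)
        if len(buffer) == chunk_len:
            yield list(buffer)
            # Drop everything except the tail that the next chunk reuses.
            for _ in range(chunk_len - overlap):
                buffer.popleft()

# Ten fake samples, chunks of 4 with 2 samples shared between neighbours.
chunks = list(stream_chunks(iter(range(10)), chunk_len=4, overlap=2))
```

Because the function is a generator, it never needs the whole recording in memory; each yielded chunk could be passed to the network as soon as it fills.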

Please correct any misunderstandings; I am a layman new to DL. I would appreciate links to papers or technical reports. Thanks.

Comments