How can I build a “Not A Word” detector for speech recordings in English if I don't have training data?

I want the detector to read an audio recording and output the time intervals in which there was neither speech nor silence. I looked for existing solutions on the web but did not find anything practical that can be used in production. I wonder if I can build it using an existing, trained model that I could tune without training data. I can produce some amount of test data, but I don't have a large budget to spend on data labeling.

"Not A Word" is still a human vocalization, e.g. babbling, "um" and other filler sounds. Each recording contains a single speaker. Low-level background sounds are possible. The recordings are audio files between 30 seconds and 2 minutes 30 seconds long.

Please let me know if I can provide any other details that might be helpful.

Your ideas and advice are much appreciated.

Thanks!

Tigran

P.S. One possible solution is below. It will likely have low accuracy and poor performance, so it is not suitable for production. Still, here is how it works: it splits the recording into 1-second slices and uploads those that are not silence to the Google Speech-to-Text service. If the result is empty, then that slice contains some sound that is neither a word nor silence.

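// Returns the 1-based numbers of the one-second slices that contain sound but no recognized words.
// splitIntoOneSecondSlices, isNotSilence and sendToGCPSpeechToText are placeholder helpers.
function getTimeValuesWhenNotWordsOrSilence(recording) {
  const notWordsOrSilence = [];
  const slices = splitIntoOneSecondSlices(recording);
  let count = 0;
  // for...of iterates over the slices themselves (for...in would iterate over their indices).
  for (const slice of slices) {
    count += 1;
    if (isNotSilence(slice)) {
      // Assumed to return the transcript as a string; a real API call would be asynchronous.
      const result = sendToGCPSpeechToText(slice);
      if (result === '') {
        notWordsOrSilence.push({ time: count });
      }
    }
  }
  return notWordsOrSilence;
}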
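
The helpers that the pseudocode above leaves undefined could be sketched roughly as follows. This is only an illustration under assumptions: the recording is taken to be already decoded into mono PCM samples in a Float32Array with values in [-1, 1], silence is approximated by a simple RMS energy threshold (the 0.01 default is an arbitrary guess that would need tuning on test data), and all function names and parameters here are hypothetical. The last helper merges consecutive flagged slices into time intervals, since intervals are the desired output.

```javascript
// Hypothetical helpers; names, parameters and the 0.01 threshold are assumptions.

// Split mono PCM samples (Float32Array) into consecutive slices of sliceSeconds each.
function splitIntoSlices(samples, sampleRate, sliceSeconds = 1) {
  const sliceLength = Math.floor(sampleRate * sliceSeconds);
  if (sliceLength < 1) {
    throw new RangeError('sampleRate * sliceSeconds must be at least 1 sample');
  }
  const slices = [];
  for (let start = 0; start < samples.length; start += sliceLength) {
    // subarray creates a view on the same buffer, so no samples are copied.
    slices.push(samples.subarray(start, start + sliceLength));
  }
  return slices;
}

// Treat a slice as non-silent when its RMS energy exceeds the threshold.
function isNotSilence(slice, rmsThreshold = 0.01) {
  if (slice.length === 0) return false;
  let sumOfSquares = 0;
  for (const sample of slice) sumOfSquares += sample * sample;
  return Math.sqrt(sumOfSquares / slice.length) > rmsThreshold;
}

// Merge 0-based slice indices (sorted ascending) into { start, end } intervals in seconds.
function mergeIntoIntervals(flaggedIndices, sliceSeconds = 1) {
  const intervals = [];
  let previousIndex = null;
  for (const index of flaggedIndices) {
    if (previousIndex !== null && index === previousIndex + 1) {
      // Adjacent to the previous flagged slice: extend the current interval.
      intervals[intervals.length - 1].end = (index + 1) * sliceSeconds;
    } else {
      intervals.push({ start: index * sliceSeconds, end: (index + 1) * sliceSeconds });
    }
    previousIndex = index;
  }
  return intervals;
}
```

For example, if the slices at indices 1, 2 and 5 were flagged, mergeIntoIntervals([1, 2, 5]) would give two intervals, from 1 s to 3 s and from 5 s to 6 s. An energy threshold cannot tell a filler sound from background noise, so it only narrows down which slices are worth sending to the recognizer.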
