How to get the text from an audio file with Python

I got the example code from here: https://github.com/1learnfromdata/keyword_extraction_from_audio

First, install librosa:

pip install librosa

This will also pull in pooch and audioread as dependencies.

Then install torch. Torch is quite big, about 222 MB.

Finally, install transformers.
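The plain pip commands should be enough (the exact torch command can vary by platform and CUDA support, so check pytorch.org if in doubt):

pip install torch
pip install transformers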

Then you run the script (the full code is at the end of this post) and the processing goes like this:

You wait for a minute… (it's 5:52 as I write this). I suggest you don't use SublimeREPL here, but the command line to start the script.

Then at 5:53 it is still going.

The percentage goes up really slowly, so I think it will take some minutes (5:54).

Consider that the audio file is only 2 minutes long.

Another minute (5:55) and we are at 26%

5:57 35%

6:09 58%. So, we will need at least another 15 minutes… half an hour for 2 minutes of audio? (Most of this wait is presumably the one-time download of the facebook/wav2vec2-base-960h checkpoint, which gets cached, so later runs should be much faster.)

6:20 70%

At the end (6:38, or a bit less) I found this:

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\librosa\core\audio.py:165: UserWarning: PySoundFile failed. Trying audioread instead.
  warnings.warn("PySoundFile failed. Trying audioread instead.")
Traceback (most recent call last):
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\librosa\core\audio.py", line 149, in load
    with sf.SoundFile(path) as sf_desc:
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\soundfile.py", line 629, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\soundfile.py", line 1183, in _open
    _error_check(_snd.sf_error(file_ptr),
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\soundfile.py", line 1357, in _error_check
    raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening 'audio_files/1_audi_file.wav': System error.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "G:\speech recognition\keyword_extraction_from_audio\speech_recognition.py", line 21, in <module>
    speech, rate = librosa.load(f"audio_files/{i+1}_audi_file.wav", sr=16000)
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\librosa\core\audio.py", line 166, in load
    y, sr_native = __audioread_load(path, offset, duration, dtype)
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\librosa\core\audio.py", line 190, in __audioread_load
    with audioread.audio_open(path) as input_file:
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\audioread\__init__.py", line 111, in audio_open
    return BackendClass(path)
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\audioread\rawread.py", line 62, in __init__
    self._fh = open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'audio_files/1_audi_file.wav'
>>> 
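The FileNotFoundError at the bottom just means librosa couldn't find the file at that path. If you hit the same thing, a quick check like this (the path is the one from the example script, adjust it to your own file) tells you whether the wav file is really where the script is looking:

from pathlib import Path

# path taken from the example script; change it to your own wav file
audio_path = Path("audio_files/1_audi_file.wav")
print(audio_path.resolve(), "exists:", audio_path.exists())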

So, I decided to use a small wav file of just a couple of seconds.

It turned out that I had to install another module,

which pulled in another 6 or 7 modules.

After this I launched the script again, and here's what I got after a couple of seconds:

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\transformers\models\wav2vec2\tokenization_wav2vec2.py:417: FutureWarning: The class `Wav2Vec2Tokenizer` is deprecated and will be removed in version 5 of Transformers. Please use `Wav2Vec2Processor` or `Wav2Vec2CTCTokenizer` instead.
  warnings.warn(
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
HI EVERYBODY COME VISIT MY BLOGHI EVERYBODY COME VISIT MY BLOG
Traceback (most recent call last):
  File "G:\speech recognition\keyword_extraction_from_audio\speech_recognition.py", line 51, in <module>
    nlp = spacy.load("en_core_web_lg")
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\spacy\__init__.py", line 51, in load
    return util.load_model(
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\spacy\util.py", line 328, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_web_lg'. It doesn't seem to be a Python package or a valid path to a data directory.
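(This last error is only about the missing spaCy language model, which the original example uses after the transcription. If you want that part too, it should normally be enough to run:

python -m spacy download en_core_web_lg

I didn't bother, because I only wanted the text.)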

So, not everything is perfect, but here's the output:

HI EVERYBODY COME VISIT MY BLOGHI EVERYBODY COME VISIT MY BLOG

Ok, apart from BLOGHI and the repetition (?), it's all good. A voice said: "Hi everybody, come visit my blog". The doubling (and the BLOGHI glued together) is most likely just the example script appending the same transcription twice and joining the pieces without a space. I will see if in the future there's a way to make this cleaner.
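If accuracy turns out to be a problem, one thing to try (I haven't tested it myself) is the bigger checkpoint facebook/wav2vec2-large-960h; the script stays the same, only the download is much larger:

tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")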

Finally, here's what I ended up using.

I don't know exactly what spacy is for in the original example (probably the keyword extraction part, given the repo name), but to print the text we just need this script (and the modules above, except spacy) and a wav file. That's it.

# A python package for music and audio analysis.
# https://librosa.org/doc/latest/index.html
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

# Download (on the first run) and load the pretrained model and tokenizer
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

collection_of_text = []
# Load the audio resampled to 16 kHz, the rate the model was trained on
speech, rate = librosa.load("myfile.wav", sr=16000)
# Turn the waveform into the tensor the model expects
input_values = tokenizer(speech, return_tensors='pt').input_values
with torch.no_grad():
    logits = model(input_values).logits
# Pick the most likely token at each time step and decode to text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = tokenizer.batch_decode(predicted_ids)[0]
collection_of_text.append(transcription)

# Join the transcriptions (only one here; the original script looped over many files)
final_complete_speech = ""
for i in collection_of_text:
    final_complete_speech += i

print(final_complete_speech)

This gives the right result

HI EVERYBODY COME VISIT MY BLOG
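By the way, the FutureWarning above says that Wav2Vec2Tokenizer is deprecated and suggests Wav2Vec2Processor. Here's a sketch of the same script with the processor instead; I only adapted the code above, so treat it as untested:

import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, rate = librosa.load("myfile.wav", sr=16000)
# the processor wants to know the sampling rate explicitly
input_values = processor(speech, sampling_rate=16000, return_tensors="pt").input_values
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]

print(transcription)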

