I took some examples from here: https://github.com/1learnfromdata/keyword_extraction_from_audio
First install librosa
pip install librosa
Then install torch (it is quite a big download)
pip install torch
and finally transformers
pip install transformers
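Before launching the script, it can help to check that the three packages are really visible to the Python you are using. This is a minimal sketch of mine, not part of the original repository; the helper name missing_packages is made up for the example.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # The three packages installed above.
    missing = missing_packages(["librosa", "torch", "transformers"])
    if missing:
        print("Still to install:", ", ".join(missing))
    else:
        print("All packages found.")
```

If a name is printed here, the pip install probably went to a different Python installation than the one running the script.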
Then the processing goes like this.
You wait for a minute… (it’s 5:52 as I’m writing this). I suggest you do not use SublimeREPL to start the script, but the command line.
Then at 5:53 this goes on
The percentage goes up really slowly, so I think it will take some minutes (5:54)
Keep in mind that the audio file is 2 minutes long
Another minute (5:55) and we are at 26%
5:57 35%
6:09 58% So, we will need another 15 minutes at least… so half an hour for 2 minutes of audio?
6:20 70%
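The guess about the remaining time can be checked with a small calculation. This is only a sketch of mine: it assumes the progress is linear (constant speed), which the readings suggest is not quite true, and it counts the minutes from the start at 5:52.

```python
def remaining_minutes(elapsed_minutes, percent_done):
    """Linear estimate of the time left, assuming a constant speed."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    return elapsed_minutes * (100 - percent_done) / percent_done


# Readings from the log above, counted from the start at 5:52.
print(round(remaining_minutes(17, 58), 1))  # reading at 6:09, 58%
print(round(remaining_minutes(28, 70), 1))  # reading at 6:20, 70%
```

Both readings give about 12 minutes left, even though 11 minutes passed between them, so the speed was dropping and a linear estimate is too optimistic here.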
At the end (6:38 or a bit earlier) I found this:
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\librosa\core\audio.py:165: UserWarning: PySoundFile failed. Trying audioread instead.
  warnings.warn("PySoundFile failed. Trying audioread instead.")
Traceback (most recent call last):
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\librosa\core\audio.py", line 149, in load
    with sf.SoundFile(path) as sf_desc:
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\soundfile.py", line 629, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\soundfile.py", line 1183, in _open
    _error_check(_snd.sf_error(file_ptr),
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\soundfile.py", line 1357, in _error_check
    raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening 'audio_files/1_audi_file.wav': System error.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "G:\speech recognition\keyword_extraction_from_audio\speech_recognition.py", line 21, in <module>
    speech, rate = librosa.load(f"audio_files/{i+1}_audi_file.wav", sr=16000)
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\librosa\core\audio.py", line 166, in load
    y, sr_native = __audioread_load(path, offset, duration, dtype)
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\librosa\core\audio.py", line 190, in __audioread_load
    with audioread.audio_open(path) as input_file:
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\audioread\__init__.py", line 111, in audio_open
    return BackendClass(path)
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\audioread\rawread.py", line 62, in __init__
    self._fh = open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'audio_files/1_audi_file.wav'
>>>
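The last line of the traceback is the important one: the script looked for 'audio_files/1_audi_file.wav' and the file was not there. A check like the following, placed before the model is loaded, would fail immediately instead of after the long wait. It is a sketch of mine; the helper name require_audio_file is made up for the example.

```python
from pathlib import Path


def require_audio_file(path):
    """Return `path` as a Path, or raise a clear error if the file is missing."""
    audio_path = Path(path)
    if not audio_path.is_file():
        raise FileNotFoundError(
            f"Audio file not found: {audio_path.resolve()} "
            f"(current folder: {Path.cwd()})"
        )
    return audio_path


if __name__ == "__main__":
    try:
        require_audio_file("audio_files/1_audi_file.wav")
    except FileNotFoundError as error:
        print(error)
```

The message also prints the current folder, because a relative path like this one depends on where the script is started from.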
So, I decided to use a small wav file, a couple of seconds long.
It turned out that I had to install another module.
After this I launched the script again, and here is what I got after a couple of seconds:
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'.
The class this function is called from is 'Wav2Vec2Tokenizer'.
C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\transformers\models\wav2vec2\tokenization_wav2vec2.py:417: FutureWarning: The class `Wav2Vec2Tokenizer` is deprecated and will be removed in version 5 of Transformers. Please use `Wav2Vec2Processor` or `Wav2Vec2CTCTokenizer` instead.
  warnings.warn(
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
HI EVERYBODY COME VISIT MY BLOGHI EVERYBODY COME VISIT MY BLOG
Traceback (most recent call last):
  File "G:\speech recognition\keyword_extraction_from_audio\speech_recognition.py", line 51, in <module>
    nlp = spacy.load("en_core_web_lg")
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\spacy\__init__.py", line 51, in load
    return util.load_model(
  File "C:\Users\giova\AppData\Local\Programs\Python\Python39\lib\site-packages\spacy\util.py", line 328, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_web_lg'. It doesn't seem to be a Python package or a valid path to a data directory.
So not everything is perfect, but here is the output:
HI EVERYBODY COME VISIT MY BLOGHI EVERYBODY COME VISIT MY BLOG
OK, apart from BLOGHI and the repetition (?), it’s all good. The voice said: "Hi everybody, come visit my blog". I will see in the future if there is a way to make this clearer.
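A possible explanation for BLOGHI, and it is only a guess of mine: if the script transcribes the clip twice and glues the pieces together with += and no separator, the last word of the first piece sticks to the first word of the second. The sketch below reproduces that with plain strings, without running the model.

```python
# Two transcriptions of the same clip, as in the output above.
collection_of_text = [
    "HI EVERYBODY COME VISIT MY BLOG",
    "HI EVERYBODY COME VISIT MY BLOG",
]

# Gluing the pieces with += and no separator reproduces "BLOGHI".
glued = ""
for text in collection_of_text:
    glued += text
print(glued)

# Joining with a space keeps the words apart.
spaced = " ".join(collection_of_text)
print(spaced)
```

If this is what happens, the repetition comes from the loop running more than once, and a space in the join would at least keep the words readable.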
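In the end, I just used the script below.
I don’t know what spacy is for in the original script (it is a natural language processing library, so presumably it handles the keyword extraction part), but to print the text we just need this script (and the modules above, except spacy) and a wav file. That’s it.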
# librosa: a Python package for music and audio analysis.
# https://librosa.org/doc/latest/index.html
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

# Load the pretrained tokenizer and model (downloaded on first use).
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

collection_of_text = []

# for i in range(1):
# Load the wav file, resampled to 16 kHz.
speech, rate = librosa.load("myfile.wav", sr=16000)
input_values = tokenizer(speech, return_tensors='pt').input_values

# Run the model without tracking gradients (inference only).
with torch.no_grad():
    logits = model(input_values).logits

# Pick the most likely token at each step and decode it to text.
predicted_ids = torch.argmax(logits, dim=-1)
transcription = tokenizer.batch_decode(predicted_ids)[0]
collection_of_text.append(transcription)

final_complete_speech = ""
for i in collection_of_text:
    final_complete_speech += i

print(final_complete_speech)
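The FutureWarning in the output above says that Wav2Vec2Tokenizer is deprecated and suggests Wav2Vec2Processor instead. I have not timed or compared this version, so take it as a sketch of how the same steps could look with the processor; the function name transcribe and the constant MODEL_NAME are mine.

```python
MODEL_NAME = "facebook/wav2vec2-base-960h"


def transcribe(wav_path, model_name=MODEL_NAME):
    """Transcribe one wav file with Wav2Vec2Processor instead of the tokenizer."""
    # Imported here so the heavy libraries load only when a transcription runs.
    import librosa
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = Wav2Vec2ForCTC.from_pretrained(model_name)

    # The model expects 16 kHz audio; librosa resamples while loading.
    speech, rate = librosa.load(wav_path, sr=16000)
    inputs = processor(speech, sampling_rate=rate, return_tensors="pt")

    with torch.no_grad():
        logits = model(inputs.input_values).logits

    # Greedy decoding: most likely token at each time step.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

To use it, call print(transcribe("myfile.wav")) with the path of your own wav file.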
This gives the right result:
HI EVERYBODY COME VISIT MY BLOG