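Insanely Fast Whisper: An Ultra-Fast Whisper Speech Recognition Script

Project Overview

This article shows how fast OpenAI's Whisper Large v2 speech transcription model can run. Using Transformers and Optimum, 300 minutes (5 hours) of audio can be transcribed in under 10 minutes. The author offers several optimizations to speed up transcription, namely batching, half precision, and BetterTransformer, and reports measured results comparing them. The original also covers Whisper.cpp benchmarks, 4-bit inference benchmarks, and a community showcase of a CLI tool. The goal is to make Whisper more efficient when transcribing 2-3 hours of audio.

Basically, all you need to do is this: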

import torch
from transformers import pipeline

# Load Whisper large-v2 in half precision on the first GPU
pipe = pipeline("automatic-speech-recognition",
                "openai/whisper-large-v2",
                torch_dtype=torch.float16,
                device="cuda:0")

# Switch the model to BetterTransformer (requires the optimum package)
pipe.model = pipe.model.to_bettertransformer()

# Transcribe in 30-second chunks, 24 chunks per batch
outputs = pipe("<FILE_NAME>",
               chunk_length_s=30,
               batch_size=24,
               return_timestamps=True)

outputs["text"]
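To make the effect of chunk_length_s and batch_size more concrete, here is a rough back-of-the-envelope sketch. The helper name plan_chunks is hypothetical and not part of Transformers, and the count ignores the overlap (stride) that the pipeline adds between neighboring chunks, so real chunk counts will be somewhat higher.

```python
import math


def plan_chunks(audio_seconds, chunk_length_s=30, batch_size=24):
    """Estimate (chunks, batched forward passes) for one audio file.

    Simplified assumption: chunks do not overlap. The real pipeline
    overlaps adjacent chunks, so it produces more chunks than this.
    """
    num_chunks = math.ceil(audio_seconds / chunk_length_s)
    num_batches = math.ceil(num_chunks / batch_size)
    return num_chunks, num_batches


# 300 minutes of audio with the settings used above
chunks, batches = plan_chunks(300 * 60)
print(chunks, batches)  # 600 25

# Same audio without batching: one forward pass per chunk
print(plan_chunks(300 * 60, batch_size=1))  # (600, 600)
```

Under these assumptions, batching cuts the number of separate model calls from 600 to 25, which is where much of the speedup on a GPU would be expected to come from.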

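Don't believe it? Here are some benchmarks we ran on a free Google Colab T4 GPU!

Let's go!!

Here we will dig into optimizations that make Whisper faster, for fun and profit! Our goal is to transcribe a 2-3 hour audio file in as little time as possible. We will start with the most basic usage and then speed things up step by step!

The only fitting test audio for our benchmark is Lex's interview with Sam Altman. We will use the audio file for that podcast episode. I uploaded it to a small dataset on the Hub.

Installation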

pip install -q --upgrade torch torchvision torchaudio
pip install -q git+https://github.com/huggingface/transformers
pip install -q accelerate optimum
pip install -q ipython-autotime
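Before moving on, it can help to confirm that the packages installed above are actually importable. The following is a minimal sketch using only the standard library; the helper name missing_packages and the REQUIRED list are illustrative assumptions, not part of the original notebook.

```python
from importlib.util import find_spec

# Import names of the packages installed above (assumed list for illustration)
REQUIRED = ["torch", "transformers", "accelerate", "optimum"]


def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

This only checks that the modules can be located; it does not verify versions or that a CUDA GPU is available.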

Let's download the audio file for the podcast.

wget https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/sam_altman_lex_podcast_367.flac

Base Case

import torch
from transformers import pipeline

# Baseline: default precision, no batching, no BetterTransformer
pipe = pipeline("automatic-speech-recognition",
                "openai/whisper-large-v2",
                device="cuda:0")

outputs = pipe("sam_altman_lex_podcast_367.flac",
               chunk_length_s=30,
               return_timestamps=True)

# Preview the first 200 characters of the transcript
outputs["text"][:200]

Sample output:

We have been a misunderstood and badly mocked org for a long time. When we started, we announced the org at the end of 2015 and said we were going to work on AGI, people thought we were batshit insan

Time to transcribe the entire podcast: 31 minutes 1 second
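The notebook relies on ipython-autotime to report cell run times. Outside a notebook, a timing like the one above could be captured with a small wrapper such as the sketch below. The helper names timed, format_duration, and speedup_over_real_time are hypothetical, and the last lines simply restate the figures quoted in this article (31 minutes 1 second; 300 minutes of audio in 10 minutes) rather than new measurements.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def format_duration(seconds):
    """Format a duration in seconds as minutes and seconds."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs}s"


def speedup_over_real_time(audio_seconds, wall_seconds):
    """How many seconds of audio are processed per second of wall time."""
    return audio_seconds / wall_seconds


# The base-case timing reported above, expressed in seconds
print(format_duration(31 * 60 + 1))  # 31m 1s

# 300 minutes of audio in 10 minutes corresponds to a 30x speedup
print(speedup_over_real_time(300 * 60, 10 * 60))  # 30.0
```

With a loaded pipeline, usage would look like outputs, elapsed = timed(pipe, "sam_altman_lex_podcast_367.flac", chunk_length_s=30, return_timestamps=True), followed by print(format_duration(elapsed)).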

Project Link

https://github.com/Vaibhavs10/insanely-fast-whisper

Source: https://mp.weixin.qq.com/s/imKBWr0aSLucE8UJ6Lanqg