[HN Gopher] Whisper - open source speech recognition by OpenAI ___________________________________________________________________ Whisper - open source speech recognition by OpenAI Author : _just7_ Score : 850 points Date : 2022-09-21 16:16 UTC (6 hours ago) (HTM) web link (openai.com) (TXT) w3m dump (openai.com) | wongarsu wrote: | > About a third of Whisper's audio dataset is non-English, and it | is alternately given the task of transcribing in the original | language or translating to English. We find this approach is | particularly effective at learning speech to text translation and | outperforms the supervised SOTA on CoVoST2 to English translation | zero-shot. | | That's intriguing. You can just set the model to transcribe | everything into English, no matter which language the speaker is | using, and it just works. Given that many people are much better | at understanding English than at speaking it, this might make | voice interfaces much more accessible without much work. | FloatArtifact wrote: | This would be a cool thing to integrate into Dragonfly | https://github.com/dictation-toolbox/dragonfly | rexreed wrote: | I'd love to find a way to test this with longer audio but I don't | have GPU resources and not exactly sure how to load that into the | Colab. Is anyone planning on hosting or sharing a model that can | be used by others to test longer form audio (for podcast | transcription)? | londons_explore wrote: | I've never seen transcription and translation combined into a | single step like this before... | | Have I been living under a rock, or is this new? | | I assume it should help performance, because it means emphasis, | timing and tone can be used to inform the translation. Helps make | better guesses about information missing from the source | language. | jerpint wrote: | I recorded myself speaking French and was able to translate | decently well on my laptop. Very impressive! | jfoster wrote: | It seems like OpenAI are finally living up to their name for once | with this release? Anything I'm missing? | | From what I can gather: | | 1. Includes model weights. I can't find the URL, but they | reference them enough and have a CLI tool, so I presume I just | haven't found them yet. | | 2. Includes code: https://github.com/openai/whisper | | 3. Released under MIT License: | https://github.com/openai/whisper/blob/main/LICENSE | thesausageking wrote: | It's one model and in a non-strategic area where there are | existing open source projects (Kaldi, DeepSpeech, ...). | | For a company that raised $1B, that's not exactly living up to | their name and original mission. | whimsicalism wrote: | > It's one model and in a non-strategic area where there are | existing open source projects (Kaldi, DeepSpeech, ...). | | I can already tell this is much better than any of the | existing open source projects with the exception of the wav2* | sequence of projects and potentially nvidia's nemo. | StevenWaterman wrote: | (Model weights from | https://github.com/openai/whisper/blob/main/whisper/__init__... | ) | | "tiny.en": "https://openaipublic.azureedge.net/main/whisper/mod | els/d3dd5..." | | "tiny": "https://openaipublic.azureedge.net/main/whisper/models | /65147..." | | "base.en": "https://openaipublic.azureedge.net/main/whisper/mod | els/25a85..." | | "base": "https://openaipublic.azureedge.net/main/whisper/models | /ed3a0..." | | "small.en": "https://openaipublic.azureedge.net/main/whisper/mo | dels/f953a..." 
| | "small": "https://openaipublic.azureedge.net/main/whisper/model | s/9ecf7..." | | "medium.en": "https://openaipublic.azureedge.net/main/whisper/m | odels/d7440..." | | "medium": "https://openaipublic.azureedge.net/main/whisper/mode | ls/345ae..." | | "large": "https://openaipublic.azureedge.net/main/whisper/model | s/e4b87..." | mmastrac wrote: | Large is 3GB to save everyone a click. Tiny is 72MB. | anigbrowl wrote: | That's unexpectedly lightweight - enough to run in some | phones. | solarmist wrote: | This kind of model is harder to abuse, so I guess it passed | their internal checks much more easily. | | I can understand not releasing GPT-3, even if I disagree with | the decision. | ignoramous wrote: | > _This kind of model is harder to abuse, so I guess it | passed their internal checks much more easily._ | | The version I choose to believe: _stability.ai_ ate DALL-E | for lunch, and that woke them up. | solarmist wrote: | This is probably also true. | jfoster wrote: | True. The potential of GPT-3 to cause internet mayhem was/is | significant. I would argue that the mere act of announcing it | was still a catalyst for an eventual GPT-3-like model being | released. In revealing it, they established a target for what | open source models could aim to achieve, and simultaneously | got bad actors thinking about ways to abuse it. | dwohnitmok wrote: | > I can understand not releasing GPT-3, even if I disagree | with the decision. | | Why do you disagree? | bigyikes wrote: | I don't see how GPT-3 is any more dangerous than Stable | Diffusion, Photoshop, that fake news website the crazy | person you're friends with on Facebook really likes, or any | of the number of other tools and services that can be used | to generate or spread fake information. | jfoster wrote: | All of your examples are limited in some way, but GPT-3 | wouldn't have any meaningful limits. | | Stable Diffusion: Marks images as AI-generated. | (invisible watermark, but still, it's there) | | Photoshop: Requires time & effort from a human. | | Fake news website: Requires time & effort from a human. | xkapastel wrote: | I wouldn't really say Stable Diffusion marks images as | AI-generated. There's a script in the Stable Diffusion | repository that will do that, but it's not connected to | the model itself in a meaningful way. I use Stable | Diffusion a lot and I've never touched this script. | | https://github.com/CompVis/stable- | diffusion/blob/69ae4b35e0a... | capableweb wrote: | What "script" are you using for doing txt2img? The | watermark function is automatically called when you use | the CLI in two places, https://github.com/CompVis/stable- | diffusion/blob/69ae4b35e0a... and | https://github.com/CompVis/stable- | diffusion/blob/69ae4b35e0a... | | Trivial to remove, I give you that. But AFAIK, the | original repository + most forks put the watermark | automatically unless you've removed it on your own. | spullara wrote: | SD only does that if you don't delete the line of code | that does it... | mmh0000 wrote: | Because why should the wealthy and connected be the only | ones -allowed- have access to such life improving | technology? | solarmist wrote: | Two reasons. First, someone else will release something | similar. Second, I didn't see a related push from them to | work with other in the industry to do something productive | towards safety with the time they got by delaying | availability of these kinds of models. So it felt | disingenuous. 
| bredren wrote:
| This is dropping right in the middle of Interspeech 2022.
|
| I don't believe OpenAI has anyone presenting at the conference,
| so presumably this was timed to coincide with that and get buzz
| at the conference.
|
| Curious how this model compares with FOSS STT from the startup
| Coqui.
| revskill wrote:
| It's actually better than Google Meet's subtitle system.
| blueberrychpstx wrote:
| This is absolute garbage python as I am neither a python
| developer, nor a good developer. I was trying to play around with
| real time transcriptions. However, it does work!
|
| > * recording
| > * done recording
| > Recording saved to file.wav
| > Press enter to transcribe
| > /Users/laptop/Development/Personal/Public/pythonProject1/venv/lib/python3.9/site-packages/whisper/transcribe.py:70: UserWarning: FP16 is not supported on CPU; using FP32 instead
| >   warnings.warn("FP16 is not supported on CPU; using FP32 instead")
| > Detected language: english
| > Goodbye, I need to go pick up my wife.
| > Press enter to start recording
|
| Any improvements welcome here.
|
| ```
| import wave
|
| import pyaudio
| import whisper
|
|
| def record_microphone(seconds, filename="file.wav"):
|     """Record mono 16-bit audio from the default input device to a WAV file."""
|     chunk = 1024
|     sample_format = pyaudio.paInt16
|     channels = 1
|     rate = 44100
|
|     p = pyaudio.PyAudio()
|     stream = p.open(format=sample_format, channels=channels, rate=rate,
|                     input=True, frames_per_buffer=chunk)
|     print("* recording")
|     frames = []
|     for _ in range(int(rate / chunk * seconds)):
|         frames.append(stream.read(chunk))
|     print("* done recording")
|     stream.stop_stream()
|     stream.close()
|     sample_width = p.get_sample_size(sample_format)
|     p.terminate()
|
|     with wave.open(filename, 'wb') as wf:
|         wf.setnchannels(channels)
|         wf.setsampwidth(sample_width)
|         wf.setframerate(rate)
|         wf.writeframes(b''.join(frames))
|     return filename
|
|
| if __name__ == '__main__':
|     seconds = 5
|     model = whisper.load_model("base")  # load once, not on every loop
|     while True:
|         print("Press enter to start recording")
|         input()
|         filename = record_microphone(seconds)
|         print("Recording saved to " + filename)
|         print("Press enter to transcribe")
|         input()
|         result = model.transcribe(filename)
|         print(result["text"])
| ```
| yawnxyz wrote:
| Oh man I remember LOVING Micro Machines as a kid.
|
| But also, this tool seems much better than Otter.ai, which gets
| every third word wrong when transcribing microbiology recordings
| alexb_ wrote:
| Combine the translation + transcription with voice synthesis, and
| once compute power allows for this to be miniaturized we will be
| able to have babel-fish technology in real life.
| no1youknowz wrote:
| This is awesome. But I really want the other way.
|
| To be able to give it text and hear the speech. A TTS (text to
| speech).
|
| As a language learner, the ability to create my own sentences
| (based on existing ones I have, changing a word here or there)
| would be amazing.
|
| How long till we have this, I wonder. I know I could use a
| service to do this currently. But I'd prefer having something
| running locally.
|
| Hopefully someone in the OpenAI team reads this. :)
| TaylorAlexander wrote:
| I suspect this is coming.
I mean, we do have decent text to
| speech systems already, but in this vein of "we used neural
| networks and now it's very, very good" you can imagine that with
| something like GPT-3, they could extend it with this speech to
| text system so you could speak to it for input, and then a
| natural progression is that it can use text to speech to return
| the output, so you just have a voice-oriented conversational
| system.
|
| So I think TTS is a logical part of the system. I also think
| that there are peculiarities of voice interaction that aren't
| captured in text training datasets, so they would need to do
| some fine tuning on actual voice conversation to make it feel
| natural.
|
| All in due time I suppose.
| noreally_ wrote:
| A notebook is available to try with your microphone on Colab
| here: https://colab.research.google.com/drive/1nBZ-pDIaIi3N1DIIXvJ...
|
| I'm surprised by the quality on non-English languages, given that
| 80+% of the training data is English, and the rest is split
| between tens of languages.
| bambax wrote:
| Thanks! I played with this in French and posted the results as
| replies to this comment:
| https://news.ycombinator.com/item?id=32928643
|
| It's sometimes close to perfect, and sometimes goes off the
| rails; I think that maybe the model tries to establish some sort
| of consistency for each sentence; if it starts wrong for the
| first few words of a sentence, it can't build the rest properly.
|
| But it's super fun.
| goffi wrote:
| Really interesting, I can see a ton of potential uses.
|
| 2 questions:
|
| 1) how does it compare to state of the art FOSS solutions? I'm
| thinking of DeepSpeech or Vosk
|
| 2) would it be somehow possible to associate timestamps with the
| words recognized? That would be amazing for things such as audio
| editing or skipping to a particular location in a video
| nshm wrote:
| You rightly mention timestamps. There are many other important
| properties of a good ASR system, like vocabulary adaptability
| (whether you can introduce new words) or streaming. Or
| confidences. Or latency of the output. Compared to Vosk models,
| this model cannot work in a streaming manner, so it's not very
| suitable for real-time applications.
|
| But in general the model is robust and accurate, and trained on
| an amount of speech we never dreamed about in Vosk. We will
| certainly benefit from this model as a teacher (together with
| others like gigaspeech models). I recently wrote about it:
| https://alphacephei.com/nsh/2022/06/14/voting.html
| goffi wrote:
| > goffi
|
| for 2), it's actually written in the description: "phrase-level
| timestamps", so it should be possible (phrase level is neat for
| skipping to a specific location in a video, but maybe not for
| audio editing).
| IceWreck wrote:
| Is there a list of system requirements somewhere? Can it run on
| cheaper low-memory GPUs? Maybe CPUs?
| StevenWaterman wrote:
| Their models range from 70MB to 3GB. The largest model is
| smaller than the optimised Stable Diffusion. Not sure what the
| inference speed is like, haven't tried it myself yet.
| IceWreck wrote:
| I just tested it myself. It's fast enough on Colab, a couple
| of seconds, but not sure if it's fast enough to transcribe
| realtime audio yet.
| [deleted]
| mewse-hn wrote:
| I know this isn't a tech support forum but maybe someone here
| knows.
I'm attempting the sample python code from the github and | _almost_ get a transcription running on my work laptop without a | GPU, but I run into this error message: | | >>> result = whisper.decode(model, mel, options) | | Traceback (most recent call last): | | [snip] | | RuntimeError: "slow_conv2d_cpu" not implemented for 'Half' | | It looks like a Torch error, is there some twiddling with | "options" I can do to get it to run? | mewse-hn wrote: | I seem to have worked around it by tweaking the "options" line | from the sample code to this: | | >>> options = whisper.DecodingOptions(fp16=False) | O__________O wrote: | Anyone know if it is possible to output IPA using this? | | International Phonetic Alphabet (IPA) | | - https://wikipedia.org/wiki/International_Phonetic_Alphabet | | _________ | | EDIT: Based on list of languages in the tokenizer code here, | doesn't appear IPA is supported: | | https://github.com/openai/whisper/blob/5f8d4bcc254d4f3e833d3... | jcims wrote: | Did respectably with some mumble rap: | https://controlc.com/d353dafb | | (some NSFW words in the lyrics obv) | derangedHorse wrote: | Whisper performed a lot better than I would've expected it to! | mmh0000 wrote: | Okay this is super impressive. I just downloaded Whisper and fed | it a random flac file I had handy and it did a really good job. | Also impressive that it works on my weak CPU: | | A 3m07s flac took 5m to transcribe: $ whisper | --device cpu 'BLACKPINK - BORN PINK/01 Pink Venom.flac' | Detecting language using up to the first 30 seconds. Use | `--language` to specify the language Detected language: | korean [00:00.000 --> 00:10.000] Blackpink | [00:11.000 --> 00:14.000] Kick in the door, wave in the coco | [00:14.000 --> 00:16.000] pabkonineun cinge ggyeodeul saenggag | malgo [00:16.000 --> 00:19.000] I talk to talk, run ways I | walk walk [00:19.000 --> 00:21.000] him gamgo pab pab an | bwado ceog [00:21.000 --> 00:24.000] By one and two by two | [00:24.000 --> 00:26.000] nae songgeut du hanae tamyeon ajieun | jung [00:26.000 --> 00:30.000] gas jasyo jigeum hwaryeohae | T makes no sense [00:30.000 --> 00:32.000] You couldn't | get a dollar out of me [00:33.000 --> 00:38.000] ja oneul | bamiya nuntobeul pumgo [00:38.000 --> 00:41.000] mihoneul | bbaeseum down [00:41.000 --> 00:43.000] Look what you made | us do [00:43.000 --> 00:47.000] ceonceonhi neol jamjaeul | paieo [00:48.000 --> 00:52.000] jami nal mankeum | areumdaweo [00:52.000 --> 00:53.000] I bring the pain like | [00:53.000 --> 00:57.000] diseutab, paengpaeng, diseutab, | paengpaeng, diseutab, paengpaeng, paengpaeng [00:57.000 --> | 00:58.000] Get em, get em, get em [00:58.000 --> | 01:00.000] Straight till you don't like [01:00.000 --> | 01:01.000] Whoa, whoa, whoa [01:01.000 --> 01:03.000] | Straight till you don't like [01:03.000 --> 01:04.000] Ah, | ah, ah [01:04.000 --> 01:05.000] Taste that, pink venom | [01:05.000 --> 01:06.000] Taste that, pink venom | [01:06.000 --> 01:08.000] Taste that, pink venom | [01:08.000 --> 01:09.000] Get em, get em, get em | [01:09.000 --> 01:11.000] Straight till you don't like | [01:11.000 --> 01:12.000] Whoa, whoa, whoa [01:12.000 --> | 01:13.000] Straight till you don't like [01:13.000 --> | 01:14.000] Ah, ah, ah [01:14.000 --> 01:15.000] Blackpink | and Amo [01:15.000 --> 01:17.000] Got it by the smack ram | [01:17.000 --> 01:18.000] But rest in peace [01:18.000 --> | 01:19.000] Please light up a candle [01:19.000 --> | 01:20.000] This the knife of a vando [01:20.000 --> | 01:22.000] Messed up and I'm 
still in saline ...SNIP...
| lunixbochs wrote:
| Looks like it defaults to the model called "small".
|
| I just ran some benchmarks - M1 Max, pytorch, with a 1.29 second
| flac (looks like the matrix math was running on a single
| thread):
|
|     tiny     146.522ms detect_lang     549.131ms decode_one   0.057ms tokenizer
|     base     354.885ms detect_lang    1046.679ms decode_one   0.011ms tokenizer
|     small    803.892ms detect_lang    3194.503ms decode_one   0.017ms tokenizer
|     medium  2279.689ms detect_lang   10128.255ms decode_one   0.023ms tokenizer
|     large   3656.478ms detect_lang   17249.024ms decode_one   0.016ms tokenizer
| lazylion2 wrote:
| I ran it on this clip
|
| https://clips.twitch.tv/ReliablePopularWerewolfOSkomodo-pcuw...
|
| because... hard accent.
|
| first run, whisper thought it was Welsh, so I had to run with
| --language en, and it did pretty well
|
| https://i.imgur.com/TQiYU9X.png
|
| took 36 seconds in Google colab
| manishsharan wrote:
| Oh, this is a relief to have something open source in this
| field. I had been using Mozilla DeepSpeech for transcribing my
| voice notes, often with hilarious to incomprehensible results.
| DeepSpeech is dead; so I will be sure to check this out.
| w10-1 wrote:
| Naively, training the same model on multiple languages has
| interesting implications.
|
| On one hand, it may capture something "deeper" about language.
|
| On the other hand, it's likely to do great in general, but miss
| particularities of some language.
|
| Understanding the coverage of the training model seems a
| perennial problem. Is there any (shorthand) way to compare
| language model training corpora?
|
| Clearly if they use common subsets we have a literal comparison.
| I'm more interested in whether there's progress in characterizing
| corpora by speech styles, fluency, vocabulary sets, (noise)
| environment, emotionality, proposition types, etc.
|
| (btw: 25 minutes for a 9-minute segment on a 12-thread x86. Lots
| of jargon spelled as it sounds. Sentences capitalized but no
| punctuation. Overall good.)
| dindindin wrote:
| I'm not in the Speech Recognition circles and am looking for open
| source speech recognition I can play around with - would this be
| the new state of the art?
| mercurywells wrote:
| For me as a deaf person the current state of the art (in terms
| of speed & usability) is the Recorder app on a Google Pixel
| phone (4a/6 Pro is what I've used)
| StevenWaterman wrote:
| Yes
| visarga wrote:
| Most probably
| The5thElephant wrote:
| How is it Apple, Google, or Microsoft are not further ahead of
| the game on speech recognition like this? They have the resources
| to hire the best ML researchers and throw tons of computing hours
| at it, yet Siri, Google, and Cortana continue to struggle to get
| anywhere near this level of comprehension.
| wongarsu wrote:
| Siri and Cortana have to run at least in real time, with
| reasonable compute resources. Probably faster than real time
| when the audio gets shipped off to the cloud and transcribed
| there. This model can't do that (in the "large" version, which
| the examples use).
|
| Also, you are comparing Whisper's highlight reel with everyday
| performance of other models. Nobody shows their weaknesses in
| their highlight reel.
| alex_marchant wrote:
| Siri until iOS 15 was done in the cloud IIRC.
| coder543 wrote:
| Someone else in this thread[0] said Whisper was running at
| 17x real time for them. So, even a weak machine might be able
| to do an acceptable approximation of real time with Whisper.
| | Also, I feel like shipping to the cloud and back has been | shown to be just as fast as on device transcription in a lot | of scenarios. Doing it on device is primarily a benefit for | privacy and offline, not necessarily latency. (Although, | increasingly powerful smartphone hardware is starting to give | the latency edge to local processing.) | | Siri's dictation has had such terrible accuracy for me (an | American English speaker without a particularly strong | regional accent) and everyone else I know for so many years | that it is just a joke in my family. Google and Microsoft | have much higher accuracy in their models. The bar is so low | for Siri that I automatically wonder how much Whisper is | beating Siri in accuracy... because I assume it has to be | better than that. | | I really wish there was an easy demo for Whisper that I could | try out. | | [0]: https://news.ycombinator.com/item?id=32928207 | lunixbochs wrote: | 17x realtime _on a 3090_ | | I did some basic tests on CPU, the "small" Whisper model is | in the ballpark of 0.5x realtime, which is probably not | great for interactive use. | | My models in Talon run closer to 100x realtime on CPU. | coder543 wrote: | "CPU" isn't necessarily the benchmark, though. Most | smartphones going back years have ML inference | accelerators built in, and both Intel and AMD are | starting to build in instructions to accelerate | inference. Apple's M1 and M2 have the same inference | accelerator hardware as their phones and tablets. The | question is whether this model is a good fit for those | inference accelerators, and how well it works there, or | how well it works running on the integrated GPUs these | devices all have. | | Brute forcing the model with just traditional CPU | instructions is fine, but... obviously going to be pretty | slow. | | I have no experience on the accuracy of Talon, but I've | heard that most open source models are basically overfit | to the test datasets... so their posted accuracy is often | misleading. If Whisper is substantially better in the | real world, that's the important thing, but I have no | idea if that's the case. | lunixbochs wrote: | See https://news.ycombinator.com/item?id=32929029 re | accuracy, I'm working on a wider comparison. My models | are generally more robust than open-source models such as | Vosk and Silero, but I'm definitely interested in how my | stuff compares to Whisper on difficult held-out data. | | > Brute forcing the model with just traditional CPU | instructions is fine, but... obviously going to be pretty | slow. | | It's not that simple. Many of the mobile ML accelerators | are more targeted for conv net image workloads, and | current-gen Intel and Apple CPUs have dedicated hardware | to accelerate matrix math (which helps quite a bit here, | and these instructions were in use in my tests). | | Also, not sure which model they were using at 17x | realtime on the 3090. (If it's one of the smaller models, | that bodes even worse for non-3090 performance.) The 3090 | is one of the fastest ML inference chips in the world, so | it doesn't necessarily set realistic expectations. | | There are also plenty of optimizations that aren't | applied to the code we're testing, but I think it's | fairly safe to say the Large model is likely to be slow | on anything but a desktop-gpu-class accelerator just due | to the sheer parameter size. | lunixbochs wrote: | Ok, my test harness is ready. 
My A40 box will be busy
| until later tonight, but on an NVIDIA A2 [1], this is the
| batchsize=1 throughput I'm seeing. Common Voice, default
| Whisper settings, card is staying at 97-100% utilization:
|
|     tiny.en:   ~18  sec/sec
|     base.en:   ~14  sec/sec
|     small.en:   ~6  sec/sec
|     medium.en: ~2.2 sec/sec
|     large:     ~1.0 sec/sec (fairly wide variance when ramping
|                up as this is slow to process individual clips)
|
| [1] https://www.nvidia.com/en-us/data-center/products/a2/
| coder543 wrote:
| Isn't the A2 much weaker than a 3090? So those results are
| promising.
|
| EDIT: for what it's worth, Nvidia rated the A2 at 18 TFLOPS of
| FP16, and Apple rates the current A16 Neural Engine at 17
| TFLOPS of FP16. I'm sure it's not an "apples to apples"
| comparison.
| The5thElephant wrote:
| Good point about realtime or not, however with ML I have found
| the weaknesses get addressed pretty fast by someone. There is a
| big step between proof of concept and practical application
| though, so we shall see.
| Kuinox wrote:
| OpenAI is owned by Microsoft FYI.
| neongreen wrote:
| Is it? Googling suggests that Microsoft invested in OpenAI
| but doesn't actually own it.
| Kuinox wrote:
| Oh, my bad, looks like they only bought an exclusive license
| to GPT-3.
| fxtentacle wrote:
| This AI has a 30 second delay on the audio processing because
| it needs to be able to "look into the future" to get these good
| results. That 30s delay would be unacceptable for
| Siri/Google/Cortana.
| coder543 wrote:
| A lot of models we currently use seem to do the same thing.
| The model will transcribe a "best effort" interpretation in
| real time, then as you continue speaking, you'll see it go
| back and make corrections. I'm sure you can feed the first X
| seconds you have into the model, followed by (30-X) seconds
| of silence, and it will do real time transcription just
| fine... it would be weird if this broke anything. Then, as
| you get more speech, you continue getting better
| transcription of the first 30 seconds, then you switch to a
| 30 second sliding window.
|
| Maybe I'm missing something, but I don't see the problem
| here.
| fxtentacle wrote:
| Yes, that's because Whisper - like pretty much all of them
| - uses a Transformer encoder with Attention layers. And the
| Attention layers learn to look into the future.
|
| And yes, what you describe could be done. But no, it won't
| reduce latency that much, because the model itself learns
| to delay the prediction w.r.t. the audio stream. That's why
| ASR-generated subtitles usually need to be re-aligned after
| the speech recognition step. And that's why there is
| research such as the FastEmit paper to prevent that, but
| then it is a trade-off between latency and quality again.
|
| Also, running your "low-latency" model with 1s chunks means
| you now need to evaluate the AI 30x as often as if you were
| using 30s chunks.
| coder543 wrote:
| You just said the models pretty much all work the same way,
| then you said doing what I described won't help. I'm
| confused. Apple and Google both offer real time, on device
| transcription these days, so _something_ clearly works.
| And if you say the models already all do this, then running
| it 30x as often isn't a problem anyways, since again...
| people are used to that.
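|
| The padding idea above could look something like this (a rough
| sketch, not Whisper's shipped behavior - pad_or_trim,
| log_mel_spectrogram and decode are the real lower-level API
| from the README, while the sliding-window loop around them is
| hypothetical):
|
| ```
| import numpy as np
| import whisper
|
| SAMPLE_RATE = 16000          # Whisper's expected input rate
| WINDOW = 30 * SAMPLE_RATE    # the model's fixed 30 s context
|
| model = whisper.load_model("base.en")
|
| def decode_partial(buffer: np.ndarray) -> str:
|     """Decode a growing float32 audio buffer as it arrives."""
|     window = buffer[-WINDOW:]                    # 30 s sliding window
|     audio = whisper.pad_or_trim(window, WINDOW)  # pad with silence to 30 s
|     mel = whisper.log_mel_spectrogram(audio).to(model.device)
|     options = whisper.DecodingOptions(language="en", fp16=False)
|     return whisper.decode(model, mel, options).text
| ```
|
| Each call re-decodes the latest 30 seconds, so earlier text can
| still be revised until it slides out of the window, which
| matches the "best effort, then corrected" behavior described
| above.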
| | I doubt people run online transcription for long periods | of time on their phone very often, so the battery impact | is irrelevant, and the model is ideally running (mostly) | on a low power, high performance inference accelerator | anyways, which is common to many SoCs these days. | fxtentacle wrote: | I meant that most research that has been released in | papers or code recently uses the same architecture. But | all of those research papers use something different than | Apple and Google. | | As for running the AI 30x, on current hardware that'll | make it slower than realtime. Plus all of those 1GB+ | models won't fit into a phone anyway. | beastman82 wrote: | In my unmeasured empirical observation Google has amazing | speech recognition | jeffbee wrote: | I tried feeding the four examples from this announcement into | Google as dictation inputs and it just sits there blankly. On | the JFK speech test file in the repo, Google understands | perfectly. The samples in the announcement are clearly | outside the capabilities of anything Google has launched | publicly, but I don't know how that translates to overall | utility in every day applications. | The5thElephant wrote: | I agree they have the best compared to Apple, Amazon, | Microsoft. However I don't think it is as good as what is | being shown here by OpenAI. | Vetch wrote: | My experience with the APIs is Google is excellent and | Microsoft is slightly better. And the offline model I've | been using that's nearly as good as both is facebook's | wav2vec2-large-960h-lv60-self. | | Don't believe what's on marketing pages, they rarely | transfer to the real world. Will have to make time to try | it and see. In theory, given task diversity and sheer | number of hours, it should be a lot more robust but will | wait on evidence before believing any claims on SoTA. | andy_xor_andrew wrote: | Hold on, it does not only speech recognition, but also language | translation, in the same model? | | What an interesting approach. What benefits does this have over | having two dedicated models, one for speech-to-text, and another | for translation? | | It just seems so odd, given the problems of speech-to-text and | Spanish-to-English seems so different from one another (in terms | of the problem domain). Seems so unusual to have both handled by | one model! | | Does knowledge of speech-to-text carry over into knowledge of | translation? Does knowledge of translation carry over into | knowledge of speech-to-text? So weird. | newhaus1994 wrote: | My understanding is that multi-modal models are the primary | focus of OpenAI right now, due to their stated goal of | achieving AGI. This product is probably better thought of as an | offshoot of their work to create a fully generalizable model, | rather than a specific attempt to provide | translation/transcription services. | TaylorAlexander wrote: | It seems these days that language-oriented models are commonly | becoming multilingual by default. There are a lot of common | threads when understanding sentence construction between | different languages. French and English have different rules | but they will still have things like nouns, adjectives, | subjects, prepositions, etc. It seems that by training models | on many languages you get both a more robust understanding of | language, and it saves you the trouble of having to make many | more localized models for every language. 
I also believe that
| the other languages help the models construct sentences in
| languages which have very small training sets. If it has a few
| examples in a rare language as well as good translations to a
| better-known language, then it can provide good support for the
| rare language.
|
| We also see in image generation models that multi-modal
| networks are more powerful than single purpose networks. As we
| move towards more advanced AI systems I suspect we will see
| more and more generalizable networks with distinct advantages
| over separate networks that get plugged together.
| magicalhippo wrote:
| Would a multilingual model perhaps also be better at
| understanding non-native speech?
| thuttinger wrote:
| I tried running it in realtime with live audio input (kind of).
|
| If you want to give it a shot, you can find the python script in
| this repo: https://github.com/tobiashuttinger/openai-whisper-realtime
|
| A bit more context on how it works: The system's default audio
| input is captured with Python, split into small chunks and is
| then fed to OpenAI's original transcription function. It tries
| (currently rather poorly) to detect word breaks and doesn't split
| the audio buffer in those cases. With how the model is designed,
| it doesn't make the most sense to do this, but I found it would
| be worth trying. It works acceptably well.
| minimaxir wrote:
| The model output can be tweaked to produce audio embeddings (akin
| to BERT for text embeddings and CLIP for image embeddings), which
| can lead to some _interesting_ applications as the previous two
| examples have demonstrated.
| FerociousTimes wrote:
| What do you mean exactly by audio embeddings?
| minimaxir wrote:
| Represent a given set of audio inputs as a numeric vector,
| which can then for example be finetuned for other ML/AI
| problems or placed in an embeddings database for easy ANN
| search with similar audio clips. In the extreme case it could
| facilitate better AI audio generation, similar to how CLIP can
| guide a VQGAN.
|
| Although the 30 second minimum input is a bit of a bummer
| since it may not allow much granularity in the resulting
| embeddings.
| lynguist wrote:
| How can I use this (or something similar) for live translation? I
| don't mind if there's a 30s delay.
|
| As in I don't want to input a file, I want to input the
| microphone sound.
| agnos wrote:
| Would also like to know this. It looks like they're processing
| the audio file in 30 second chunks, so a naive approach of
| keeping a buffer of 30-second input stream chunks and just
| continually writing to an output .mp3 could work...
| blueberrychpstx wrote:
| Was wondering the same.
|
| I really wish I would have been paying attention in Unix
| class...
|
| Something like `microphone | chunk 3s | whisper | stdout` would
| be SO COOL!!! I think that's possible but too lazy to look
| more.
| spywaregorilla wrote:
| Hmm, are there any noteworthy open sourced speech to speech
| models? Like transform a spoken line to another voice, copying
| both the words spoken and the inflections?
| cercatrova wrote:
| Their Scottish accent example is pretty good, I'd like to see it
| work on some very strong English accents like this one:
| https://www.youtube.com/watch?v=nJ7QB3om-QY
| homarp wrote:
| Detected language: english
|
| [00:00.000 --> 00:05.400] Gordy and County Kerry are
| investigating the theft of up to 60 sheep on Mount Brandon.
|
| [00:05.400 --> 00:10.400] One of the farmers is offering a reward for information leading to the return of the use,
| [00:10.400 --> 00:12.200] which are worth thousands of euro.
| [00:12.200 --> 00:14.200] Well, I'm fine with that.
| [00:14.200 --> 00:15.200] That's right.
| [00:15.200 --> 00:16.200] Do you own them?
| [00:16.200 --> 00:17.200] Anyone can say it.
| [00:17.200 --> 00:18.200] Fine with that.
| [00:18.200 --> 00:22.720] Last Saturday, Mikey Joe O'Shea brought his flock of Scotch sheep down from the mountain
| [00:22.720 --> 00:25.320] commonage ahead of lambing.
| [00:25.320 --> 00:29.840] He discovered over 50 were missing, allowing for a number of deaths and
| [00:29.840 --> 00:30.840] strays.
| [00:30.840 --> 00:34.600] Mikey is convinced over 45 sheep have been stolen.
| [00:34.600 --> 00:35.600] It was a good night.
| [00:35.600 --> 00:36.600] It would be a full moon there.
| [00:36.600 --> 00:37.600] It would be a good night.
| [00:37.600 --> 00:38.600] It would be bright out.
| [00:38.600 --> 00:40.600] There could be anyone going up in the mountains.
| [00:40.600 --> 00:41.600] It would be a good night.
| [00:41.600 --> 00:43.600] Well, that was 45 sheep missing.
| [00:43.600 --> 00:49.600] Mikey and the lambs and everything in the sheep, they counted out a nice bit of money.
| [00:49.600 --> 00:52.200] They've been doing the boat in Nassan.
| [00:52.200 --> 00:53.200] It's a big one.
| [00:53.200 --> 00:54.200] It's a big one.
| [00:54.200 --> 00:55.200] It's a big one.
| [00:55.200 --> 00:59.000] Mikey's next door neighbor says some of his sheep have also been stolen.
| [00:59.000 --> 01:00.000] Come back.
| [01:00.000 --> 01:01.000] Come back.
| [01:01.000 --> 01:02.000] Come back.
| [01:02.000 --> 01:03.000] I've been missing about 10 years.
| [01:03.000 --> 01:04.000] It's not all that difficult.
| [01:04.000 --> 01:06.320] All they've got to do is have a good dog.
| [01:06.320 --> 01:10.560] Have a good dog and go at night, some moonshine night.
| [01:10.560 --> 01:11.560] Just put the dog around him.
| [01:11.560 --> 01:14.120] Put him on a trailer and walk him.
| [01:14.120 --> 01:18.360] And then probably somebody else to pick him up.
| [01:18.360 --> 01:29.960] Everybody's doing it north, but he's doing it.
| cercatrova wrote:
| Wow that is incredibly impressive. At 0:53 is it translating
| as well? Didn't sound like English to me.
| mod wrote:
| Those are Irish.
| biggerChris wrote:
| We have reached sentient mode.
| dom96 wrote:
| This really makes me want to build an Amazon Echo/Google Nest/etc
| replacement that's open hardware, open source and most
| importantly recognises voice completely offline. I find that I
| don't use these smart devices for much more than setting timers
| anyway so this seems like an easy project.
|
| I just wonder what system requirements Whisper has and whether
| there are open source voice recognition models that are
| specifically built for embedded devices.
| solarkraft wrote:
| Are you thinking about reimplementing Mycroft?
|
| Mycroft has done a lot of cool and important work in the
| field to ship an actual personal assistant product (stuff like
| wake word detection).
| dom96 wrote:
| hah, of course someone had the idea already and executed on
| it.
But yeah, basically that but without the screen (that would
| probably go a long way to decrease the cost; $299 is pretty
| steep for such a device)
| suyash wrote:
| This is only one side of the coin, you still need really good
| models for Speech Synthesis and then be able to have it all
| working in almost real time, ideally locally on device.
| ricopags wrote:
| As far as TTS goes, Mycroft.ai[0] has released a decent
| offline one.
|
| [0] https://mycroft.ai/
| MacsHeadroom wrote:
| I really want all this too. The smallest model is ~80mb and the
| largest is 3gb. Not sure about system requirements yet; but
| models that small suggest this may be doable locally on a
| single board computer.
|
| Edit: According to this comment[0] the base model runs in real
| time on an M1 CPU. The tiny model apparently decodes an audio
| file twice as fast. These are promising results.
|
| [0] https://news.ycombinator.com/item?id=32927360#32929739
| dom96 wrote:
| I'd be interested to see how well it performs on something
| like an RPi. M1 is pretty beefy.
| TOMDM wrote:
| Given how robust it seems to be with fast speech, I wonder if you
| could save cycles by speeding up the audio before feeding it in.
| eatsyourtacos wrote:
| Can this be used as a real-time transcription or is it too slow
| for that?
|
| Curious what anyone is using these days for a real-time
| transcription. It doesn't have to be perfect, but just good
| enough.
|
| My kids watch some youtube videos where people will make a mod
| where it converts them talking to text, then looks for keywords
| and spawns a boss in Terraria if you say the wrong keyword etc.
|
| I made a clone of that with the .NET System.Speech.Recognition
| library. It... works... but my biggest problem is that #1 it
| waits until you are done speaking to translate to text on the
| callback, so there was too much of a delay for it to be fun...
| the point is that it will be checking a stream of chatter. #2 is
| the recognition is pretty crap, I mean it's nearly good enough
| for my silly purpose but it's still pretty bad.
| blueberrychpstx wrote:
| If your family uses Apple devices, Apple offers free on-device
| speech recognition. Only caveat is that it needs to be
| restarted every minute due to whatever stupid limitation (or
| bug) they've introduced.
|
| https://developer.apple.com/documentation/speech/recognizing...
|
| Also, see `requiresOnDeviceRecognition`
| [deleted]
| [deleted]
| nshm wrote:
| Try https://github.com/alphacep/vosk-api/blob/master/csharp/demo...
| whimsicalism wrote:
| It might require too much work for what you are looking for,
| but the wav2letter library is the best real-time transcription
| OSS I have found by a considerable margin.
| davidzweig wrote:
| Out of interest, did you try Nemo?
| https://github.com/NVIDIA/NeMo
| whimsicalism wrote:
| No. I don't think it had streaming capabilities when I was
| doing this test two years ago, although I see it does now.
| TaylorAlexander wrote:
| The base model seems to run faster than real time on my
| machine. The "medium" model is larger and runs more slowly -
| roughly real time or maybe slightly slower.
| suyash wrote:
| Depends if you're trying to run it offline or over the cloud.
| tgtweak wrote:
| Good to see them releasing model weights - hopefully now that
| Stable Diffusion is out they will release Dall-E 2 source and
| weights as well.
| knaik94 wrote:
| I got super weird results with the 'medium' model and language
| Japanese (with --task translate).
The song is False Sympathy by Mondo Grosso.
|
| The line "[01:17.000 --> 01:32.000] Translated by Releska"
| appears when using the translate-to-English task. That entire
| part of the song is instrumental. This line does not appear at
| all in the original transcription, only in the opus-format rip.
|
| It shows up in the yt rip in format 251 (opus), but not in format
| 140 (aac from youtube), nor the flac rip. All three are giving
| different results.
|
| The translation quality is tied to bitrate. The same song
| converted to different words, the only difference being bitrates
| and formats. Converting my own rip with the same parameters as yt
| (opus @140 and then @130) didn't allow me to reproduce this
| error.
|
| The model hung for a solid extra minute at the end when
| translating to English; the last 90-ish seconds of the song took
| 60 seconds of real time, while the entire rest took about 90. The
| same behavior was not observed with the transcribe task.
|
| Some of the English words are incorrect, but that was expected.
| The first Japanese "mistake" I found was "Quan tehaEr Ren no"
| instead of "subeteha hutarino", with the left being what whisper
| wrote. A single random word "hey" was transcribed/translated to
| English even though it's the singer elongating the Yuan while
| singing the Le Yuan . "Luo chiteyuku Er Ren deXi garetaEr Ren
| noragu HEY" instead of "Luo chiteiku Suo detsunagareta Er Ren
| noLe Yuan " .
|
| I am using the official subtitles released on the youtube video.
|
| It's a complex Japanese song with both Japanese and English, and
| the original transcription took about 20 real-time seconds to
| produce the first line, 130 seconds for the whole song. It seems
| to be showing results in 20 second window increments, but this
| seems to depend on what it considers audio and what it is
| throwing away.
|
| On my computer I wasn't able to use the large model because I ran
| out of VRAM; I have 8GB, not sure how much more it'd require. So
| I ran it with medium.
|
| The MV is suggestive, in case that matters. I grabbed a fresh
| audio rip from YouTube because I didn't want to take it out of my
| CD case.
|
| https://www.youtube.com/watch?v=B6Y-WsgpzlQ
|
| It is translating this version differently from the director's
| cut version. I ripped both as opus.
|
| There is something weird about how it is handling the opus
| encoded version, as I find the same "Translated by Releska" in a
| wav version transcoded from the opus.
| amrrs wrote:
| Here's a live demo on Hugging Face Spaces if you want to try -
| https://huggingface.co/spaces/Amrrs/openai-whisper-live-tran...
| clemnt wrote:
| this is amazing! got it working in French too
| TaylorAlexander wrote:
| Hey this looks great! I like to record audio notes while driving
| in my car after work, to kind of decompress my thoughts from the
| day. But I never go back and listen as they can be long and
| meandering. Sometimes in the audio log I will sum up my thoughts,
| but this might be 20 minutes in and hard to find. I really wish I
| had transcriptions so I could easily scan the full contents. I
| have tried Mozilla Deepspeech (I don't want a cloud solution) and
| I was surprised to find that I could not get Deepspeech to
| reliably transcribe them. There is a bit of road noise, though I
| think for a human listener they are easy to understand. It looks
| like this one might actually do the trick!
|
| EDIT: Tried it and it worked great! It is very easy to use.
I just did the pip install line in the readme and was ready to
| go. You literally just run the one pip install line, and then
| you run the program in the format "whisper my_audio.wav" and it
| goes. Really nice job OpenAI!
| zhynn wrote:
| I do this too! I have been doing it for about a year now, and
| haven't ever run into someone else that does this kind of
| audio-journaling. Would you be up for comparing notes sometime
| about how it is working out for you? I am finding that it is an
| extremely effective form of self-care, but with lots of
| personal caveats. I would be so interested to hear your
| experience.
| blueberrychpstx wrote:
| Count me in!! Working on tools actually to turn these
| transcriptions into something more social
| tekacs wrote:
| I do this too, and I've built some software for it just for
| myself.
|
| I'd love to chat and hear about how you use this! My email is
| in my profile, or I'm @tekacs on Twitter (and everywhere). :)
| TaylorAlexander wrote:
| Oh cool! Yeah I have stopped doing it lately as I was not
| really using them (I would like to use them for making rough
| notes for future youtube video scripts), though in general it
| does seem like good self care too even if I don't review
| them. That said I just tried the base model on one of my
| voice logs and it was pretty good! Trying the medium model
| now and it seems basically perfect. So I will have to start
| doing these logs more!
|
| Anyway I am pretty terrible with email but short exchanges
| can work for me, or maybe we can connect over signal. Send me
| a message to my email in my profile and I would be happy to
| sync up!
| Snitch-Thursday wrote:
| Google's recorder app for android will let you record audio
| files and make some transcriptions, right on the device.
| Tenoke wrote:
| I just tested it and it was pretty mediocre, at least with my
| accent. I can definitely benefit from a decent app for quick
| note recording with a button press->transcribe->upload to
| gdrive/good UI app for later grepping.
| TaylorAlexander wrote:
| Was this with the default base model, or the medium or
| large model? This can be specified with the --model flag.
| Tenoke wrote:
| I meant the 'Google's recorder app' from the parent
| comment and not Whisper.
| capableweb wrote:
| Is that application actually doing on-device transcription?
| Under "Data safety" on the Google Play page it says "This app
| may share these data types with third parties: Audio" which
| doesn't exactly instill confidence that my audio will 100%
| always stay on my device. It also says "Data is encrypted in
| transit", but if data stays on the device, why does it have
| to be "encrypted in transit"? There should be no transit at
| all.
| petercooper wrote:
| I'll probably explore using this, but I've used an app called
| Just Press Record to do what you say. Runs on Apple Watch too,
| so you can tap a complication at any time in the day, speak,
| and you get a transcript on your phone, etc.
| anigbrowl wrote:
| Oh nice - I have an immediate use case for this. This looks
| accessible enough that the sci-fi dream of instantaneous audio
| translation is suddenly within reach.
| petercooper wrote:
| Just tested this on some developer podcasts which usually fail
| hard given they're full of technical jargon, brand names, etc.
| Whisper is a revolution! It's picking up terms like Heroku,
| DigitalOcean, GitHub, ECS, AWS, etc. and capitalizing properly -
| something nothing else did unless you provided a whole pile of
| guiding vocabulary.
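|
| If you want to run the same kind of jargon spot-check locally,
| something like this works (a sketch; "episode.mp3" and the term
| list are placeholders, and medium.en is just one reasonable
| model choice):
|
| ```
| import whisper
|
| model = whisper.load_model("medium.en")
| result = model.transcribe("episode.mp3")
|
| # See how brand names and acronyms come out, capitalization
| # included.
| for term in ["Heroku", "DigitalOcean", "GitHub", "ECS", "AWS"]:
|     print(term, "found" if term in result["text"] else "missing")
| ```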
| ma2rten wrote:
| Did these podcasts have transcripts? You might be inadvertently
| evaluating it on data that it was trained on, which is
| basically cheating. Even if not, it might be trained on similar
| podcasts. Judging how good these kinds of models are is really
| hard.
| WiSaGaN wrote:
| True. The test should only be done on material released
| _after_ the model.
| Jnr wrote:
| Cool!
|
| I am one of the top contributors to the tiny Mozilla Common Voice
| data-set for my language. The data-set is very small compared to
| other popular languages, and none of the other data-sets
| mentioned contribute anything in my language to Whisper's
| training.
|
| And even with so little data to train on it still works
| surprisingly well.
| jdmoreira wrote:
| Looking forward to seeing if this works well with foreign accents
| mminer237 wrote:
| They have an example in the post with a very thick Scottish
| accent. You should listen to it. It's pretty impressive.
| localy wrote:
| Are there any published benchmarks available outlining how this
| compares to other open source ASR software, such as Coqui.ai?
| bickett wrote:
| Hard to keep up with all the great things. The AI community is
| really moving quickly right now.
| aidenn0 wrote:
| For those on NixOS, here's a quick and dirty flake.nix that will
| let you make a venv in which to "pip install".
|
| Just put it in a flake.nix, and "nix develop" followed by
| "virtualenv ./venv; . ./venv/bin/activate; pip install
| git+https://github.com/openai/whisper.git"
|
|     {
|       description = "Python 3.9 development environment";
|       outputs = { self, nixpkgs }:
|         let
|           system = "x86_64-linux";
|           pkgs = import nixpkgs { inherit system; };
|         in {
|           devShells.${system}.default = pkgs.mkShell {
|             buildInputs = [
|               pkgs.ffmpeg
|               pkgs.python39
|               pkgs.python39Packages.pip
|               pkgs.python39Packages.numpy
|               pkgs.python39Packages.pytorch
|               pkgs.python39Packages.virtualenv
|             ];
|           };
|         };
|     }
| aidenn0 wrote:
| This should, in theory, work with CUDA; my GPU doesn't have
| enough RAM to do it (it runs out at 2.9GiB allocated, I have
| 4GiB, but am running a compositing desktop, which chews up
| about 600MiB; not sure where the other ~400MiB went)
|
| [edit]
|
| I confirmed CUDA worked with the "small" model, which used
| 3.3GB of GPU ram, and resulted in _much_ poorer recognition
| than the "medium" model on my CPU (but it ran at least two
| orders of magnitude faster).
|
|     {
|       description = "Python 3.9 development environment";
|       outputs = { self, nixpkgs }:
|         let
|           system = "x86_64-linux";
|           pkgs = import nixpkgs {
|             inherit system;
|             config.allowUnfree = true;
|             config.cudaSupport = true;
|           };
|         in {
|           devShells.${system}.default = pkgs.mkShell {
|             buildInputs = with pkgs; [
|               cudatoolkit linuxPackages.nvidia_x11 cudaPackages.cudnn
|               libGLU libGL
|               xorg.libXi xorg.libXmu freeglut
|               xorg.libXext xorg.libX11 xorg.libXv xorg.libXrandr
|               zlib ncurses5 stdenv.cc binutils ffmpeg
|               python39 python39Packages.pip
|               python39Packages.numpy
|               python39Packages.pytorch-bin
|               python39Packages.virtualenv
|             ];
|             shellHook = ''
|               export LD_LIBRARY_PATH="${pkgs.linuxPackages.nvidia_x11}/lib"
|             '';
|           };
|         };
|     }
| magicalhippo wrote:
| CUDA worked fine with large on my 2080Ti FWIW. The speedup is
| ridiculous, as expected. My Ryzen 3800X used almost an hour
| transcribing a minute's worth of speech, while the 2080Ti does
| it in like 10-20 seconds.
| BasilPH wrote:
| Any opinions on what this means for speech-to-text companies like
| rev.ai and assembly.ai?
|
| We've tested open source solutions for s2t, like Kaldi, but the
| quality was not good enough. However, one of the main advantages
| of a service like assembly.ai to me was that they offer sentence
| splitting in the form of punctuation, and speaker detection,
| which Kaldi does not.
|
| So I guess I answered my own question to some degree: an S2T
| service is more than just S2T. We already see assembly.ai add
| more and more features (like summarisation, PII redaction etc.)
| that are a value-add to plain S2T.
|
| Still, curious to hear what your take on that is.
| nshm wrote:
| You can apply the public punctuation model from Vosk on top of
| Kaldi output; you can also get speaker labels with existing
| open source software.
|
| On a quick video transcription test this model is more accurate
| than AssemblyAI and Rev AI. It will be harder for them to sell
| pure ASR now. Some more business-oriented applications will
| still be important though, for example ASR as part of a
| callcenter analytics solution or as part of a medical ERP
| system.
|
| The value of automatic summarization is small; without AI it is
| very hard to make it right, you need to be an expert in the
| field to understand what is important.
| adeptima wrote:
| Japanese results look pretty impressive!
|
| Took matsukoukuzira14Tou gaHai An niDa chiShang gerareru
| osutoraria(2022Nian 9Yue 21Ri )
| https://www.youtube.com/watch?v=bZkNIzeRBk4
|
| Extracted audio with youtube-dl -f bestaudio
| https://www.youtube.com/watch\?v\=bZkNIzeRBk4
|
| Converted into:
| [00:00.000 --> 00:13.000] osutorariaNan Bu noDao de, Zhen tsuXiang kuzira14Dong gaHai An niDa chiShang gerareteSi ndeirunogaJian tsukari, Zhuan Men Jia gaDiao Cha notameYuan Di Ru rishimashita.
| [00:13.000 --> 00:25.000] Yuan Di medeianiyorimasuto, osutorariaNan Bu nokinguDong de, 19Ri , Shao nakutomo14Dong noZhen tsuXiang kuziragaHai An niDa chiShang gerareteSi ndeirunogaJian tsukarimashita.
| [00:25.000 --> 00:31.000] hotondogaRuo iosutowoJian rare, Zhuan Men Jia gaXian Chang niZhong mukiDiao Cha niDang tatsuteimasu.
| [00:31.000 --> 00:41.000] kuziranoSi Hai haDa kikuYun ndariMai metarisurukotogaNan shiitame, Zi Ran niFen Jie sarerunowoDai tsuFang Zhen gaJian Tao sareteimasu.
| [00:41.000 --> 00:52.000] mata, Si Hai woJu i, samegaHai niJi maruKe Neng Xing gaarutoshite, Yuan Di Dong Ju hasahuanadoniZhou Wei niJin dukanaiyouniHu bikaketeimasu.
| [00:52.000 --> 01:02.000] Yi Fang , 21Ri nihatasumaniaDong deoyoso230Dong nokuziragaBang Bian niDa chiShang geraretaZhuang Tai deJian tsukarimashita.
| [01:02.000 --> 01:07.000] oyosoBan Shu gamadaSheng kiteiruMo Yang deJi Zhu Huo Dong gaJin merareteimasu.
| [01:07.000 --> 01:23.000] Jian tsukatsutanoha, gondokuziranoZhong Jian toJian rareteimasu.
| knaik94 wrote:
| Did you try translating them to English? I want to see if you
| get a similar error as me, with a random phrase "Translated by
| Releska" showing up.
| gzer0 wrote:
| Shocked at how good the results are, and how easy of an
| installation it is.
|
| Here are the exact steps to follow to get it running on Ubuntu
| 22.04 via WSL and yt-dlp:
|
|     1. pip install git+https://github.com/openai/whisper.git
|     2. yt-dlp -f 'ba' -x --audio-format mp3 https://www.youtube.com/watch/?v\=bZkNIzeRBk4
|     3. renamed the file to test.mp3
|     4. whisper test.mp3 --language Japanese --task translate --model large
|
| Note: the large model will download a ~3GB file
| tullie wrote:
| Great to see OpenAI finally being open :)
| nicholasjarnold wrote:
| This is so cool!
I was just speaking to a non-technical family
| member about privacy concerns around using "OK Google" and the
| like. They responded inquiring about "private" alternatives, to
| which my answer was "I'm not aware of good ones that give you
| that level of accuracy and convenience."
|
| Perhaps this development, along with continued optimization and
| device compute power increases, will lead us into a near-future
| where things like Mycroft devices and cellphones could have
| local-only speech-to-text and translation capabilities which are
| accurate even with the environmental background noise variations
| encountered IRL.
|
| Great work OpenAI team!
| mwlp wrote:
| Super impressive. I tested it on a Japanese streamer whose
| enunciation isn't exactly perfect and it did a decent job:
| https://www.youtube.com/watch?v=ROiOU1scaNA
|
| [00:00.000 --> 00:06.500] Since the last one started, the number of times I've eaten has decreased.
| [00:06.500 --> 00:11.000] If I get too carried away with the last one, I'll get hungry and do it.
| [00:11.000 --> 00:14.500] I don't have time to eat.
| [00:15.500 --> 00:18.000] I'm going to eat now.
| [00:20.000 --> 00:23.000] It's going to take about 10 minutes from here.
| [00:23.000 --> 00:31.000] It's been a while since I've had my last meal.
| [00:31.000 --> 00:36.000] I feel like I'm losing myNu Zi Li .
| [00:36.000 --> 00:39.000] I have to go back to my original self.
| [00:39.000 --> 00:44.000] I have to get ready and go to bed.
| [00:44.000 --> 00:46.000] It's not good.
| [00:46.000 --> 00:51.000] I've been drinking a lot lately, so I'm going home.
| [00:51.000 --> 00:53.000] I have to get my nails done this fall.
| [00:53.000 --> 00:54.000] Halloween nails.
| [00:54.000 --> 00:57.000] Halloween, Halloween, Halloween.
| [00:57.000 --> 00:59.000] I'm going to the beauty salon today.
| [00:59.000 --> 01:02.000] I'm going to get my nails done the day after tomorrow.
| [01:02.000 --> 01:10.000] I used to look at a lot of clothes, but I stopped looking at them.
| [01:10.000 --> 01:12.000] I'm going crazy.
| [01:12.000 --> 01:22.000] My stomach's stopped in the middle of summer.
| adeptima wrote:
| Translation is not the strongest part. Transcription looks very
| good.
| magicalhippo wrote:
| It's struggling with Norwegian. Which I guess isn't shocking.
| The large model performs a fair bit better than the small,
| though neither is "good".
|
| Though I assume the amount of Norwegian it has been exposed to
| is fairly limited, so in that light I'm actually impressed as
| well.
|
| I tried it on a news segment from the radio[1], this is the
| large model output:
|
| [00:14.000 --> 00:17.200] En skamløs krenking av FN pakten.
| [00:17.200 --> 00:24.000] USAs president og verdensledere svarer på den russiske presidentens atomtrusler og krigsmobilisering.
| [00:25.500 --> 00:29.400] Arbeidsklær som er ment til å være til begge kjønn, har det med å være tilpasset.
| [00:29.400 --> 00:33.400] Men hvordan ville det gått, om det var motsatt?
| [00:34.100 --> 00:38.900] Dyrevernsorganisasjon vil ha digital merking av regnstyr,
| [00:38.900 --> 00:44.900] men næringen selv insisterer på den gamle tradisjonsrike måten med rissing av kniv.
| [00:45.600 --> 00:51.400] Mange strømselskaper er positive til å tilby kundene fastpris på strøm, og det årevis.
| [00:51.400 --> 00:59.900] Da risikerer de å måtte betale mye i nettopp åretsvis, sier aktører som aldri tilbyr fastpris.
| [00:59.900 --> 01:21.900] Dette er onsdagens Dagsnytten. Jeg heter Espen Ås.
| | For reference, here's what he actually said, from the source[1] itself:
| * En skamløs krenking av FN-pakten. USAs president og verdensledere svarer på den russiske presidentens atomtrusler og krigsmobilisering.
| * Arbeidsklær som er ment å være til begge kjønn, er som regel tilpasset ... menn. Hvordan hadde det gått om det var motsatt?
| * Dyrevernsorganisasjon vil ha digital merking av reinsdyr, men næringen selv insisterer på den gamle tradisjonsrike måten med rissing av kniv.
| * Mange strømselskaper er positive til å tilby kundene fastpris på strøm - og det i årevis. - Da risikerer de å måtte betale mye i nettopp; årevis, sier aktør som aldri tilbyr fastpris
| Dette er onsdagens Dagsnytt 18 - jeg heter Espen Aas.
| | The translation didn't fare that well though:
| [00:14.000 --> 00:17.000] A shameless violation of the UN treaty.
| [00:17.000 --> 00:24.000] The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
| [00:24.000 --> 00:33.000] Work clothes that are meant to be for both genders have to be suitable, but how would it be if it was the other way around?
| [00:34.000 --> 00:44.000] The animal welfare organization will have a digital marking of reindeer, but the industry itself insists on the old traditional way of tearing a knife.
| [00:45.000 --> 00:51.000] Many electricity companies are positive in offering customers fixed electricity prices, and that is annual.
| [00:51.000 --> 00:58.000] Then they risk having to pay a lot in just a year, says an actor who has never offered fixed prices.
| [00:58.000 --> 01:20.000] This is Wednesday's Dagsnytt 18. My name is Espen Ås.
| | For reference, here's Google Translate's attempt, which is pretty good:
| * A shameless violation of the UN Charter. The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
| * Work clothes intended for both sexes are usually adapted to ... men. How would it have gone if it had been the other way around?
| * Animal welfare organizations want digital marking of reindeer, but the industry itself insists on the old, traditional way of marking with a knife.
| * Many electricity companies are positive about offering customers a fixed price for electricity - and for years. - Then they risk having to pay a lot in precisely; for years, says a player who never offers a fixed price
| This is Wednesday's Dagsnytt 18 - my name is Espen Aas.
| | [1]: https://radio.nrk.no/podkast/dagsnytt_atten/l_5ce3e323-97a3-... (not sure if it's available outside of Norway)
| kiwih wrote:
| Given this, are there good (and available/open source) models for text to speech? Last time I tried, everything still sounded extremely robotic and/or was a pain to set up and run. It would be fun to set up a pipeline where the two processes 'communicate'.
| obscur wrote:
| Measuring performance in rounds of successful Chinese whisper
| | (irony)
| pen2l wrote:
| Neat, https://github.com/openai/whisper - they have open-sourced it, even the model weights, so they are living up to their name in this instance.
| | The 4 examples are stunningly good (the examples have speakers with heavy accents, speaking in foreign languages, speaking with dynamic background noise, etc.); this is far and away better than anything else I've seen.
Will be super curious to see other folks trying it out and seeing if it's as robust as it seems, including when confronted with speech with natural tics and uhhh's and uhmm's and everything in-between.
| | I think it's fair to say that AI-transcription accuracy is now decidedly superior to the average human's; what the implications of this are, I'm not sure.
| anigbrowl wrote:
| It was already better. I edit a podcast and have > a decade of pro audio editing experience in the film industry, and I was already using a commercial AI transcription service to render the content to text and sometimes edit it as such (outputting edited audio).
| | Existing (and affordable) offerings are so good that they can cope with shitty recordings off a phone speaker and maintain ~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement and other people who need to gather poor-quality audio at scale, though much less great for the targets of repressive authority.
| | Having this fully open is a big deal though - now that level of transcription ability can be wrapped as an audio plugin and just used wherever. Given the parallel advances in resynthesis and understanding idiomatic speech, in a year or two I probably won't need to cut out all those _uuh like um y'know_ by hand ever again, and every recording can be given a noise reduction bath and come out sounding like it was recorded in a room full of soft furniture.
| adamgordonbell wrote:
| I've not found that to be the case.
| | For technical content, I use Rev.com and provide a glossary, and real humans do the transcript. Other AI transcription services get lots wrong because the context often matters. Words like "TCP/IP", "FAT disk format", or "Big Endian" I've never found AI to handle well so far.
| | I'm interested to test out Whisper on this one.
| | https://corecursive.com/063-apple-2001/
| deegles wrote:
| There's already software that can imitate a person's voice, so we have all the pieces already to do speech-to-text, clean up with GPT-3, and text-to-speech back in the original person's voice. Maybe with a style transfer to keep the person's inflections etc the same?
| Karuma wrote:
| I think something similar already exists. See this, for example: https://koe.ai/recast/
| | Although I don't know if they're using anything similar to what you suggest. Very cool idea, anyway!
| biomcgary wrote:
| Since you work on podcasts, do any open source transcription tools currently identify the speaker in the output? This would be particularly helpful for interviews.
| solarmist wrote:
| Any recommendations for particular services?
| anigbrowl wrote:
| I use a service called sonix.ai. It's paid but I think they have a free tier or trial period, and it's not very expensive. I'm excited about this new OpenAI thing because I'd rather do it on my own hardware than send it to the cloud, but this company has earned its commercial success.
| solarmist wrote:
| That is an exciting possibility. Being able to fix bad setups and missed takes automagically. It's always been possible, just expensive and time-consuming for moderate improvements.
| thfuran wrote:
| >~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement
| | 97% accuracy means roughly three or four errors per minute of speech.
That seems potentially extremely problematic for something like law enforcement use, where decisions with significant impact on people's day and/or life might be made on the basis of "evidence".
| gs17 wrote:
| Yeah, I tried to use automated transcription for a research project and we had to do it all manually, because the few errors (I would say it did pretty well given our recording quality) were often dropped words like "not", which changed the whole meaning of a sentence! It was a useful assistance during transcription, but I really hope they would verify it was correct before arresting anyone based on it.
| anigbrowl wrote:
| No it isn't. That just means 2-3% of your content needs to be double-checked by a person at the audio level, saving huge amounts of time - equally true of human transcription, in which individual words are often [UNINTELLIGIBLE].
| | Would you want to review this fully before going into court? Absolutely - because you'd want to play the recording to a jury for emotional impact. Can you rely on it when you want to quickly read through hours of conversation and make decisions about whether to invest further resources (which might just mean another hour of listening back to the original audio)? Also absolutely. Bear in mind that a lot of these errors have little to no semantic impact, being on the same level as typos or misspellings in a written communication.
| | Bear in mind too that if law enforcement (honest or not) is so interested in you that they're willing to record your conversations, your day is already ruined, you just don't know it yet. The change here is one of scale rather than quality.
| wging wrote:
| Doesn't it mean 100% of your content needs to be double-checked? You can't easily identify which 2-3% of your content has errors. I'm aware that errors are more likely when the model is less confident of its predictions, but that shouldn't be enough.
| | (edit for clarification: errors are not always something like "[UNINTELLIGIBLE]", where the system knows it doesn't know; they can also be misrecognitions that the system believes in with high confidence.)
| woah wrote:
| You double-check things that you think are important, in this case, passages that will be used as evidence in court.
| guelo wrote:
| Maybe you could run the text through a grammar checker to identify the errors.
| anigbrowl wrote:
| By the time you're prosecuting someone in court, yes, of course you double, triple, quadruple check everything. That's why lawyers get paid the big bucks (for now...). But yes, you can identify which content probably has errors and flag it as such.
| | Look, I have decades of experience dealing with human speech, and not just as an editor - I can trace the human voice from neural impulses in Broca's region through the physiology of vocal production, mechanical transduction into electrical signals, discrete Fourier transforms of the resultant waveforms into spectral information and back again, the reproduction of altered signals from time-aligned speakers to create a sense of spatialization, how those are processed in the human ear, and how the cilia are connected by nerves back to your brain. I'm a good enough editor that I can recognize many short words by sight of a waveform, or make 10 edits in a row by sight and know it will sound good on playback.
| | So when I say that machine transcription is as good as human realtime transcription now, I say so with the clear expectation that those decades of craft are very close to being rendered obsolete. I absolutely expect to hand off the mechanical part of editing to a machine within 2 years or so. It's already at the stage where I edit some interviews as text, like in a word processor, and then export the edited document as audio and it's Good Enough - not for every speaker, but more than half the time.
| | NPR and a lot of commercial broadcasters cut their material this way already, because you can get the same result from 30 minutes of reading and text editing that would require 3 hours of pure audio editing with no transcription.
| etienne618 wrote:
| Presumably you can use the 97% that is correctly transcribed to rapidly filter out the relevant content. This is likely to be only a small portion of the total content. Then you check 100% of that.
| datalopers wrote:
| If you know which 2-3% are the false positives, you have a very lucrative business model.
| MonkeyMalarky wrote:
| When doing validation, I find it will often be the same errors repeated again and again in a transcription. Like it will fail on someone's or something's name (that is rare / unique) and map it onto a known, similar-sounding word.
| gnramires wrote:
| I think an [UNINTELLIGIBLE] indication would be a great addition to automatic transcription systems.
| inanutshellus wrote:
| It'd [UNINTELLIGIBLE score="92%" alternatives="pro-rabble; pourable"]probably[/UNINTELLIGIBLE] be useful to make a markup-based output... though you'd probably find it gave you more info than you wanted.
| anigbrowl wrote:
| It already exists. The commercial product I use most is called sonix.ai and I think they have a free tier or trial period. It has some shortcomings, but it's shockingly good.
| thfuran wrote:
| >equally true of human transcription, in which individual words are often [UNINTELLIGIBLE].
| | ML systems somewhat notoriously do not necessarily make the same sorts of errors that a human would. And I'd expect a large portion of the errors to be transcribing the wrong words rather than indicating that a word couldn't be transcribed. That sort of error means that you can't really get away with manually reviewing just 3% of the audio.
| golem14 wrote:
| One would think that the few crucial bits of information gleaned are listened to manually, and the machine transcription is not the only thing the judge or a jury sees.
| thfuran wrote:
| You have absolutely ruined someone's day way before they're sitting in front of a jury.
| formerly_proven wrote:
| Stuff like that is a very good tell that someone has zero experience with law enforcement.
| j-krieger wrote:
| I've worked with similar technology in the law enforcement space and the software is never used to make decisions. You can mark out critical timestamps in conversations, and a law enforcement officer will always manually confirm the software's assessments.
| JohnFen wrote:
| Given that law enforcement has made similar claims about technology use in the past that turned out to be false, I have no faith in this claim.
| hadlock wrote:
| Microsoft announced their voice transcription technology a couple years ago and were also touting ~97-98% accuracy, which was actually _better_ than human transcription error rates.
The errors are usually in part due to people garbling their own speech, or moving their head while talking so the microphone misses a syllable. Anything in that error bar would probably fall under "reasonable doubt".
| kyriakos wrote:
| If it's anything like Microsoft Teams transcription, I doubt the 97%+ accuracy.
| soheil wrote:
| Their name reminds me of the company McDonald's uses to supply their beef, called 100% Pure Beef Inc., so they can say 100% Pure Beef on their menu.
| space_fountain wrote:
| This seems to not be true for McDonald's: https://www.snopes.com/fact-check/mcdonalds-100-beef/
| soheil wrote:
| This article seems very suspect to me. This is the main reason they assert why the claim is false:
| | "While this is a fascinating premise, there's nothing to it: McDonald's hamburger patties in the U.S. are made with 100% USDA-inspected beef. They are cooked and prepared with salt, pepper and nothing else; no preservatives, no fillers.
| | McDonald's of Australia's "Make Up Your Own Mind" web site said the following of the rumor in its Top FAQs section: Is it true that McDonald's created a company called "100% Australian Beef" just so they can say that in their advertising? No."
| | So if I'm McDonald's and want to squash a negative story, why not throw a few bucks at the pinnacle of journalism that is Snopes? (formerly Urban Legends Reference Pages)
| space_fountain wrote:
| This isn't exactly a hard story to fact-check. There is 0 evidence for this in either the reddit thread or really anywhere. If they were willing to lie about the company name, why not just lie about the beef in their burgers? It would be equally scandalous.
| soheil wrote:
| The company name could be 100% legit; there is nothing stopping you from forming a company with that name and not even selling beef.
| sam_goody wrote:
| It definitely happens.
| | There are at least two companies that have branded [..] Kosher Gelatin(tm). One of them makes gelatin that is considered non-kosher by all of the major kashrus agencies.
| | "Kosher Gelatin(r)", when in the ingredients, just means the product contains pork.
| jsight wrote:
| You are right, it could be. The problem is that it's the kind of thing that would be almost impossible to disprove if it were false. So you can always raise doubts about a supposed disproof.
| | But it'd be really easy to prove if it were true, and no one has offered proof. And there've been plenty of people who've looked for such proof, afaict.
| | My default assumption in such cases is that it is likely false.
| jefftk wrote:
| If this was more than an urban legend, someone would be able to dig up a company with this name and some indication that McD was working with them.
| pessimizer wrote:
| Something being possible to do isn't enough evidence for rational people to believe that it happened. From my perspective, it's possible that you're Iron Mike Tyson, or that you died after your last comment and this one was posted by the assassin who killed you.
| soheil wrote:
| What? I never said it's evidence that it did happen; please don't make things up. I just pointed out the evidence provided to refute the claim is possibly invalid.
| pessimizer wrote:
| You haven't offered any evidence is the point.
| [deleted]
| whichfawkes wrote:
| In the US, for a while I remember we had billboards advertising McDonald's burgers as being "1 <hamburger> <hamburger>% beef".
Because the hamburgers were of course circular, it looked kind of like "100%".
| | I remember thinking that surely an image of a hamburger does not legally constitute a zero.
| leobg wrote:
| Seems like this is an urban legend.
| | https://www.reddit.com/r/IsItBullshit/comments/2rztov/isitbu...
| soheil wrote:
| This seems to be primarily based on the referenced Snopes article https://news.ycombinator.com/item?id=32929237
| [deleted]
| bambax wrote:
| The French version is a little contrived. The speaker is a native speaker, but the text is obviously the result of a translation from English to French, not idiomatic French.
| | I will try to put the code to the test and see how it goes.
| octref wrote:
| I'm interested in building something with this to aid my own French learning. Would love to read your findings if you end up posting them somewhere like twitter/blog!
| bambax wrote:
| Tried again with Blaise Pascal -- the famous fragment of a letter where he says he's sorry he didn't have enough time to make it shorter.
| | Original:
| | > _Mes révérends pères, mes lettres n'avaient pas accoutumé de se suivre de si près, ni d'être si étendues. Le peu de temps que j'ai eu a été cause de l'un et de l'autre. Je n'ai fait celle-ci plus longue que parce que je n'ai pas eu le loisir de la faire plus courte. La raison qui m'a obligé de me hâter vous est mieux connue qu'à moi. Vos réponses vous réussissaient mal. Vous avez bien fait de changer de méthode ; mais je ne sais si vous avez bien choisi, et si le monde ne dira pas que vous avez eu peur des bénédictins._
| | Transcription:
| | > Mes rêves errent pères, mais l'detre navais pas accoutumé de se suivre de si près ni d'detre si étendu. Le peu de temps que j'sais eu a été cause de l'de l'de l'de autre. J'sais n'detre plus longue que parce que j'sais pas eu le loisir de la faire plus courte. La raison qui m'sa obligée de me hâter vous est mieux connue qu'moi. Vos réponses vous réussissaient mal. Vous avez bien fait de changer de méthode, mais je ne sais pas si vous avez bien choisi et si le monde ne dira pas que vous avez eu peur des bénédictes.
| | Here there are many more mistakes, so many that the beginning of the text is unintelligible. The language of the 17th century is probably too different. Still on the "medium" model, as the large one crashes the Colab (not sure how to select a beefier machine).
| | Still fascinating and exciting though.
| bambax wrote:
| I'm playing with a Colab posted in this thread (https://news.ycombinator.com/item?id=32931349), and it's incredibly fun and accurate!
| | I tried the beginning of L'Étranger (because you seem to be a fan of Camus ;-)
| | Here's the original:
| | > _Aujourd'hui, maman est morte. Ou peut-être hier, je ne sais pas. J'ai reçu un télégramme de l'asile : « Mère décédée. Enterrement demain. Sentiments distingués. » Cela ne veut rien dire. C'était peut-être hier._
| | > _L'asile de vieillards est à Marengo, à quatre-vingts kilomètres d'Alger. Je prendrai l'autobus à deux heures et j'arriverai dans l'après-midi. Ainsi, je pourrai veiller et je rentrerai demain soir. J'ai demandé deux jours de congé à mon patron et il ne pouvait pas me les refuser avec une excuse pareille. Mais il n'avait pas l'air content. Je lui ai même dit : « Ce n'est pas de ma faute. » Il n'a pas répondu. J'ai pensé alors que je n'aurais pas dû lui dire cela. En somme, je n'avais pas à m'excuser.
C'était plutôt à lui de me présenter ses condoléances._
| | Here's the transcription:
| | > Aujourdhui, maman est morte, peut être hier, je ne sais pas. J''ai reçu un télégramme de l''asile. Mère décédée, enterrement demain, sentiment distingué. Cela ne veut rien dire. C''était peut être hier.
| | > L''asile de Vieillard est à Maringot, à 80 km d''Alger. Je prendrai l''autobus à deux heures et j''arriverai dans l''après midi. Ainsi, je pourrai veiller et je rentrerai demain soir. J''ai demandé deux jours de congé à mon patron et il ne pouvait pas me les refuser avec une excuse pareille. Mais il n''avait pas l''air content. Je lui ai même dit, ce n''est pas de ma faute. Il n''a pas répondu. J''ai alors pensé que je n''aurais pas dû lui dire cela. En somme, je n''avais pas à m''excuser. C''était plutôt à lui de me présenter ses condoléances.
| | Except for the weird double quotes instead of the single apostrophe ('), it's close to perfect, and it only uses the "medium" model.
| | This is extremely exciting and fun! Happy to try other texts if you have something specific in mind!
| bambax wrote:
| Last try for tonight, with Baudelaire.
| | Original:
| | Trois mille six cents fois par heure, la Seconde
| Chuchote : Souviens-toi ! - Rapide, avec sa voix
| D'insecte, Maintenant dit : Je suis Autrefois,
| Et j'ai pompé ta vie avec ma trompe immonde !
| | Remember ! Souviens-toi ! prodigue ! Esto memor !
| (Mon gosier de métal parle toutes les langues.)
| Les minutes, mortel folâtre, sont des gangues
| Qu'il ne faut pas lâcher sans en extraire l'or !
| | Transcription:
| | > Trois mille six cents fois par heure, la seconde chuchote « Souviens toi », rapide, avec sa voix d''insecte, maintenant dit « Je suis autrefois », et j''ai pompé ta vie avec ma trompe immonde. « Remember, souviens toi, prodigue, est au mémoire, mon gosier de métal, parle toutes les langues, les minutes, mortelles folâtres, sont des gangs qu''il ne faut pas lâcher sans en extraire l''or. »
| | Not bad! Far from perfect, but it's a difficult text. Interesting that it works better with Baudelaire than Pascal.
| pen2l wrote:
| Interesting. I'm a non-native French speaker, and the original French piece struck me as being entirely normal (but maybe it was just the perfect French accent that swayed me). Can you please point out what he said which wasn't idiomatic or naturally-worded French?
| bambax wrote:
| Little details. The second sentence is really bizarre:
| | > _Nous établissons que l'utilisation de données d'un tel nombre et d'une telle diversité est la raison pour laquelle le système est à même de comprendre de nombreux accents..._
| | It doesn't sound natural at all. An idiomatic formulation would be more along the lines of:
| | _Le recours à un corpus [de données] si riche et varié est ce qui permet au système de comprendre de nombreux accents_ (With 'corpus', 'données' is implied.)
| | Of course this is just an example, and I'm sure other French speakers could come up with a different wording, but "données d'un tel nombre et d'une telle diversité" sounds really wrong.
| | This is also weird and convoluted:
| | > _Nous distribuons en tant que logiciel libre le code source pour nos modèles et pour l'inférence, afin que ceux-ci puissent servir comme un point de départ pour construire des applications utiles_
| | It should at least be "le code source DE nos modèles" and "servir DE point de départ", and "en tant que logiciel libre" should be placed at the end of the clause (after 'inférence').
| | Also, "construire" isn't used for code but for buildings, and "applications utiles" is unusual, because "utiles" (useful) is assumed. "...pour le développement de nouvelles applications" would sound more French.
| [deleted]
| _plg_ wrote:
| At the start, the "Nous établissons" part, for example. You wouldn't write that if you were starting from scratch in French.
| not_math wrote:
| You can see from the transcript where the model made some errors, for example:
| | > We distribute as a free software the source code for our models and for the inference [...]
| | Should be
| | > We are open-sourcing models and inference code [...]
| | Another example
| | > We establish that the use of such a number of data is such a diversity and the reason why our system is able [...]
| | Should be
| | > We show that the use of such a large and diverse dataset leads to improved robustness [...]
| Workaccount2 wrote:
| Can't wait to see twelve new $49.99/mo speech parser services pop up in the next few weeks.
| suyash wrote:
| More of this is welcome; they should live up to their name and original purpose and share other models (code, weights, datasets) with the open source community as well.
| Simorgh wrote:
| I've been experimenting with voice interfaces where typing is replaced by talking, but I find it hard to transition users to voice - we 'seem' to prefer typing to talking.
| | I wonder if this will change.
| ironlake wrote:
| Personally, I would rather type than talk when interacting with a computer. The only time I use voice interfaces is when the physical interface is so poor it's just easier to use voice. Apple TV devices are an example of this.
| shpx wrote:
| We shouldn't call this open source. The model definition + the data is the source code. The model weights are a compilation artifact.
| | > The source code must be the preferred form in which a programmer would modify the program. [...] Intermediate forms such as the output of a preprocessor or translator are not allowed.
| | > https://opensource.org/osd
| | If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model's source code would include the 680,000 hours of audio used to train the model.
| | Yes, that means that there are almost no open source models, and yes, it's awesome that they released this and made the weights available. Just don't call it open source.
| lfmunoz4 wrote:
| sergiotapia wrote:
| Does this work with multiple speakers?
| | I want to build a tool that takes a video and generates subtitles for it; then I want to index the subtitles and let people search for a specific quote to scrub to that part of the video using automatically generated URLs.
| | This is for a specific fandom with a ton of content: lots of dirty audio, mostly recorded in a gym setting with multiple people speaking.
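The subtitle half of that idea maps almost directly onto the Python API shown in the repo's README: `transcribe` returns per-segment timestamps, which can be written out as an SRT file and indexed for search. A minimal sketch with placeholder file names (note that Whisper itself does not separate speakers):

    import whisper

    def srt_time(seconds):
        # Format seconds as the SRT timestamp HH:MM:SS,mmm
        ms = int(seconds * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    model = whisper.load_model("base")
    result = model.transcribe("episode.mp3")  # any ffmpeg-readable file

    # Each segment carries start/end times in seconds plus its text,
    # which is enough both for subtitles and for a search index.
    with open("episode.srt", "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], start=1):
            f.write(f"{i}\n")
            f.write(f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
            f.write(seg["text"].strip() + "\n\n")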
| 867-5309 wrote:
| Pretty sure such a tool made the HN front page a few months ago.
| isoprophlex wrote:
| Really incredible to see that their multilingual audio-to-English approach is viable. I'm super excited about this, and it's great to see that OpenAI actually opens up about something, for once.
| | Skimming the codebase, I can't immediately see code to do additional training.
| | Being able to fine-tune the model to a specific language or case (e.g. teach it specifically about some technical topic that might not be so prevalent in the current training set) would be majorly disruptive to the current SOTA in "callcenter analytics" tech. Especially when combining Whisper with GPT-3.
| samstave wrote:
| AI speech recognition FN scares the heck out of me...
| | for so many reasons.
| | But one that really pisses me off is not being able to turn it off on the iPhone, and the fact that, beyond "hidden cameras in my Airbnb" -- soon we will have to worry about secret listening machines EVERYWHERE.
| jfoster wrote:
| Also, based on their demo, this model seems like it might have comprehension well above the level of a typical human.
| | Anyway, it's out there now. No way to turn back.
| ma2rten wrote:
| We will see an explosion of AI capabilities in the next couple of years. This will have a huge impact on our lives, much of it good but some of it also bad.
| samstave wrote:
| "Good" for ensuring you're a compliant consumer - bad if you're an individual person
| wongarsu wrote:
| "Secret listening machines everywhere" was a pretty big thing in East Germany. It's also the central theme of the movie The Lives of Others.
| | Of course, the ability to scale this more cheaply (throwing more compute at it, instead of more people) is somewhat scary, but it's not really introducing a new capability. Especially since you still have to do something with the transcript. An Airbnb landlord who reads the transcript of what you said could just as well have listened to the recording.
| ALittleLight wrote:
| I think it's a new capability to add good speech-to-text, search, and models that can understand and process text. You have microphones recording speech everywhere, models turning that speech into easily searchable text, and something like GPT-3 reading all the speech and raising red flags for any transgressive idea you please.
| samstave wrote:
| Yes, and if you want AI that is searching for "dissenters", we shall soon have "speech police" or tickets or some format of authoritarian punitive actions powered by this.
| zappy42 wrote:
| "John Spartan, you have been fined one credit for violation of the Verbal Morality Statute."
| jffry wrote:
| I'd argue that cheap, pervasive, always-on surveillance with a backlog of searchable transcriptions is a qualitatively different capability.
| samstave wrote:
| Exactly.
| | We are entering the next era...
| | The Kurzweil podcast appearance on Lex Fridman is nuts, and while I love Kurzweil, holy crap, even with my dystopian outlook he makes it even worse when you listen to even half of it...
| gareth_untether wrote:
| I'm thinking of releasing a plugin for Unity that can be used to match a phrase to an action. Seeing Whisper makes me think I should include a way to use voice and not just text.
| aidenn0 wrote:
| I just threw a random rock MP3 at it, and a first readthrough shows no transcription errors; this is quite good.
| | Now I just want OCR that's even 50% as good as this...
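Reproducing that kind of experiment takes only a few lines once the package is installed via `pip install git+https://github.com/openai/whisper.git`; the file name below is a placeholder:

    import whisper

    # "medium" trades speed for accuracy; the README's model table lists the options
    model = whisper.load_model("medium")
    result = model.transcribe("some_song.mp3")
    print(result["language"])  # detected language
    print(result["text"])      # the full transcription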
| aidenn0 wrote:
| Ran a few other songs through it and found one obvious mistranscription:
| | "He's the bedroom cosmic rocker" (should be "He's the veteran cosmic rocker" in _Veteran Cosmic Rocker_ by The Moody Blues)
| | I also noticed that it's a little on the conservative side for detecting speech; all songs were missing at least part of one line.
| funhighway wrote:
| Would be nice to give more details about the provenance and construction of the training data.
| [deleted]
| StevenWaterman wrote:
| That example at the top of the page (speed talking) blew me away. He started talking, I was stunned for a minute, then realised yes, it really was English, and I just burst out laughing.
| | That's so, so far beyond the previous state-of-the-art, it's absurd.
| londons_explore wrote:
| @dang Can we change the link to the github here[1]?
| | It seems to describe the project better for a technical audience.
| | [1]: https://github.com/openai/whisper
| toss1 wrote:
| Like every model I've seen, there is something like this:
| | >>A decoder is trained to predict the corresponding text...
| | Prediction of expected text in the context of the previous text.
| | While this is valuable in casual transcription, it can be extremely dangerous in serious contexts.
| | From personal experience, having given a deposition with "AI" transcription, I can say it will literally reverse the meanings of sentences.
| | This is because it produces the _EXPECTED_ output in a context, and _NOT THE ACTUAL OUTPUT_.
| | Like a speaker that clips the output, these types of systems 'clip' the really valuable information out of a transcription. Worse yet, this is a completely silent failure, as the transcript _LOOKS_ really good.
| | Basic info theory shows that there is more information contained in 'surprising' chunks of data than in expected ones. These systems actively substitute 'expected' speech for 'surprising' speech.
| | The transcript I got was utter trash, multiple pages of errata I had to submit when the norm is a couple of lines. And as I said, some errors literally reversed the meaning in a consequential way, and yet completely silently.
| | This kind of silent active failure mode is terrifying. Unless it is solved, and I see no way to solve it without removing ALL predictive algos from the system, these types of systems must not be used in any situation of serious consequence, at least not without real redundancy and backup.
| sowbug wrote:
| I knew there was a reason why I kept my MP3 library even after subscribing to Spotify. Now piping everything through Whisper. So far the generated lyrics are reasonable, though it thinks the REM song says "Linnie Bruce is not afraid."
| | No surprise that it appears to have successfully transcribed all the recordings of Harvard Sentences I could find. https://en.wikipedia.org/wiki/Harvard_sentences
| hijp wrote:
| Anyone get it running on an M1 Mac?
| | I keep getting `ModuleNotFoundError: No module named 'setuptools.command.build'`
| kif wrote:
| I got requirements installed, but then when running the Python example, I get:
| | RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
| kif wrote:
| Probably need to pass some kind of options when initializing.
| The command itself works fine, just shows a warning:
| warnings.warn("FP16 is not supported on CPU; using FP32 instead")
| mewse-hn wrote:
| Using this in the sample code worked for me:
| | >>> options = whisper.DecodingOptions(fp16=False)
| dceddia wrote:
| Yep, I had this too. `pip3 install -U pip setuptools` took care of it. (If you get an error about pip3, try `pip` instead.)
| hijp wrote:
| I'm really new to pip, but does this look ok?
| | (after running the command for setuptools)
| Defaulting to user installation because normal site-packages is not writeable
| Requirement already satisfied: pip in /Users/xxx/Library/Python/3.9/lib/python/site-packages (22.2.2)
| Requirement already satisfied: setuptools in /Users/xxx/Library/Python/3.9/lib/python/site-packages (65.3.0)
| | ---- after trying whisper installation:
| x Getting requirements to build wheel did not run successfully.
| exit code: 1
| +-> [20 lines of output]
| Traceback (most recent call last):
|   File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
|     main()
|   File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
|     json_out['return_val'] = hook(*hook_input['kwargs'])
|   File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 130, in get_requires_for_build_wheel
|     return hook(config_settings)
|   File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/setuptools/build_meta.py", line 154, in get_requires_for_build_wheel
|     return self._get_build_requires(
|   File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/setuptools/build_meta.py", line 135, in _get_build_requires
|     self.run_setup()
|   File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/setuptools/build_meta.py", line 150, in run_setup
|     exec(compile(code, __file__, 'exec'), locals())
|   File "setup.py", line 2, in <module>
|     from setuptools_rust import Binding, RustExtension
|   File "/private/var/folders/lj/7x6d3dxd3cbdtt484k6xsmyh0000gn/T/pip-build-env-ieaydl8r/overlay/lib/python3.9/site-packages/setuptools_rust/__init__.py", line 1, in <module>
|     from .build import build_rust
|   File "/private/var/folders/lj/7x6d3dxd3cbdtt484k6xsmyh0000gn/T/pip-build-env-ieaydl8r/overlay/lib/python3.9/site-packages/setuptools_rust/build.py", line 23, in <module>
|     from setuptools.command.build import build as CommandBuild  # type: ignore[import]
| ModuleNotFoundError: No module named 'setuptools.command.build'
| [end of output]
| | note: This error originates from a subprocess, and is likely not a problem with pip.
| | error: subprocess-exited-with-error
| dceddia wrote:
| Nope, that doesn't look good! I honestly just googled the error and installing setuptools fixed it for me, but I barely know anything about the Python ecosystem so I'm really just fumbling around here.
| hijp wrote:
| haha same, thanks
| Smaug123 wrote:
| I'm still not successfully using the GPU, but it's working decently quickly (with the base model - it's incredibly slow to use the Large model) using just the CPU. I'm going to have to check what magic stable-diffusion is doing to enable the GPU :(
| dceddia wrote:
| There's a --device flag you can pass.
I've been trying to get `--device cuda` to work on my Windows machine and it's saying that torch wasn't compiled with CUDA. Trying to figure out what's going on there.
| | And on the M1, supposedly PyTorch has support for hardware acceleration using MPS (Metal Performance Shaders, announced here: https://pytorch.org/blog/introducing-accelerated-pytorch-tra...) but when I tried `--device mps` it blew up with an error "input types 'tensor<1x1280x3000xf16>' and 'tensor<1xf32>' are not broadcast compatible".
| Smaug123 wrote:
| Yep, same for me: on M1, after enabling MPS (with `model.to("mps")`) it just either SIGSEGVs or SIGABRTs every time with that line. The extremely unclean nature of the abort is making it hard to debug :(
| dceddia wrote:
| I noticed the size seems to correspond to the model. With a large model, the error is tensor<1x1280x3000xf16>. With tiny, it's tensor<1x384x3000xf16>, and with medium it's tensor<1x1024x3000xf16>. It also seems like a bad thing that those are f16's but the "expected" data is f32.
| Smaug123 wrote:
| I'm giving up for the night, but https://github.com/Smaug123/whisper/pull/1/files at least contains the setup instructions that may help others get to this point. Got it working on the GPU, but it's... much much slower than the CPU? Presumably due to the 'aten::repeat_interleave.self_int' CPU fallback.
| | Also hitting a nice little PyTorch bug:
| | > File "/Users/patrick/Documents/GitHub/whisper/whisper/decoding.py", line 388, in apply logits[:, self.tokenizer.encode(" ") + [self.tokenizer.eot]] = -np.inf
| | > RuntimeError: dst_.nbytes() >= dst_byte_offset INTERNAL ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Copy.mm":200, please report a bug to PyTorch.
| nik_s wrote:
| I just tested the model [1] using an RTX 3090, trying to translate a French text I found here [2].
| | Some observations:
| | - The full translation of the 6:22 minute video takes about 22 seconds (17x real time)
| | - It recognizes the language by default (and did a good job recognizing that it was French audio)
| | - MIT License [3]!
| | - The quality of the transcription is good, but not perfect.
| | - The quality of the translation (if you don't count transcription errors as translation errors) is generally very good.
| | ---
| | The transcription:
| | > Bonjour à tous, <error>j'suis</error> espère que vous allez bien, c''est ENTI. Et aujourd', <error>aujourd',</error> on se retrouve <error>un peu physique</error> pour parler de la termo dynamique. Vous ne vous inquiétez pas, ça va bien se passer. On va y aller ensemble, <error>être à par exemple,</error> je vous accompagne à travers une série de vidéos pour vous expliquer les principes de base en termo dynamique. Et bah, c''est parti, on va y aller tranquillement. Lidée, c''est vous puissiez comprendre la termo dynamique dans son ensemble. Donc, je vais vraiment prendre mon temps pour <error>couplisser</error> bien comprendre les notions,
| | The translation:
| | > Hello everyone, I hope you're doing well, it's NT and today we find ourselves a little physical to talk about the thermo dynamic. Don't worry, it's going well, we're going to go together and be the same. I'm going to accompany you through a series of videos to explain the basic principles in thermo dynamic. Well, let's go, <error>we're going to go quietly</error>.
The idea is that you can understand the thermo dynamic <error>in sound together</error>. So I'm really going to take my time to understand the notions,
| | ---
| | All in all, very happy that OpenAI is publishing their models. If Stable Diffusion is any guide, people will hack some crazy things with this.
| | [1] https://github.com/openai/whisper
| [2] https://www.youtube.com/watch?v=OFLt-KL0K7Y
| [3] https://github.com/openai/whisper/blob/main/LICENSE
| seszett wrote:
| > _dans son ensemble_
| | > _in sound together_
| | That's hilarious and, honestly, incredibly bad. "Dans son ensemble" is a very common idiom (meaning "as a whole") while "in sound together" has to be pretty rare. "Son" means "his/hers/its" as well as "sound", and the former meaning is probably more common in general, so I have no idea how this result could arise.
| | "Termo" also doesn't exist in French, it's "thermo", so the transcript even makes orthographic errors.
| | And I forgot about "couplisser", which is also a hilarious made-up word that sounds like it could mean something, but doesn't! _Edit_ Google finds exactly one reference to it, in a patent with a typo on the word "coulisser".
| | I'm still impressed by the transcript quality since it covers many languages, but the translation part is quite poor.
| StevenWaterman wrote:
| Was this with the `base` model? `large` is running ok on a P100 in Colab, but is about 4% the speed of `base.en`. Certainly seems like some of these models will be fast enough for real-time.
| joshcryer wrote:
| It also runs well on a CPU and seems to have proper memory management. Wonderful timing, because I was using DeepSpeech for some audio recordings and it required me to script up a splitter to convert the files to .wav and then do snippets of 10 seconds each. Everything about this just works out of the box. On a Core i5 I'm getting about 30 seconds every minute. Transcriptionist jobs just turned into editor jobs. I love how it drops the inflections in the audio as well, because it was trained on transcription work, and that is one of the first things you learn to do (drop the uhs and ums and huhs etc, unless it is a strictly verbatim transcription).
| solarmist wrote:
| Is it translation or transcription? Or both?
| | Both, wow. This is really interesting.
| StevenWaterman wrote:
| Both, the blog covers it in detail. Pass in audio in any language, and get an English transcription out.
| nik_s wrote:
| It can do both - I've edited my original post to show the translation task.
| gok wrote:
| Comparing this model's word error rates to the state of the art [1] on a few common test sets:
| 
|                                Whisper    SoTA
|     LibriSpeech test-clean       2.7%     1.8%
|     LibriSpeech test-other       5.6%     2.9%
|     Switchboard                 13.1%     4.9%
|     CallHome                    15.8%     9.5%
| 
| The authors do explicitly state that they're trying to do a lot of fancy new stuff here, like be multilingual, rather than pursuing just accuracy.
| | [1] https://github.com/syhw/wer_are_we
| lunixbochs wrote:
| I suspect Whisper is more robust than other "SOTA" models, but this release is likely leaving a fair bit of accuracy on the table considering the amount of resources OpenAI is capable of throwing at training it.
| | Comparing the readily available test sets from the paper to some of my personal robust models (for the Talon models, this is greedy decoding, no language model):
| 
|                           Talon   Talon   Talon   Whisper   wav2vec 2.0
|                           28M     300M    1B      Large     960h
|     librispeech clean     3.21    2.52    2.40    2.7       2.7
|     librispeech other     8.21    6.56    5.63    5.6       6.2
|     common voice          13.88   11.65   8.86    9.5       29.9
|     tedlium               7.51    6.55    5.47    4.0       10.5
| 
| I have a battery of more difficult tests on hand (including adversarial tests and diverse accent-specific metrics). I'll look at running these tests on each of the Whisper model sizes and following up with a larger comparison.
| allanrbo wrote:
| Talon was the first thing that came to my mind when I saw this news. Would be nice if it could benefit from Whisper. (Big fan of your work on Talon!)
| ma2rten wrote:
| I'm looking forward to your comparison. It's really hard to make sense of how good this model actually is without being an expert in the area.
| nshm wrote:
| It is interesting how they compare with wav2vec2 instead of NeMo Conformer (which is more accurate) in Table 2.
| StevenWaterman wrote:
| One of the things they point out is that the SoTA on e.g. LibriSpeech is _only_ good at LibriSpeech, and doesn't generalise as well.
| | > Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper's zero-shot performance across many diverse datasets we find it is much more robust and makes 50% fewer errors than those models.
| lunixbochs wrote:
| My own experience agrees: the generally available "SOTA" models are not especially robust, and can be _extremely_ bad (>50% absolute error rate) at some tasks. I'll post some preliminary numbers in a sibling comment and look into running my full set of tests on Whisper.
| | It looks like Whisper is probably leaving a lot of accuracy on the table, but initially it does seem to be a lot more robust than general "SOTA" models.
| | For a quick comparison, Silero's accuracy charts are kind of nice because they post results for a large variety of datasets. Scroll down to the EN V6 xlarge EE model (not the xlarge CE) [1]
| | [1] https://github.com/snakers4/silero-models/wiki/Quality-Bench...
| jawadch93 wrote:
| LanternLight83 wrote:
| Hoping to see this put to use in open source voice assistants, e.g. Mycroft.
| liminalsunset wrote:
| I really wish I had this about half a year ago when I was building a tool to automatically turn online school lectures into searchable, clickable transcripts (kind of like YouTube or EdX transcripts).
| | I was originally using Adobe Premiere Pro's speech to text to do it, and wrote Python to convert its output to the Hyperaudio format on GitHub. With this, I can totally skip that whole step, and this is fully open source, too.
| | App idea:
| | Build an app that takes a video and uses Hyperaudio or a similar project to add a clickable and searchable transcript (clicking in the transcript seeks the video)
| resoluteteeth wrote:
| You could already do the speech recognition in a fully open source way with Vosk easily, although Whisper may be more accurate.
| throwamon wrote:
| Is it feasible to use this for Talon-like voice-driven computer usage?
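The word error rates compared in the tables above are word-level edit distance (substitutions, insertions, and deletions) divided by the length of the reference transcript. A minimal sketch of the metric, for anyone who wants to score Whisper's output against a known-good transcript:

    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate: word-level edit distance / number of reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 substitution in 6 words, ~0.167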
| FloatArtifact wrote:
| Maybe; a number of speech recognition engines have been integrated into https://github.com/dictation-toolbox/dragonfly
| dubeye wrote:
| I know a manual transcription company which is still seeing modest growth from existing clients who also use ASR, so it's not quite there yet.
| londons_explore wrote:
| I wonder how much the 30 second window is impacting performance?
| | Anecdotally, I feel like there are plenty of times that I need context from more than 30 seconds ago to understand some technical jargon that's being discussed.
| chrisstanchak wrote:
| Hold on to your papers
| smusamashah wrote:
| How well does it do for technical and domain-oriented speech? For example, I have audio recordings of a senior explaining some very technical aspects of our software. Will it understand the technical terms in that speech?
| | I guess I will need to download it and run it on them to see how correct it is.
| emcq wrote:
| Be wary of using this model - the licensing seems sketchy. Several of the datasets used for training, like WSJ and TED-LIUM, have clear non-commercial clauses. I'm not a lawyer, but releasing a model as "MIT" seems dubious, and hopefully OpenAI has paid for the appropriate licenses during training, as they are no longer a research-only non-profit.
| nshm wrote:
| I think they didn't use WSJ for training, only for evaluation. The paper includes WSJ under "Evaluation datasets".
| jefftk wrote:
| This is a big dispute right now: OpenAI and other AI companies generally take the position that models learning from data does not make the output of the models a derivative work of that data. For example, GitHub Co-pilot uses all publicly available GitHub code regardless of license, and DALLE-2/StableDiffusion/etc use lots of non-free images. I don't think this has been challenged in court yet, and I'm very curious to see what happens when it is.
| petercooper wrote:
| I think it might be even less problematic with something like Whisper than with DALLE/SD? Merely consuming data to train a system or create an index is not usually contrary to the law (otherwise Google wouldn't exist) - it's the _publication_ of copyrighted content that's thorny (and is something you can begin to achieve with results from visual models that include the Getty Images logo, etc.)
| | I think it'd be a lot harder to make a case for an accurate audio-to-text transcription being seen to violate the copyright of any of the training material in the way a visual could.
| emcq wrote:
| This is even slightly more direct: access to WSJ data requires paying LDC for the download, and the pricing varies depending on what institution / license you're from. The cost may be a drop in the bucket compared to compute, but I don't know that these licenses are transferable to the end product. We might be a couple court cases away from finding out, but I wouldn't want to be inviting one of those cases :)
| zeagle wrote:
| It would be exceptional to get a healthy competitor to Microsoft/Nuance's Dragon monopoly on voice recognition in healthcare. At a couple thousand bucks a license and the more recent SaaS subscription trend, there is a lot of money to be made in that space.
| darkpicnic wrote:
| I just wrote a script with Hazel to automatically transcribe my voice notes to txt. It handles punctuation extremely well. What a wonderful contribution!
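That Hazel rule translates naturally into a small watch-folder script. A rough Python equivalent, with an invented folder layout; only the `load_model`/`transcribe` calls are the library's actual API:

    import pathlib
    import whisper

    notes_dir = pathlib.Path("~/VoiceNotes").expanduser()  # hypothetical folder
    model = whisper.load_model("base")

    # Transcribe any voice note that doesn't have a .txt next to it yet
    for audio in sorted(notes_dir.glob("*.m4a")):
        target = audio.with_suffix(".txt")
        if target.exists():
            continue
        result = model.transcribe(str(audio))
        target.write_text(result["text"].strip() + "\n", encoding="utf-8")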
| abidlabs wrote: | Here [1] is a video tutorial on building a web UI that accepts | microphone input and runs it through Whisper for speech | transcription | | [1] | https://www.youtube.com/watch?v=ywIyc8l1K1Q&ab_channel=1litt... | amrrs wrote: | Thank you for sharing! ___________________________________________________________________ (page generated 2022-09-21 23:00 UTC)