[HN Gopher] Whisper - open source speech recognition by OpenAI ___________________________________________________________________ Whisper - open source speech recognition by OpenAI Author : _just7_ Score : 850 points Date : 2022-09-21 16:16 UTC (6 hours ago) (HTM) web link (openai.com) (TXT) w3m dump (openai.com) | wongarsu wrote: | > About a third of Whisper's audio dataset is non-English, and it | is alternately given the task of transcribing in the original | language or translating to English. We find this approach is | particularly effective at learning speech to text translation and | outperforms the supervised SOTA on CoVoST2 to English translation | zero-shot. | | That's intriguing. You can just set the model to transcribe | everything into English, no matter which language the speaker is | using, and it just works. Given that many people are much better | at understanding English than at speaking it, this might make | voice interfaces much more accessible without much work. | FloatArtifact wrote: | This would be a cool thing to integrate into Dragonfly | https://github.com/dictation-toolbox/dragonfly | rexreed wrote: | I'd love to find a way to test this with longer audio but I don't | have GPU resources and not exactly sure how to load that into the | Colab. Is anyone planning on hosting or sharing a model that can | be used by others to test longer form audio (for podcast | transcription)? | londons_explore wrote: | I've never seen transcription and translation combined into a | single step like this before... | | Have I been living under a rock, or is this new? | | I assume it should help performance, because it means emphasis, | timing and tone can be used to inform the translation. Helps make | better guesses about information missing from the source | language. | jerpint wrote: | I recorded myself speaking French and was able to translate | decently well on my laptop. Very impressive! | jfoster wrote: | It seems like OpenAI are finally living up to their name for once | with this release? Anything I'm missing? | | From what I can gather: | | 1. Includes model weights. I can't find the URL, but they | reference them enough and have a CLI tool, so I presume I just | haven't found them yet. | | 2. Includes code: https://github.com/openai/whisper | | 3. Released under MIT License: | https://github.com/openai/whisper/blob/main/LICENSE | thesausageking wrote: | It's one model and in a non-strategic area where there are | existing open source projects (Kaldi, DeepSpeech, ...). | | For a company that raised $1B, that's not exactly living up to | their name and original mission. | whimsicalism wrote: | > It's one model and in a non-strategic area where there are | existing open source projects (Kaldi, DeepSpeech, ...). | | I can already tell this is much better than any of the | existing open source projects with the exception of the wav2* | sequence of projects and potentially nvidia's nemo. | StevenWaterman wrote: | (Model weights from | https://github.com/openai/whisper/blob/main/whisper/__init__... | ) | | "tiny.en": "https://openaipublic.azureedge.net/main/whisper/mod | els/d3dd5..." | | "tiny": "https://openaipublic.azureedge.net/main/whisper/models | /65147..." | | "base.en": "https://openaipublic.azureedge.net/main/whisper/mod | els/25a85..." | | "base": "https://openaipublic.azureedge.net/main/whisper/models | /ed3a0..." | | "small.en": "https://openaipublic.azureedge.net/main/whisper/mo | dels/f953a..." 
| | "small": "https://openaipublic.azureedge.net/main/whisper/model | s/9ecf7..." | | "medium.en": "https://openaipublic.azureedge.net/main/whisper/m | odels/d7440..." | | "medium": "https://openaipublic.azureedge.net/main/whisper/mode | ls/345ae..." | | "large": "https://openaipublic.azureedge.net/main/whisper/model | s/e4b87..." | mmastrac wrote: | Large is 3GB to save everyone a click. Tiny is 72MB. | anigbrowl wrote: | That's unexpectedly lightweight - enough to run in some | phones. | solarmist wrote: | This kind of model is harder to abuse, so I guess it passed | their internal checks much more easily. | | I can understand not releasing GPT-3, even if I disagree with | the decision. | ignoramous wrote: | > _This kind of model is harder to abuse, so I guess it | passed their internal checks much more easily._ | | The version I choose to believe: _stability.ai_ ate DALL-E | for lunch, and that woke them up. | solarmist wrote: | This is probably also true. | jfoster wrote: | True. The potential of GPT-3 to cause internet mayhem was/is | significant. I would argue that the mere act of announcing it | was still a catalyst for an eventual GPT-3-like model being | released. In revealing it, they established a target for what | open source models could aim to achieve, and simultaneously | got bad actors thinking about ways to abuse it. | dwohnitmok wrote: | > I can understand not releasing GPT-3, even if I disagree | with the decision. | | Why do you disagree? | bigyikes wrote: | I don't see how GPT-3 is any more dangerous than Stable | Diffusion, Photoshop, that fake news website the crazy | person you're friends with on Facebook really likes, or any | of the number of other tools and services that can be used | to generate or spread fake information. | jfoster wrote: | All of your examples are limited in some way, but GPT-3 | wouldn't have any meaningful limits. | | Stable Diffusion: Marks images as AI-generated. | (invisible watermark, but still, it's there) | | Photoshop: Requires time & effort from a human. | | Fake news website: Requires time & effort from a human. | xkapastel wrote: | I wouldn't really say Stable Diffusion marks images as | AI-generated. There's a script in the Stable Diffusion | repository that will do that, but it's not connected to | the model itself in a meaningful way. I use Stable | Diffusion a lot and I've never touched this script. | | https://github.com/CompVis/stable- | diffusion/blob/69ae4b35e0a... | capableweb wrote: | What "script" are you using for doing txt2img? The | watermark function is automatically called when you use | the CLI in two places, https://github.com/CompVis/stable- | diffusion/blob/69ae4b35e0a... and | https://github.com/CompVis/stable- | diffusion/blob/69ae4b35e0a... | | Trivial to remove, I give you that. But AFAIK, the | original repository + most forks put the watermark | automatically unless you've removed it on your own. | spullara wrote: | SD only does that if you don't delete the line of code | that does it... | mmh0000 wrote: | Because why should the wealthy and connected be the only | ones -allowed- have access to such life improving | technology? | solarmist wrote: | Two reasons. First, someone else will release something | similar. Second, I didn't see a related push from them to | work with other in the industry to do something productive | towards safety with the time they got by delaying | availability of these kinds of models. So it felt | disingenuous. 
| bredren wrote:
| This is dropping right in the middle of Interspeech 2022.
|
| I don't believe OpenAI has anyone presenting at the conference,
| so presumably this was timed to coincide with that and get buzz
| at the conference.
|
| Curious how this model compares with FOSS STT from the startup
| Coqui.
| revskill wrote:
| It's actually better than Google Meet's subtitle system.
| blueberrychpstx wrote:
| This is absolute garbage python as I am neither a python
| developer, nor a good developer. I was trying to play around with
| real time transcriptions. However, it does work!
|
| > * recording
| > * done recording
| > Recording saved to file.wav
| > Press enter to transcribe
| > /Users/laptop/Development/Personal/Public/pythonProject1/venv/lib/python3.9/site-packages/whisper/transcribe.py:70: UserWarning: FP16 is not supported on CPU; using FP32 instead
| >   warnings.warn("FP16 is not supported on CPU; using FP32 instead")
| > Detected language: english
| > Goodbye, I need to go pick up my wife.
| > Press enter to start recording
|
| Any improvements welcome here.
|
| ```
| import wave
|
| import pyaudio
| import whisper
|
|
| def record_microphone(seconds, filename="file.wav"):
|     """Record mono 16-bit audio from the default input device to a WAV file."""
|     chunk = 1024
|     sample_format = pyaudio.paInt16
|     channels = 1
|     rate = 44100
|
|     p = pyaudio.PyAudio()
|     stream = p.open(format=sample_format, channels=channels, rate=rate,
|                     input=True, frames_per_buffer=chunk)
|     print("* recording")
|     frames = []
|     for _ in range(int(rate / chunk * seconds)):
|         frames.append(stream.read(chunk))
|     print("* done recording")
|     stream.stop_stream()
|     stream.close()
|     sample_width = p.get_sample_size(sample_format)
|     p.terminate()
|
|     with wave.open(filename, 'wb') as wf:
|         wf.setnchannels(channels)
|         wf.setsampwidth(sample_width)
|         wf.setframerate(rate)
|         wf.writeframes(b''.join(frames))
|     return filename
|
|
| if __name__ == '__main__':
|     seconds = 5
|     model = whisper.load_model("base")  # load once, not on every loop
|     while True:
|         print("Press enter to start recording")
|         input()
|         filename = record_microphone(seconds)
|         print("Recording saved to " + filename)
|         print("Press enter to transcribe")
|         input()
|         result = model.transcribe(filename)
|         print(result["text"])
| ```
| yawnxyz wrote:
| Oh man I remember LOVING Micro Machines as a kid.
|
| But also, this tool seems much better than Otter.ai, which gets
| every third word wrong when transcribing microbiology recordings
| alexb_ wrote:
| Combine the translation + transcription with voice synthesis, and
| once compute power allows for this to be miniaturized we will be
| able to have babel-fish technology in real life.
| no1youknowz wrote:
| This is awesome. But I really want the other way.
|
| To be able to give it text and hear the speech. A TTS (text to
| speech).
|
| As a language learner, the ability to create my own sentences
| (based on existing ones I have, changing a word here or there)
| would be amazing.
|
| How long till we have this, I wonder. I know I could use a
| service to do this currently. But I'd prefer having something
| running locally.
|
| Hopefully someone in the OpenAI team reads this. :)
| TaylorAlexander wrote:
| I suspect this is coming.
I mean, we do have decent text to
| speech systems already, but in this vein of "we used neural
| networks and now it's very, very good" you can imagine that with
| something like GPT-3, they could extend it with this speech to
| text system so you could speak to it for input, and then a
| natural progression is that it can use text to speech to return
| the output, so you just have a voice-oriented conversational
| system.
|
| So I think TTS is a logical part of the system. I also think
| that there are peculiarities of voice interaction that aren't
| captured in text training datasets, so they would need to do
| some fine tuning on actual voice conversation to make it feel
| natural.
|
| All in due time I suppose.
| noreally_ wrote:
| A notebook is available to try with your microphone on Colab
| here: https://colab.research.google.com/drive/1nBZ-pDIaIi3N1DIIXvJ...
|
| I'm surprised by the quality on non-English languages, given that
| 80+% of the training data is English, and the rest is split
| between tens of languages.
| bambax wrote:
| Thanks! I played with this in French and posted the results as
| replies to this comment:
| https://news.ycombinator.com/item?id=32928643
|
| It's sometimes close to perfect, and sometimes goes off the
| rails; I think that maybe the model tries to establish some sort
| of consistency for each sentence; if it starts wrong for the
| first few words of a sentence, it can't build the rest properly.
|
| But it's super fun.
| goffi wrote:
| Really interesting, I can see a ton of potential uses.
|
| 2 questions:
|
| 1) how does it compare to state of the art FOSS solutions? I'm
| thinking of DeepSpeech or Vosk
|
| 2) would it be somehow possible to associate timestamps with the
| words recognized? That would be amazing for things such as audio
| editing or skipping to a particular location in a video
| nshm wrote:
| You rightly mention timestamps. There are many other important
| properties of a good ASR system, like vocabulary adaptability
| (whether you can introduce new words) or streaming. Or
| confidences. Or latency of the output. Compared to Vosk models,
| this model cannot work in a streaming manner, so it's not very
| suitable for real-time applications.
|
| But in general the model is robust and accurate, and trained on
| an amount of speech we never dreamed about in Vosk. We will
| certainly benefit from this model as a teacher (together with
| others like gigaspeech models). I recently wrote about it:
| https://alphacephei.com/nsh/2022/06/14/voting.html
| goffi wrote:
| > goffi
|
| for 2), it's actually written in the description: "phrase-level
| timestamps", so it should be possible (phrase level is neat for
| skipping to a specific location in a video, but maybe not for
| audio editing).
| IceWreck wrote:
| Is there a list of system requirements somewhere? Can it run on
| cheaper low-memory GPUs? Maybe CPUs?
| StevenWaterman wrote:
| Their models range from 70MB to 3GB. The largest model is
| smaller than the optimised Stable Diffusion. Not sure what the
| inference speed is like, haven't tried it myself yet.
| IceWreck wrote:
| I just tested it myself. It's fast enough on Colab, a couple
| of seconds, but not sure if it's fast enough to transcribe
| realtime audio yet.
| [deleted]
| mewse-hn wrote:
| I know this isn't a tech support forum but maybe someone here
| knows.
I'm attempting the sample python code from the github and | _almost_ get a transcription running on my work laptop without a | GPU, but I run into this error message: | | >>> result = whisper.decode(model, mel, options) | | Traceback (most recent call last): | | [snip] | | RuntimeError: "slow_conv2d_cpu" not implemented for 'Half' | | It looks like a Torch error, is there some twiddling with | "options" I can do to get it to run? | mewse-hn wrote: | I seem to have worked around it by tweaking the "options" line | from the sample code to this: | | >>> options = whisper.DecodingOptions(fp16=False) | O__________O wrote: | Anyone know if it is possible to output IPA using this? | | International Phonetic Alphabet (IPA) | | - https://wikipedia.org/wiki/International_Phonetic_Alphabet | | _________ | | EDIT: Based on list of languages in the tokenizer code here, | doesn't appear IPA is supported: | | https://github.com/openai/whisper/blob/5f8d4bcc254d4f3e833d3... | jcims wrote: | Did respectably with some mumble rap: | https://controlc.com/d353dafb | | (some NSFW words in the lyrics obv) | derangedHorse wrote: | Whisper performed a lot better than I would've expected it to! | mmh0000 wrote: | Okay this is super impressive. I just downloaded Whisper and fed | it a random flac file I had handy and it did a really good job. | Also impressive that it works on my weak CPU: | | A 3m07s flac took 5m to transcribe: $ whisper | --device cpu 'BLACKPINK - BORN PINK/01 Pink Venom.flac' | Detecting language using up to the first 30 seconds. Use | `--language` to specify the language Detected language: | korean [00:00.000 --> 00:10.000] Blackpink | [00:11.000 --> 00:14.000] Kick in the door, wave in the coco | [00:14.000 --> 00:16.000] pabkonineun cinge ggyeodeul saenggag | malgo [00:16.000 --> 00:19.000] I talk to talk, run ways I | walk walk [00:19.000 --> 00:21.000] him gamgo pab pab an | bwado ceog [00:21.000 --> 00:24.000] By one and two by two | [00:24.000 --> 00:26.000] nae songgeut du hanae tamyeon ajieun | jung [00:26.000 --> 00:30.000] gas jasyo jigeum hwaryeohae | T makes no sense [00:30.000 --> 00:32.000] You couldn't | get a dollar out of me [00:33.000 --> 00:38.000] ja oneul | bamiya nuntobeul pumgo [00:38.000 --> 00:41.000] mihoneul | bbaeseum down [00:41.000 --> 00:43.000] Look what you made | us do [00:43.000 --> 00:47.000] ceonceonhi neol jamjaeul | paieo [00:48.000 --> 00:52.000] jami nal mankeum | areumdaweo [00:52.000 --> 00:53.000] I bring the pain like | [00:53.000 --> 00:57.000] diseutab, paengpaeng, diseutab, | paengpaeng, diseutab, paengpaeng, paengpaeng [00:57.000 --> | 00:58.000] Get em, get em, get em [00:58.000 --> | 01:00.000] Straight till you don't like [01:00.000 --> | 01:01.000] Whoa, whoa, whoa [01:01.000 --> 01:03.000] | Straight till you don't like [01:03.000 --> 01:04.000] Ah, | ah, ah [01:04.000 --> 01:05.000] Taste that, pink venom | [01:05.000 --> 01:06.000] Taste that, pink venom | [01:06.000 --> 01:08.000] Taste that, pink venom | [01:08.000 --> 01:09.000] Get em, get em, get em | [01:09.000 --> 01:11.000] Straight till you don't like | [01:11.000 --> 01:12.000] Whoa, whoa, whoa [01:12.000 --> | 01:13.000] Straight till you don't like [01:13.000 --> | 01:14.000] Ah, ah, ah [01:14.000 --> 01:15.000] Blackpink | and Amo [01:15.000 --> 01:17.000] Got it by the smack ram | [01:17.000 --> 01:18.000] But rest in peace [01:18.000 --> | 01:19.000] Please light up a candle [01:19.000 --> | 01:20.000] This the knife of a vando [01:20.000 --> | 01:22.000] Messed up and I'm 
still in saline ...SNIP...
| lunixbochs wrote:
| Looks like it defaults to the model called "small".
|
| I just ran some benchmarks - M1 Max, pytorch, with a 1.29 second
| flac (looks like the matrix math was running on a single
| thread):
|
|     tiny     146.522ms detect_lang     549.131ms decode_one   0.057ms tokenizer
|     base     354.885ms detect_lang    1046.679ms decode_one   0.011ms tokenizer
|     small    803.892ms detect_lang    3194.503ms decode_one   0.017ms tokenizer
|     medium  2279.689ms detect_lang   10128.255ms decode_one   0.023ms tokenizer
|     large   3656.478ms detect_lang   17249.024ms decode_one   0.016ms tokenizer
| lazylion2 wrote:
| I ran it on this clip
|
| https://clips.twitch.tv/ReliablePopularWerewolfOSkomodo-pcuw...
|
| because... hard accent.
|
| first run, whisper thought it was Welsh, so I had to run with
| --language en, and it did pretty well
|
| https://i.imgur.com/TQiYU9X.png
|
| took 36 seconds in Google colab
| manishsharan wrote:
| Oh, this is a relief to have something open source in this
| field. I had been using Mozilla DeepSpeech for transcribing my
| voice notes, often with hilarious to incomprehensible results.
| DeepSpeech is dead; so I will be sure to check this out.
| w10-1 wrote:
| Naively, training the same model on multiple languages has
| interesting implications.
|
| On one hand, it may capture something "deeper" about language.
|
| On the other hand, it's likely to do great in general, but miss
| particularities of some language.
|
| Understanding the coverage of the training model seems a
| perennial problem. Is there any (shorthand) way to compare
| language model training corpora?
|
| Clearly if they use common subsets we have a literal comparison.
| I'm more interested in whether there's progress in characterizing
| corpora by speech styles, fluency, vocabulary sets, (noise)
| environment, emotionality, proposition types, etc.
|
| (btw: 25 minutes for a 9-minute segment on a 12-thread x86. Lots
| of jargon spelled as it sounds. Sentences capitalized but no
| punctuation. Overall good.)
| dindindin wrote:
| I'm not in the Speech Recognition circles and am looking for open
| source speech recognition I can play around with - would this be
| the new state of the art?
| mercurywells wrote:
| For me as a deaf person the current state of the art (in terms
| of speed & usability) is the Recorder app on a Google Pixel
| phone (4a/6 Pro is what I've used)
| StevenWaterman wrote:
| Yes
| visarga wrote:
| Most probably
| The5thElephant wrote:
| How is it Apple, Google, or Microsoft are not further ahead of
| the game on speech recognition like this? They have the resources
| to hire the best ML researchers and throw tons of computing hours
| at it, yet Siri, Google, and Cortana continue to struggle to get
| anywhere near this level of comprehension.
| wongarsu wrote:
| Siri and Cortana have to run at least in real time, with
| reasonable compute resources. Probably faster than real time
| when the audio gets shipped off to the cloud and transcribed
| there. This model can't do that (in the "large" version, which
| the examples use).
|
| Also, you are comparing Whisper's highlight reel with everyday
| performance of other models. Nobody shows their weaknesses in
| their highlight reel.
| alex_marchant wrote:
| Siri until iOS 15 was done in the cloud IIRC.
| coder543 wrote:
| Someone else in this thread[0] said Whisper was running at
| 17x real time for them. So, even a weak machine might be able
| to do an acceptable approximation of real time with Whisper.
| | Also, I feel like shipping to the cloud and back has been | shown to be just as fast as on device transcription in a lot | of scenarios. Doing it on device is primarily a benefit for | privacy and offline, not necessarily latency. (Although, | increasingly powerful smartphone hardware is starting to give | the latency edge to local processing.) | | Siri's dictation has had such terrible accuracy for me (an | American English speaker without a particularly strong | regional accent) and everyone else I know for so many years | that it is just a joke in my family. Google and Microsoft | have much higher accuracy in their models. The bar is so low | for Siri that I automatically wonder how much Whisper is | beating Siri in accuracy... because I assume it has to be | better than that. | | I really wish there was an easy demo for Whisper that I could | try out. | | [0]: https://news.ycombinator.com/item?id=32928207 | lunixbochs wrote: | 17x realtime _on a 3090_ | | I did some basic tests on CPU, the "small" Whisper model is | in the ballpark of 0.5x realtime, which is probably not | great for interactive use. | | My models in Talon run closer to 100x realtime on CPU. | coder543 wrote: | "CPU" isn't necessarily the benchmark, though. Most | smartphones going back years have ML inference | accelerators built in, and both Intel and AMD are | starting to build in instructions to accelerate | inference. Apple's M1 and M2 have the same inference | accelerator hardware as their phones and tablets. The | question is whether this model is a good fit for those | inference accelerators, and how well it works there, or | how well it works running on the integrated GPUs these | devices all have. | | Brute forcing the model with just traditional CPU | instructions is fine, but... obviously going to be pretty | slow. | | I have no experience on the accuracy of Talon, but I've | heard that most open source models are basically overfit | to the test datasets... so their posted accuracy is often | misleading. If Whisper is substantially better in the | real world, that's the important thing, but I have no | idea if that's the case. | lunixbochs wrote: | See https://news.ycombinator.com/item?id=32929029 re | accuracy, I'm working on a wider comparison. My models | are generally more robust than open-source models such as | Vosk and Silero, but I'm definitely interested in how my | stuff compares to Whisper on difficult held-out data. | | > Brute forcing the model with just traditional CPU | instructions is fine, but... obviously going to be pretty | slow. | | It's not that simple. Many of the mobile ML accelerators | are more targeted for conv net image workloads, and | current-gen Intel and Apple CPUs have dedicated hardware | to accelerate matrix math (which helps quite a bit here, | and these instructions were in use in my tests). | | Also, not sure which model they were using at 17x | realtime on the 3090. (If it's one of the smaller models, | that bodes even worse for non-3090 performance.) The 3090 | is one of the fastest ML inference chips in the world, so | it doesn't necessarily set realistic expectations. | | There are also plenty of optimizations that aren't | applied to the code we're testing, but I think it's | fairly safe to say the Large model is likely to be slow | on anything but a desktop-gpu-class accelerator just due | to the sheer parameter size. | lunixbochs wrote: | Ok, my test harness is ready. 
My A40 box will be busy
| until later tonight, but on an NVIDIA A2 [1], this is the
| batchsize=1 throughput I'm seeing. Common Voice, default
| Whisper settings, card is staying at 97-100% utilization:
|
|     tiny.en:   ~18  sec/sec
|     base.en:   ~14  sec/sec
|     small.en:   ~6  sec/sec
|     medium.en: ~2.2 sec/sec
|     large:     ~1.0 sec/sec (fairly wide variance when ramping
|                up as this is slow to process individual clips)
|
| [1] https://www.nvidia.com/en-us/data-center/products/a2/
| coder543 wrote:
| Isn't the A2 much weaker than a 3090? So those results are
| promising.
|
| EDIT: for what it's worth, Nvidia rated the A2 at 18 TFLOPS of
| FP16, and Apple rates the current A16 Neural Engine at 17
| TFLOPS of FP16. I'm sure it's not an "apples to apples"
| comparison.
| The5thElephant wrote:
| Good point about realtime or not, however with ML I have found
| the weaknesses get addressed pretty fast by someone. There is a
| big step between proof of concept and practical application
| though, so we shall see.
| Kuinox wrote:
| OpenAI is owned by Microsoft FYI.
| neongreen wrote:
| Is it? Googling suggests that Microsoft invested in OpenAI
| but doesn't actually own it.
| Kuinox wrote:
| Oh, my bad, looks like they only bought an exclusive license
| to GPT-3.
| fxtentacle wrote:
| This AI has a 30 second delay on the audio processing because
| it needs to be able to "look into the future" to get these good
| results. That 30s delay would be unacceptable for
| Siri/Google/Cortana.
| coder543 wrote:
| A lot of models we currently use seem to do the same thing.
| The model will transcribe a "best effort" interpretation in
| real time, then as you continue speaking, you'll see it go
| back and make corrections. I'm sure you can feed the first X
| seconds you have into the model, followed by (30-X) seconds
| of silence, and it will do real time transcription just
| fine... it would be weird if this broke anything. Then, as
| you get more speech, you continue getting better
| transcription of the first 30 seconds, then you switch to a
| 30 second sliding window.
|
| Maybe I'm missing something, but I don't see the problem
| here.
| fxtentacle wrote:
| Yes, that's because Whisper - like pretty much all of them
| - uses a Transformer encoder with Attention layers. And the
| Attention layers learn to look into the future.
|
| And yes, what you describe could be done. But no, it won't
| reduce latency that much, because the model itself learns
| to delay the prediction w.r.t. the audio stream. That's why
| ASR-generated subtitles usually need to be re-aligned after
| the speech recognition step. And that's why there is
| research such as the FastEmit paper to prevent that, but
| then it is a trade-off between latency and quality again.
|
| Also, running your "low-latency" model with 1s chunks means
| you now need to evaluate the AI 30x as often as if you were
| using 30s chunks.
| coder543 wrote:
| You just said the models pretty much all work the same way,
| then you said doing what I described won't help. I'm
| confused. Apple and Google both offer real time, on device
| transcription these days, so _something_ clearly works.
| And if you say the models already all do this, then running
| it 30x as often isn't a problem anyways, since again...
| people are used to that.
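|
| The padding idea above could look something like this (a rough
| sketch, not Whisper's shipped behavior - pad_or_trim,
| log_mel_spectrogram and decode are the real lower-level API
| from the README, while the sliding-window loop around them is
| hypothetical):
|
| ```
| import numpy as np
| import whisper
|
| SAMPLE_RATE = 16000          # Whisper's expected input rate
| WINDOW = 30 * SAMPLE_RATE    # the model's fixed 30 s context
|
| model = whisper.load_model("base.en")
|
| def decode_partial(buffer: np.ndarray) -> str:
|     """Decode a growing float32 audio buffer as it arrives."""
|     window = buffer[-WINDOW:]                    # 30 s sliding window
|     audio = whisper.pad_or_trim(window, WINDOW)  # pad with silence to 30 s
|     mel = whisper.log_mel_spectrogram(audio).to(model.device)
|     options = whisper.DecodingOptions(language="en", fp16=False)
|     return whisper.decode(model, mel, options).text
| ```
|
| Each call re-decodes the latest 30 seconds, so earlier text can
| still be revised until it slides out of the window, which
| matches the "best effort, then corrected" behavior described
| above.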
| | I doubt people run online transcription for long periods | of time on their phone very often, so the battery impact | is irrelevant, and the model is ideally running (mostly) | on a low power, high performance inference accelerator | anyways, which is common to many SoCs these days. | fxtentacle wrote: | I meant that most research that has been released in | papers or code recently uses the same architecture. But | all of those research papers use something different than | Apple and Google. | | As for running the AI 30x, on current hardware that'll | make it slower than realtime. Plus all of those 1GB+ | models won't fit into a phone anyway. | beastman82 wrote: | In my unmeasured empirical observation Google has amazing | speech recognition | jeffbee wrote: | I tried feeding the four examples from this announcement into | Google as dictation inputs and it just sits there blankly. On | the JFK speech test file in the repo, Google understands | perfectly. The samples in the announcement are clearly | outside the capabilities of anything Google has launched | publicly, but I don't know how that translates to overall | utility in every day applications. | The5thElephant wrote: | I agree they have the best compared to Apple, Amazon, | Microsoft. However I don't think it is as good as what is | being shown here by OpenAI. | Vetch wrote: | My experience with the APIs is Google is excellent and | Microsoft is slightly better. And the offline model I've | been using that's nearly as good as both is facebook's | wav2vec2-large-960h-lv60-self. | | Don't believe what's on marketing pages, they rarely | transfer to the real world. Will have to make time to try | it and see. In theory, given task diversity and sheer | number of hours, it should be a lot more robust but will | wait on evidence before believing any claims on SoTA. | andy_xor_andrew wrote: | Hold on, it does not only speech recognition, but also language | translation, in the same model? | | What an interesting approach. What benefits does this have over | having two dedicated models, one for speech-to-text, and another | for translation? | | It just seems so odd, given the problems of speech-to-text and | Spanish-to-English seems so different from one another (in terms | of the problem domain). Seems so unusual to have both handled by | one model! | | Does knowledge of speech-to-text carry over into knowledge of | translation? Does knowledge of translation carry over into | knowledge of speech-to-text? So weird. | newhaus1994 wrote: | My understanding is that multi-modal models are the primary | focus of OpenAI right now, due to their stated goal of | achieving AGI. This product is probably better thought of as an | offshoot of their work to create a fully generalizable model, | rather than a specific attempt to provide | translation/transcription services. | TaylorAlexander wrote: | It seems these days that language-oriented models are commonly | becoming multilingual by default. There are a lot of common | threads when understanding sentence construction between | different languages. French and English have different rules | but they will still have things like nouns, adjectives, | subjects, prepositions, etc. It seems that by training models | on many languages you get both a more robust understanding of | language, and it saves you the trouble of having to make many | more localized models for every language. 
I also believe that
| the other languages help the models construct sentences in
| languages which have very small training sets. If it has a few
| examples in a rare language as well as good translations to a
| better-known language, then it can provide good support for the
| rare language.
|
| We also see in image generation models that multi-modal
| networks are more powerful than single purpose networks. As we
| move towards more advanced AI systems I suspect we will see
| more and more generalizable networks with distinct advantages
| over separate networks that get plugged together.
| magicalhippo wrote:
| Would a multilingual model perhaps also be better at
| understanding non-native speech?
| thuttinger wrote:
| I tried running it in realtime with live audio input (kind of).
|
| If you want to give it a shot, you can find the python script in
| this repo: https://github.com/tobiashuttinger/openai-whisper-realtime
|
| A bit more context on how it works: The system's default audio
| input is captured with Python, split into small chunks and is
| then fed to OpenAI's original transcription function. It tries
| (currently rather poorly) to detect word breaks and doesn't split
| the audio buffer in those cases. With how the model is designed,
| it doesn't make the most sense to do this, but I found it would
| be worth trying. It works acceptably well.
| minimaxir wrote:
| The model output can be tweaked to produce audio embeddings (akin
| to BERT for text embeddings and CLIP for image embeddings), which
| can lead to some _interesting_ applications as the previous two
| examples have demonstrated.
| FerociousTimes wrote:
| What do you mean exactly by audio embeddings?
| minimaxir wrote:
| Represent a given set of audio inputs as a numeric vector,
| which can then for example be finetuned for other ML/AI
| problems or placed in an embeddings database for easy ANN
| search with similar audio clips. In the extreme case it could
| facilitate better AI audio generation, similar to how CLIP can
| guide a VQGAN.
|
| Although the 30 second minimum input is a bit of a bummer
| since it may not allow much granularity in the resulting
| embeddings.
| lynguist wrote:
| How can I use this (or something similar) for live translation? I
| don't mind if there's a 30s delay.
|
| As in I don't want to input a file, I want to input the
| microphone sound.
| agnos wrote:
| Would also like to know this. It looks like they're processing
| the audio file in 30 second chunks, so a naive approach of
| keeping a buffer of 30-second input stream chunks and just
| continually writing to an output .mp3 could work...
| blueberrychpstx wrote:
| Was wondering the same.
|
| I really wish I would have been paying attention in Unix
| class...
|
| Something like `microphone | chunk 3s | whisper | stdout` would
| be SO COOL!!! I think that's possible but too lazy to look
| more.
| spywaregorilla wrote:
| Hmm, are there any noteworthy open sourced speech to speech
| models? Like transform a spoken line to another voice, copying
| both the words spoken and the inflections?
| cercatrova wrote:
| Their Scottish accent example is pretty good, I'd like to see it
| work on some very strong English accents like this one:
| https://www.youtube.com/watch?v=nJ7QB3om-QY
| homarp wrote:
| Detected language: english
|
| [00:00.000 --> 00:05.400] Gordy and County Kerry are
| investigating the theft of up to 60 sheep on Mount Brandon.
|
| [00:05.400 --> 00:10.400] One of the farmers is offering a reward for information leading to the return of the use,
| [00:10.400 --> 00:12.200] which are worth thousands of euro.
| [00:12.200 --> 00:14.200] Well, I'm fine with that.
| [00:14.200 --> 00:15.200] That's right.
| [00:15.200 --> 00:16.200] Do you own them?
| [00:16.200 --> 00:17.200] Anyone can say it.
| [00:17.200 --> 00:18.200] Fine with that.
| [00:18.200 --> 00:22.720] Last Saturday, Mikey Joe O'Shea brought his flock of Scotch sheep down from the mountain
| [00:22.720 --> 00:25.320] commonage ahead of lambing.
| [00:25.320 --> 00:29.840] He discovered over 50 were missing, allowing for a number of deaths and
| [00:29.840 --> 00:30.840] strays.
| [00:30.840 --> 00:34.600] Mikey is convinced over 45 sheep have been stolen.
| [00:34.600 --> 00:35.600] It was a good night.
| [00:35.600 --> 00:36.600] It would be a full moon there.
| [00:36.600 --> 00:37.600] It would be a good night.
| [00:37.600 --> 00:38.600] It would be bright out.
| [00:38.600 --> 00:40.600] There could be anyone going up in the mountains.
| [00:40.600 --> 00:41.600] It would be a good night.
| [00:41.600 --> 00:43.600] Well, that was 45 sheep missing.
| [00:43.600 --> 00:49.600] Mikey and the lambs and everything in the sheep, they counted out a nice bit of money.
| [00:49.600 --> 00:52.200] They've been doing the boat in Nassan.
| [00:52.200 --> 00:53.200] It's a big one.
| [00:53.200 --> 00:54.200] It's a big one.
| [00:54.200 --> 00:55.200] It's a big one.
| [00:55.200 --> 00:59.000] Mikey's next door neighbor says some of his sheep have also been stolen.
| [00:59.000 --> 01:00.000] Come back.
| [01:00.000 --> 01:01.000] Come back.
| [01:01.000 --> 01:02.000] Come back.
| [01:02.000 --> 01:03.000] I've been missing about 10 years.
| [01:03.000 --> 01:04.000] It's not all that difficult.
| [01:04.000 --> 01:06.320] All they've got to do is have a good dog.
| [01:06.320 --> 01:10.560] Have a good dog and go at night, some moonshine night.
| [01:10.560 --> 01:11.560] Just put the dog around him.
| [01:11.560 --> 01:14.120] Put him on a trailer and walk him.
| [01:14.120 --> 01:18.360] And then probably somebody else to pick him up.
| [01:18.360 --> 01:29.960] Everybody's doing it north, but he's doing it.
| cercatrova wrote:
| Wow that is incredibly impressive. At 0:53 is it translating
| as well? Didn't sound like English to me.
| mod wrote:
| Those are Irish.
| biggerChris wrote:
| We have reached sentient mode.
| dom96 wrote:
| This really makes me want to build an Amazon Echo/Google Nest/etc
| replacement that's open hardware, open source and most
| importantly recognises voice completely offline. I find that I
| don't use these smart devices for much more than setting timers
| anyway so this seems like an easy project.
|
| I just wonder what system requirements Whisper has and whether
| there are open source voice recognition models that are
| specifically built for embedded devices.
| solarkraft wrote:
| Are you thinking about reimplementing Mycroft?
|
| Mycroft has done a lot of cool and important work in the
| field to ship an actual personal assistant product (stuff like
| wake word detection).
| dom96 wrote:
| hah, of course someone had the idea already and executed on
| it.
But yeah, basically that but without the screen (that would
| probably go a long way to decrease the cost; $299 is pretty
| steep for such a device)
| suyash wrote:
| This is only one side of the coin, you still need really good
| models for Speech Synthesis and then be able to have it all
| working in almost real time, ideally locally on device.
| ricopags wrote:
| As far as TTS goes, Mycroft.ai[0] has released a decent
| offline one.
|
| [0] https://mycroft.ai/
| MacsHeadroom wrote:
| I really want all this too. The smallest model is ~80mb and the
| largest is 3gb. Not sure about system requirements yet; but
| models that small suggest this may be doable locally on a
| single board computer.
|
| Edit: According to this comment[0] the base model runs in real
| time on an M1 CPU. The tiny model apparently decodes an audio
| file twice as fast. These are promising results.
|
| [0] https://news.ycombinator.com/item?id=32927360#32929739
| dom96 wrote:
| I'd be interested to see how well it performs on something
| like an RPi. M1 is pretty beefy.
| TOMDM wrote:
| Given how robust it seems to be with fast speech, I wonder if you
| could save cycles by speeding up the audio before feeding it in.
| eatsyourtacos wrote:
| Can this be used as a real-time transcription or is it too slow
| for that?
|
| Curious what anyone is using these days for a real-time
| transcription. It doesn't have to be perfect, but just good
| enough.
|
| My kids watch some youtube videos where people will make a mod
| where it converts them talking to text, then looks for keywords
| and spawns a boss in Terraria if you say the wrong keyword etc.
|
| I made a clone of that with the .NET System.Speech.Recognition
| library. It... works... but my biggest problem is that #1 it
| waits until you are done speaking to translate to text on the
| callback, so there was too much of a delay for it to be fun...
| the point is that it will be checking a stream of chatter. #2 is
| the recognition is pretty crap, I mean it's nearly good enough
| for my silly purpose but it's still pretty bad.
| blueberrychpstx wrote:
| If your family uses Apple devices, Apple offers free on-device
| speech recognition. Only caveat is that it needs to be
| restarted every minute due to whatever stupid limitation (or
| bug) they've introduced.
|
| https://developer.apple.com/documentation/speech/recognizing...
|
| Also, see `requiresOnDeviceRecognition`
| [deleted]
| [deleted]
| nshm wrote:
| Try https://github.com/alphacep/vosk-api/blob/master/csharp/demo...
| whimsicalism wrote:
| It might require too much work for what you are looking for,
| but the wav2letter library is the best real-time transcription
| OSS I have found by a considerable margin.
| davidzweig wrote:
| Out of interest, did you try Nemo?
| https://github.com/NVIDIA/NeMo
| whimsicalism wrote:
| No. I don't think it had streaming capabilities when I was
| doing this test two years ago, although I see it does now.
| TaylorAlexander wrote:
| The base model seems to run faster than real time on my
| machine. The "medium" model is larger and runs more slowly -
| roughly real time or maybe slightly slower.
| suyash wrote:
| Depends if you're trying to run it offline or over the cloud.
| tgtweak wrote:
| Good to see them releasing model weights - hopefully now that
| Stable Diffusion is out they will release Dall-E 2 source and
| weights as well.
| knaik94 wrote:
| I got super weird results with the 'medium' model and language
| Japanese (with --task translate).
The song is False Sympathy by Mondo Grosso.
|
| The line "[01:17.000 --> 01:32.000] Translated by Releska"
| appears when using the translate-to-English task. That entire
| part of the song is instrumental. This line does not appear at
| all in the original transcription, only in the opus-format rip.
|
| It shows up in the yt rip in format 251 (opus), but not in format
| 140 (aac from youtube), nor the flac rip. All three are giving
| different results.
|
| The translation quality is tied to bitrate. The same song
| converted to different words, the only difference being bitrates
| and formats. Converting my own rip with the same parameters as yt
| (opus @140 and then @130) didn't allow me to reproduce this
| error.
|
| The model hung for a solid extra minute at the end when
| translating to English; the last 90-ish seconds of the song took
| 60 seconds of real time, while the entire rest took about 90. The
| same behavior was not observed with the transcribe task.
|
| Some of the English words are incorrect, but that was expected.
| The first Japanese "mistake" I found was "Quan tehaEr Ren no"
| instead of "subeteha hutarino", with the left being what whisper
| wrote. A single random word "hey" was transcribed/translated to
| English even though it's the singer elongating the Yuan while
| singing the Le Yuan . "Luo chiteyuku Er Ren deXi garetaEr Ren
| noragu HEY" instead of "Luo chiteiku Suo detsunagareta Er Ren
| noLe Yuan " .
|
| I am using the official subtitles released on the youtube video.
|
| It's a complex Japanese song with both Japanese and English, and
| the original transcription took about 20 real-time seconds to
| produce the first line, 130 seconds for the whole song. It seems
| to be showing results in 20 second window increments, but this
| seems to depend on what it considers audio and what it is
| throwing away.
|
| On my computer I wasn't able to use the large model because I ran
| out of VRAM; I have 8GB, not sure how much more it'd require. So
| I ran it with medium.
|
| The MV is suggestive, in case that matters. I grabbed a fresh
| audio rip from YouTube because I didn't want to take it out of my
| CD case.
|
| https://www.youtube.com/watch?v=B6Y-WsgpzlQ
|
| It is translating this version differently from the director's
| cut version. I ripped both as opus.
|
| There is something weird about how it is handling the opus
| encoded version, as I find the same "Translated by Releska" in a
| wav version transcoded from the opus.
| amrrs wrote:
| Here's a live demo on Hugging Face Spaces if you want to try -
| https://huggingface.co/spaces/Amrrs/openai-whisper-live-tran...
| clemnt wrote:
| this is amazing! got it working in French too
| TaylorAlexander wrote:
| Hey this looks great! I like to record audio notes while driving
| in my car after work, to kind of decompress my thoughts from the
| day. But I never go back and listen as they can be long and
| meandering. Sometimes in the audio log I will sum up my thoughts,
| but this might be 20 minutes in and hard to find. I really wish I
| had transcriptions so I could easily scan the full contents. I
| have tried Mozilla Deepspeech (I don't want a cloud solution) and
| I was surprised to find that I could not get Deepspeech to
| reliably transcribe them. There is a bit of road noise, though I
| think for a human listener they are easy to understand. It looks
| like this one might actually do the trick!
|
| EDIT: Tried it and it worked great! It is very easy to use.
I just did the pip install line in the readme and was ready to
| go. You literally just run the one pip install line, and then
| you run the program in the format "whisper my_audio.wav" and it
| goes. Really nice job OpenAI!
| zhynn wrote:
| I do this too! I have been doing it for about a year now, and
| haven't ever run into someone else that does this kind of
| audio-journaling. Would you be up for comparing notes sometime
| about how it is working out for you? I am finding that it is an
| extremely effective form of self-care, but with lots of
| personal caveats. I would be so interested to hear your
| experience.
| blueberrychpstx wrote:
| Count me in!! Working on tools actually to turn these
| transcriptions into something more social
| tekacs wrote:
| I do this too, and I've built some software for it just for
| myself.
|
| I'd love to chat and hear about how you use this! My email is
| in my profile, or I'm @tekacs on Twitter (and everywhere). :)
| TaylorAlexander wrote:
| Oh cool! Yeah I have stopped doing it lately as I was not
| really using them (I would like to use them for making rough
| notes for future youtube video scripts), though in general it
| does seem like good self care too even if I don't review
| them. That said I just tried the base model on one of my
| voice logs and it was pretty good! Trying the medium model
| now and it seems basically perfect. So I will have to start
| doing these logs more!
|
| Anyway I am pretty terrible with email but short exchanges
| can work for me, or maybe we can connect over signal. Send me
| a message to my email in my profile and I would be happy to
| sync up!
| Snitch-Thursday wrote:
| Google's recorder app for android will let you record audio
| files and make some transcriptions, right on the device.
| Tenoke wrote:
| I just tested it and it was pretty mediocre, at least with my
| accent. I can definitely benefit from a decent app for quick
| note recording with a button press->transcribe->upload to
| gdrive/good UI app for later grepping.
| TaylorAlexander wrote:
| Was this with the default base model, or the medium or
| large model? This can be specified with the --model flag.
| Tenoke wrote:
| I meant the 'Google's recorder app' from the parent
| comment and not Whisper.
| capableweb wrote:
| Is that application actually doing on-device transcription?
| Under "Data safety" on the Google Play page it says "This app
| may share these data types with third parties: Audio" which
| doesn't exactly instill confidence that my audio will 100%
| always stay on my device. It also says "Data is encrypted in
| transit", but if data stays on the device, why does it have
| to be "encrypted in transit"? There should be no transit at
| all.
| petercooper wrote:
| I'll probably explore using this, but I've used an app called
| Just Press Record to do what you say. Runs on Apple Watch too,
| so you can tap a complication at any time in the day, speak,
| and you get a transcript on your phone, etc.
| anigbrowl wrote:
| Oh nice - I have an immediate use case for this. This looks
| accessible enough that the sci-fi dream of instantaneous audio
| translation is suddenly within reach.
| petercooper wrote:
| Just tested this on some developer podcasts which usually fail
| hard given they're full of technical jargon, brand names, etc.
| Whisper is a revolution! It's picking up terms like Heroku,
| DigitalOcean, GitHub, ECS, AWS, etc. and capitalizing properly -
| something nothing else did unless you provided a whole pile of
| guiding vocabulary.
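|
| If you want to run the same kind of jargon spot-check locally,
| something like this works (a sketch; "episode.mp3" and the term
| list are placeholders, and medium.en is just one reasonable
| model choice):
|
| ```
| import whisper
|
| model = whisper.load_model("medium.en")
| result = model.transcribe("episode.mp3")
|
| # See how brand names and acronyms come out, capitalization
| # included.
| for term in ["Heroku", "DigitalOcean", "GitHub", "ECS", "AWS"]:
|     print(term, "found" if term in result["text"] else "missing")
| ```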
| ma2rten wrote:
| Did these podcasts have transcripts? You might be inadvertently
| evaluating it on data that it was trained on, which is
| basically cheating. Even if not, it might be trained on similar
| podcasts. Judging how good these kinds of models are is really
| hard.
| WiSaGaN wrote:
| True. The test should only be done on material released
| _after_ the model.
| Jnr wrote:
| Cool!
|
| I am one of the top contributors to the tiny Mozilla Common Voice
| data-set for my language. The data-set is very small compared to
| other popular languages, and none of the other data-sets
| mentioned contribute anything in my language to Whisper's
| training.
|
| And even with so little data to train on it still works
| surprisingly well.
| jdmoreira wrote:
| Looking forward to seeing if this works well with foreign accents
| mminer237 wrote:
| They have an example in the post with a very thick Scottish
| accent. You should listen to it. It's pretty impressive.
| localy wrote:
| Are there any published benchmarks available outlining how this
| compares to other open source ASR software, such as Coqui.ai?
| bickett wrote:
| Hard to keep up with all the great things. The AI community is
| really moving quickly right now.
| aidenn0 wrote:
| For those on NixOS, here's a quick and dirty flake.nix that will
| let you make a venv in which to "pip install".
|
| Just put it in a flake.nix, and "nix develop" followed by
| "virtualenv ./venv; . ./venv/bin/activate; pip install
| git+https://github.com/openai/whisper.git"
|
|     {
|       description = "Python 3.9 development environment";
|       outputs = { self, nixpkgs }:
|         let
|           system = "x86_64-linux";
|           pkgs = import nixpkgs { inherit system; };
|         in {
|           devShells.${system}.default = pkgs.mkShell {
|             buildInputs = [
|               pkgs.ffmpeg
|               pkgs.python39
|               pkgs.python39Packages.pip
|               pkgs.python39Packages.numpy
|               pkgs.python39Packages.pytorch
|               pkgs.python39Packages.virtualenv
|             ];
|           };
|         };
|     }
| aidenn0 wrote:
| This should, in theory, work with CUDA; my GPU doesn't have
| enough RAM to do it (it runs out at 2.9GiB allocated, I have
| 4GiB, but am running a compositing desktop, which chews up
| about 600MiB; not sure where the other ~400MiB went)
|
| [edit]
|
| I confirmed CUDA worked with the "small" model, which used
| 3.3GB of GPU ram, and resulted in _much_ poorer recognition
| than the "medium" model on my CPU (but it ran at least two
| orders of magnitude faster).
|
|     {
|       description = "Python 3.9 development environment";
|       outputs = { self, nixpkgs }:
|         let
|           system = "x86_64-linux";
|           pkgs = import nixpkgs {
|             inherit system;
|             config.allowUnfree = true;
|             config.cudaSupport = true;
|           };
|         in {
|           devShells.${system}.default = pkgs.mkShell {
|             buildInputs = with pkgs; [
|               cudatoolkit linuxPackages.nvidia_x11 cudaPackages.cudnn
|               libGLU libGL
|               xorg.libXi xorg.libXmu freeglut
|               xorg.libXext xorg.libX11 xorg.libXv xorg.libXrandr
|               zlib ncurses5 stdenv.cc binutils ffmpeg
|               python39 python39Packages.pip
|               python39Packages.numpy
|               python39Packages.pytorch-bin
|               python39Packages.virtualenv
|             ];
|             shellHook = ''
|               export LD_LIBRARY_PATH="${pkgs.linuxPackages.nvidia_x11}/lib"
|             '';
|           };
|         };
|     }
| magicalhippo wrote:
| CUDA worked fine with large on my 2080Ti FWIW. The speedup is
| ridiculous, as expected. My Ryzen 3800X used almost an hour
| transcribing a minute's worth of speech, while the 2080Ti does
| it in like 10-20 seconds.
| BasilPH wrote:
| Any opinions on what this means for speech-to-text companies like
| rev.ai and assembly.ai?
|
| We've tested open source solutions for s2t, like Kaldi, but the
| quality was not good enough. However, one of the main advantages
| of a service like assembly.ai to me was that they offer sentence
| splitting in the form of punctuation, and speaker detection,
| which Kaldi does not.
|
| So I guess I answered my own question to some degree: an S2T
| service is more than just S2T. We already see assembly.ai add
| more and more features (like summarisation, PII redaction etc.)
| that are a value-add to plain S2T.
|
| Still, curious to hear what your take on that is.
| nshm wrote:
| You can apply the public punctuation model from Vosk on top of
| Kaldi output; you can also get speaker labels with existing
| open source software.
|
| On a quick video transcription test this model is more accurate
| than AssemblyAI and Rev AI. It will be harder for them to sell
| pure ASR now. Some more business-oriented applications will
| still be important though, for example ASR as part of a
| callcenter analytics solution or as part of a medical ERP
| system.
|
| The value of automatic summarization is small; without AI it is
| very hard to make it right, you need to be an expert in the
| field to understand what is important.
| adeptima wrote:
| Japanese results look pretty impressive!
|
| Took matsukoukuzira14Tou gaHai An niDa chiShang gerareru
| osutoraria(2022Nian 9Yue 21Ri )
| https://www.youtube.com/watch?v=bZkNIzeRBk4
|
| Extracted audio with youtube-dl -f bestaudio
| https://www.youtube.com/watch\?v\=bZkNIzeRBk4
|
| Converted into:
| [00:00.000 --> 00:13.000] osutorariaNan Bu noDao de, Zhen tsuXiang kuzira14Dong gaHai An niDa chiShang gerareteSi ndeirunogaJian tsukari, Zhuan Men Jia gaDiao Cha notameYuan Di Ru rishimashita.
| [00:13.000 --> 00:25.000] Yuan Di medeianiyorimasuto, osutorariaNan Bu nokinguDong de, 19Ri , Shao nakutomo14Dong noZhen tsuXiang kuziragaHai An niDa chiShang gerareteSi ndeirunogaJian tsukarimashita.
| [00:25.000 --> 00:31.000] hotondogaRuo iosutowoJian rare, Zhuan Men Jia gaXian Chang niZhong mukiDiao Cha niDang tatsuteimasu.
| [00:31.000 --> 00:41.000] kuziranoSi Hai haDa kikuYun ndariMai metarisurukotogaNan shiitame, Zi Ran niFen Jie sarerunowoDai tsuFang Zhen gaJian Tao sareteimasu.
| [00:41.000 --> 00:52.000] mata, Si Hai woJu i, samegaHai niJi maruKe Neng Xing gaarutoshite, Yuan Di Dong Ju hasahuanadoniZhou Wei niJin dukanaiyouniHu bikaketeimasu.
| [00:52.000 --> 01:02.000] Yi Fang , 21Ri nihatasumaniaDong deoyoso230Dong nokuziragaBang Bian niDa chiShang geraretaZhuang Tai deJian tsukarimashita.
| [01:02.000 --> 01:07.000] oyosoBan Shu gamadaSheng kiteiruMo Yang deJi Zhu Huo Dong gaJin merareteimasu.
| [01:07.000 --> 01:23.000] Jian tsukatsutanoha, gondokuziranoZhong Jian toJian rareteimasu.
| knaik94 wrote:
| Did you try translating them to English? I want to see if you
| get a similar error as me, with a random phrase "Translated by
| Releska" showing up.
| gzer0 wrote:
| Shocked at how good the results are, and how easy of an
| installation it is.
|
| Here are the exact steps to follow to get it running on Ubuntu
| 22.04 via WSL and yt-dlp:
|
|     1. pip install git+https://github.com/openai/whisper.git
|     2. yt-dlp -f 'ba' -x --audio-format mp3 https://www.youtube.com/watch/?v\=bZkNIzeRBk4
|     3. renamed the file to test.mp3
|     4. whisper test.mp3 --language Japanese --task translate --model large
|
| Note: the large model will download a ~3GB file
| tullie wrote:
| Great to see OpenAI finally being open :)
| nicholasjarnold wrote:
| This is so cool!
I was just speaking to a non-technical family
| member about privacy concerns around using "OK Google" and the
| like. They responded inquiring about "private" alternatives, to
| which my answer was "I'm not aware of good ones that give you
| that level of accuracy and convenience."
|
| Perhaps this development, along with continued optimization and
| device compute power increases, will lead us into a near-future
| where things like Mycroft devices and cellphones could have
| local-only speech-to-text and translation capabilities which are
| accurate even with the environmental background noise variations
| encountered IRL.
|
| Great work OpenAI team!
| mwlp wrote:
| Super impressive. I tested it on a Japanese streamer whose
| enunciation isn't exactly perfect and it did a decent job:
| https://www.youtube.com/watch?v=ROiOU1scaNA
|
| [00:00.000 --> 00:06.500] Since the last one started, the number of times I've eaten has decreased.
| [00:06.500 --> 00:11.000] If I get too carried away with the last one, I'll get hungry and do it.
| [00:11.000 --> 00:14.500] I don't have time to eat.
| [00:15.500 --> 00:18.000] I'm going to eat now.
| [00:20.000 --> 00:23.000] It's going to take about 10 minutes from here.
| [00:23.000 --> 00:31.000] It's been a while since I've had my last meal.
| [00:31.000 --> 00:36.000] I feel like I'm losing myNu Zi Li .
| [00:36.000 --> 00:39.000] I have to go back to my original self.
| [00:39.000 --> 00:44.000] I have to get ready and go to bed.
| [00:44.000 --> 00:46.000] It's not good.
| [00:46.000 --> 00:51.000] I've been drinking a lot lately, so I'm going home.
| [00:51.000 --> 00:53.000] I have to get my nails done this fall.
| [00:53.000 --> 00:54.000] Halloween nails.
| [00:54.000 --> 00:57.000] Halloween, Halloween, Halloween.
| [00:57.000 --> 00:59.000] I'm going to the beauty salon today.
| [00:59.000 --> 01:02.000] I'm going to get my nails done the day after tomorrow.
| [01:02.000 --> 01:10.000] I used to look at a lot of clothes, but I stopped looking at them.
| [01:10.000 --> 01:12.000] I'm going crazy.
| [01:12.000 --> 01:22.000] My stomach's stopped in the middle of summer.
| adeptima wrote:
| Translation is not the strongest part. Transcription looks very
| good.
| magicalhippo wrote:
| It's struggling with Norwegian. Which I guess isn't shocking.
| The large model performs a fair bit better than the small,
| though neither is "good".
|
| Though I assume the amount of Norwegian it has been exposed to
| is fairly limited, so in that light I'm actually impressed as
| well.
|
| I tried it on a news segment from the radio[1], this is the
| large model output:
|
| [00:14.000 --> 00:17.200] En skamløs krenking av FN pakten.
| [00:17.200 --> 00:24.000] USAs president og verdensledere svarer på den russiske presidentens atomtrusler og krigsmobilisering.
| [00:25.500 --> 00:29.400] Arbeidsklær som er ment til å være til begge kjønn, har det med å være tilpasset.
| [00:29.400 --> 00:33.400] Men hvordan ville det gått, om det var motsatt?
| [00:34.100 --> 00:38.900] Dyrevernsorganisasjon vil ha digital merking av regnstyr,
| [00:38.900 --> 00:44.900] men næringen selv insisterer på den gamle tradisjonsrike måten med rissing av kniv.
| [00:45.600 --> 00:51.400] Mange strømselskaper er positive til å tilby kundene fastpris på strøm, og det årevis.
| [00:51.400 --> 00:59.900] Da risikerer de å måtte betale mye i nettopp åretsvis, sier aktører som aldri tilbyr fastpris.
| [00:59.900 --> 01:21.900] Dette er onsdagens Dagsnytten. Jeg heter Espen Ås.
| | For reference, here's what he actually said, from the source[1] itself:
| * En skamløs krenking av FN-pakten. USAs president og verdensledere svarer på den russiske presidentens atomtrusler og krigsmobilisering.
| * Arbeidsklær som er ment å være til begge kjønn, er som regel tilpasset ... menn. Hvordan hadde det gått om det var motsatt?
| * Dyrevernsorganisasjon vil ha digital merking av reinsdyr, men næringen selv insisterer på den gamle tradisjonsrike måten med rissing av kniv.
| * Mange strømselskaper er positive til å tilby kundene fastpris på strøm - og det i årevis. - Da risikerer de å måtte betale mye i nettopp; årevis, sier aktør som aldri tilbyr fastpris
| Dette er onsdagens Dagsnytt 18 - jeg heter Espen Aas.
| | The translation didn't fare that well though:
| [00:14.000 --> 00:17.000] A shameless violation of the UN treaty.
| [00:17.000 --> 00:24.000] The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
| [00:24.000 --> 00:33.000] Work clothes that are meant to be for both genders have to be suitable, but how would it be if it was the other way around?
| [00:34.000 --> 00:44.000] The animal welfare organization will have a digital marking of reindeer, but the industry itself insists on the old traditional way of tearing a knife.
| [00:45.000 --> 00:51.000] Many electricity companies are positive in offering customers fixed electricity prices, and that is annual.
| [00:51.000 --> 00:58.000] Then they risk having to pay a lot in just a year, says an actor who has never offered fixed prices.
| [00:58.000 --> 01:20.000] This is Wednesday's Dagsnytt 18. My name is Espen Ås.
| | For reference, here's Google Translate's attempt, which is pretty good:
| * A shameless violation of the UN Charter. The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
| * Work clothes intended for both sexes are usually adapted to ... men. How would it have gone if it had been the other way around?
| * Animal welfare organizations want digital marking of reindeer, but the industry itself insists on the old, traditional way of marking with a knife.
| * Many electricity companies are positive about offering customers a fixed price for electricity - and for years. - Then they risk having to pay a lot in precisely; for years, says a player who never offers a fixed price
| This is Wednesday's Dagsnytt 18 - my name is Espen Aas.
| | [1]: https://radio.nrk.no/podkast/dagsnytt_atten/l_5ce3e323-97a3-... (not sure if it's available outside of Norway)
| kiwih wrote:
| Given this, are there good (and available/open source) models for text to speech? Last time I tried, everything still sounded extremely robotic and/or was a pain to set up and run. It would be fun to set up a pipeline where the two processes 'communicate'.
| obscur wrote:
| Measuring performance in rounds of successful Chinese whisper
| | (irony)
| pen2l wrote:
| Neat, https://github.com/openai/whisper - they have open-sourced it, even the model weights, so they are living up to their name in this instance.
| | The 4 examples are stunningly good (the examples have speakers with heavy accents, speaking in foreign languages, speaking with dynamic background noise, etc.); this is far and away better than anything else I've seen.
Will be super curious to see other folks trying it out and seeing if it's as robust as it seems, including when confronted with speech with natural tics and uhhh's and uhmm's and everything in-between.
| | I think it's fair to say that AI-transcription accuracy is now decidedly superior to the average human's; what the implications of this are, I'm not sure.
| anigbrowl wrote:
| It was already better. I edit a podcast and have > a decade of pro audio editing experience in the film industry, and I was already using a commercial AI transcription service to render the content to text and sometimes edit it as such (outputting edited audio).
| | Existing (and affordable) offerings are so good that they can cope with shitty recordings off a phone speaker and maintain ~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement and other people who need to gather poor-quality audio at scale, though much less great for the targets of repressive authority.
| | Having this fully open is a big deal though - now that level of transcription ability can be wrapped as an audio plugin and just used wherever. Given the parallel advances in resynthesis and understanding idiomatic speech, in a year or two I probably won't need to cut out all those _uuh like um y'know_ by hand ever again, and every recording can be given a noise reduction bath and come out sounding like it was recorded in a room full of soft furniture.
| adamgordonbell wrote:
| I've not found that to be the case.
| | For technical content, I use Rev.com and provide a glossary, and real humans do the transcript. Other AI transcription services get lots wrong because the context often matters. Words like "TCP/IP", "FAT disk format", or "Big Endian" I've never found AI to handle well so far.
| | I'm interested to test out Whisper on this one.
| | https://corecursive.com/063-apple-2001/
| deegles wrote:
| There's already software that can imitate a person's voice, so we have all the pieces already to do speech-to-text, clean up with GPT-3, and text-to-speech back in the original person's voice. Maybe with a style transfer to keep the person's inflections etc the same?
| Karuma wrote:
| I think something similar already exists. See this, for example: https://koe.ai/recast/
| | Although I don't know if they're using anything similar to what you suggest. Very cool idea, anyway!
| biomcgary wrote:
| Since you work on podcasts, do any open source transcription tools currently identify the speaker in the output? This would be particularly helpful for interviews.
| solarmist wrote:
| Any recommendations for particular services?
| anigbrowl wrote:
| I use a service called sonix.ai. It's paid but I think they have a free tier or trial period, and it's not very expensive. I'm excited about this new OpenAI thing because I'd rather do it on my own hardware than send it to the cloud, but this company has earned its commercial success.
| solarmist wrote:
| That is an exciting possibility. Being able to fix bad setups and missed takes automagically. It's always been possible, just expensive and time-consuming for moderate improvements.
| thfuran wrote:
| >~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement
| | 97% accuracy means roughly three or four errors per minute of speech.
That seems potentially extremely problematic for something like law enforcement use, where decisions with significant impact on people's day and/or life might be made on the basis of "evidence".
| gs17 wrote:
| Yeah, I tried to use automated transcription for a research project and we had to do it all manually, because the few errors (I would say it did pretty well given our recording quality) were often dropped words like "not", which changed the whole meaning of a sentence! It was a useful assistance during transcription, but I really hope they would verify it was correct before arresting anyone based on it.
| anigbrowl wrote:
| No it isn't. That just means 2-3% of your content needs to be double-checked by a person at the audio level, saving huge amounts of time - equally true of human transcription, in which individual words are often [UNINTELLIGIBLE].
| | Would you want to review this fully before going into court? Absolutely - because you'd want to play the recording to a jury for emotional impact. Can you rely on it when you want to quickly read through hours of conversation and make decisions about whether to invest further resources (which might just mean another hour of listening back to the original audio)? Also absolutely. Bear in mind that a lot of these errors have little to no semantic impact, being on the same level as typos or misspellings in a written communication.
| | Bear in mind too that if law enforcement (honest or not) is so interested in you that they're willing to record your conversations, your day is already ruined, you just don't know it yet. The change here is one of scale rather than quality.
| wging wrote:
| Doesn't it mean 100% of your content needs to be double-checked? You can't easily identify which 2-3% of your content has errors. I'm aware that errors are more likely when the model is less confident of its predictions, but that shouldn't be enough.
| | (edit for clarification: errors are not always something like "[UNINTELLIGIBLE]", where the system knows it doesn't know; they can also be misrecognitions that the system believes in with high confidence.)
| woah wrote:
| You double-check things that you think are important, in this case, passages that will be used as evidence in court.
| guelo wrote:
| Maybe you could run the text through a grammar checker to identify the errors.
| anigbrowl wrote:
| By the time you're prosecuting someone in court, yes, of course you double, triple, quadruple check everything. That's why lawyers get paid the big bucks (for now...). But yes, you can identify which content probably has errors and flag it as such.
| | Look, I have decades of experience dealing with human speech, and not just as an editor - I can trace the human voice from neural impulses in Broca's region through the physiology of vocal production, mechanical transduction into electrical signals, discrete Fourier transforms of the resultant waveforms into spectral information and back again, the reproduction of altered signals from time-aligned speakers to create a sense of spatialization, how those are processed in the human ear, and how the cilia are connected by nerves back to your brain. I'm a good enough editor that I can recognize many short words by sight of a waveform, or make 10 edits in a row by sight and know it will sound good on playback.
| | So when I say that machine transcription is as good as human realtime transcription now, I say so with the clear expectation that those decades of craft are very close to being rendered obsolete. I absolutely expect to hand off the mechanical part of editing to a machine within 2 years or so. It's already at the stage where I edit some interviews as text, like in a word processor, and then export the edited document as audio and it's Good Enough - not for every speaker, but more than half the time.
| | NPR and a lot of commercial broadcasters cut their material this way already, because you can get the same result from 30 minutes of reading and text editing that would require 3 hours of pure audio editing with no transcription.
| etienne618 wrote:
| Presumably you can use the 97% that is correctly transcribed to rapidly filter out the relevant content. This is likely to be only a small portion of the total content. Then you check 100% of that.
| datalopers wrote:
| If you know which 2-3% are the false positives, you have a very lucrative business model.
| MonkeyMalarky wrote:
| When doing validation, I find it will often be the same errors repeated again and again in a transcription. Like it will fail on someone's or something's name (that is rare / unique) and map it onto a known, similar-sounding word.
| gnramires wrote:
| I think an [UNINTELLIGIBLE] indication would be a great addition to automatic transcription systems.
| inanutshellus wrote:
| It'd [UNINTELLIGIBLE score="92%" alternatives="pro-rabble; pourable"]probably[/UNINTELLIGIBLE] be useful to make a markup-based output... though you'd probably find it gave you more info than you wanted.
| anigbrowl wrote:
| It already exists. The commercial product I use most is called sonix.ai and I think they have a free tier or trial period. It has some shortcomings, but it's shockingly good.
| thfuran wrote:
| >equally true of human transcription, in which individual words are often [UNINTELLIGIBLE].
| | ML systems somewhat notoriously do not necessarily make the same sorts of errors that a human would. And I'd expect a large portion of the errors to be transcribing the wrong words rather than indicating that a word couldn't be transcribed. That sort of error means that you can't really get away with manually reviewing just 3% of the audio.
| golem14 wrote:
| One would think that the few crucial bits of information gleaned are listened to manually, and the machine transcription is not the only thing the judge or a jury sees.
| thfuran wrote:
| You have absolutely ruined someone's day way before they're sitting in front of a jury.
| formerly_proven wrote:
| Stuff like that is a very good tell that someone has zero experience with law enforcement.
| j-krieger wrote:
| I've worked with similar technology in the law enforcement space and the software is never used to make decisions. You can mark out critical timestamps in conversations, and a law enforcement officer will always manually confirm the software's assessments.
| JohnFen wrote:
| Given that law enforcement has made similar claims about technology use in the past that turned out to be false, I have no faith in this claim.
| hadlock wrote:
| Microsoft announced their voice transcription technology a couple years ago and were also touting ~97-98% accuracy, which was actually _better_ than human transcription error rates.
The errors are usually in part due to people garbling their own speech, or moving their head while talking so the microphone misses a syllable. Anything in that error bar would probably fall under "reasonable doubt".
| kyriakos wrote:
| If it's anything like Microsoft Teams transcription, I doubt the 97%+ accuracy.
| soheil wrote:
| Their name reminds me of the company McDonald's uses to supply their beef, called 100% Pure Beef Inc., so they can say 100% Pure Beef on their menu.
| space_fountain wrote:
| This seems to not be true for McDonald's: https://www.snopes.com/fact-check/mcdonalds-100-beef/
| soheil wrote:
| This article seems very suspect to me. This is the main reason they assert why the claim is false:
| | "While this is a fascinating premise, there's nothing to it: McDonald's hamburger patties in the U.S. are made with 100% USDA-inspected beef. They are cooked and prepared with salt, pepper and nothing else; no preservatives, no fillers.
| | McDonald's of Australia's "Make Up Your Own Mind" web site said the following of the rumor in its Top FAQs section: Is it true that McDonald's created a company called "100% Australian Beef" just so they can say that in their advertising? No."
| | So if I'm McDonald's and want to squash a negative story, why not throw a few bucks at the pinnacle of journalism that is Snopes? (formerly Urban Legends Reference Pages)
| space_fountain wrote:
| This isn't exactly a hard story to fact-check. There is 0 evidence for this in either the reddit thread or really anywhere. If they were willing to lie about the company name, why not just lie about the beef in their burgers? It would be equally scandalous.
| soheil wrote:
| The company name could be 100% legit; there is nothing stopping you from forming a company with that name and not even selling beef.
| sam_goody wrote:
| It definitely happens.
| | There are at least two companies that have branded [..] Kosher Gelatin(tm). One of them makes gelatin that is considered non-kosher by all of the major kashrus agencies.
| | "Kosher Gelatin(r)", when in the ingredients, just means the product contains pork.
| jsight wrote:
| You are right, it could be. The problem is that it's the kind of thing that would be almost impossible to disprove if it were false. So you can always raise doubts about a supposed disproof.
| | But it'd be really easy to prove if it were true, and no one has offered proof. And there've been plenty of people who've looked for such proof, afaict.
| | My default assumption in such cases is that it is likely false.
| jefftk wrote:
| If this was more than an urban legend, someone would be able to dig up a company with this name and some indication that McD was working with them.
| pessimizer wrote:
| Something being possible to do isn't enough evidence for rational people to believe that it happened. From my perspective, it's possible that you're Iron Mike Tyson, or that you died after your last comment and this one was posted by the assassin who killed you.
| soheil wrote:
| What? I never said it's evidence that it did happen; please don't make things up. I just pointed out the evidence provided to refute the claim is possibly invalid.
| pessimizer wrote:
| You haven't offered any evidence is the point.
| [deleted]
| whichfawkes wrote:
| In the US, for a while I remember we had billboards advertising McDonald's burgers as being "1 <hamburger> <hamburger>% beef".
Because the hamburgers were of course circular, it looked kind of like "100%".
| | I remember thinking that surely an image of a hamburger does not legally constitute a zero.
| leobg wrote:
| Seems like this is an urban legend.
| | https://www.reddit.com/r/IsItBullshit/comments/2rztov/isitbu...
| soheil wrote:
| This seems to be primarily based on the referenced Snopes article https://news.ycombinator.com/item?id=32929237
| [deleted]
| bambax wrote:
| The French version is a little contrived. The speaker is a native speaker, but the text is obviously the result of a translation from English to French, not idiomatic French.
| | I will try to put the code to the test and see how it goes.
| octref wrote:
| I'm interested in building something with this to aid my own French learning. Would love to read your findings if you end up posting them somewhere like twitter/blog!
| bambax wrote:
| Tried again with Blaise Pascal -- the famous fragment of a letter where he says he's sorry he didn't have enough time to make it shorter.
| | Original:
| | > _Mes révérends pères, mes lettres n'avaient pas accoutumé de se suivre de si près, ni d'être si étendues. Le peu de temps que j'ai eu a été cause de l'un et de l'autre. Je n'ai fait celle-ci plus longue que parce que je n'ai pas eu le loisir de la faire plus courte. La raison qui m'a obligé de me hâter vous est mieux connue qu'à moi. Vos réponses vous réussissaient mal. Vous avez bien fait de changer de méthode ; mais je ne sais si vous avez bien choisi, et si le monde ne dira pas que vous avez eu peur des bénédictins._
| | Transcription:
| | > Mes rêves errent pères, mais l'detre navais pas accoutumé de se suivre de si près ni d'detre si étendu. Le peu de temps que j'sais eu a été cause de l'de l'de l'de autre. J'sais n'detre plus longue que parce que j'sais pas eu le loisir de la faire plus courte. La raison qui m'sa obligée de me hâter vous est mieux connue qu'moi. Vos réponses vous réussissaient mal. Vous avez bien fait de changer de méthode, mais je ne sais pas si vous avez bien choisi et si le monde ne dira pas que vous avez eu peur des bénédictes.
| | Here there are many more mistakes, so many that the beginning of the text is unintelligible. The language of the 17th century is probably too different. Still on the "medium" model, as the large one crashes the Colab (not sure how to select a beefier machine).
| | Still fascinating and exciting though.
| bambax wrote:
| I'm playing with a Colab posted in this thread (https://news.ycombinator.com/item?id=32931349), and it's incredibly fun and accurate!
| | I tried the beginning of L'Étranger (because you seem to be a fan of Camus ;-)
| | Here's the original:
| | > _Aujourd'hui, maman est morte. Ou peut-être hier, je ne sais pas. J'ai reçu un télégramme de l'asile : « Mère décédée. Enterrement demain. Sentiments distingués. » Cela ne veut rien dire. C'était peut-être hier._
| | > _L'asile de vieillards est à Marengo, à quatre-vingts kilomètres d'Alger. Je prendrai l'autobus à deux heures et j'arriverai dans l'après-midi. Ainsi, je pourrai veiller et je rentrerai demain soir. J'ai demandé deux jours de congé à mon patron et il ne pouvait pas me les refuser avec une excuse pareille. Mais il n'avait pas l'air content. Je lui ai même dit : « Ce n'est pas de ma faute. » Il n'a pas répondu. J'ai pensé alors que je n'aurais pas dû lui dire cela. En somme, je n'avais pas à m'excuser.
C'était plutôt à lui de me présenter ses condoléances._
| | Here's the transcription:
| | > Aujourdhui, maman est morte, peut être hier, je ne sais pas. J''ai reçu un télégramme de l''asile. Mère décédée, enterrement demain, sentiment distingué. Cela ne veut rien dire. C''était peut être hier.
| | > L''asile de Vieillard est à Maringot, à 80 km d''Alger. Je prendrai l''autobus à deux heures et j''arriverai dans l''après midi. Ainsi, je pourrai veiller et je rentrerai demain soir. J''ai demandé deux jours de congé à mon patron et il ne pouvait pas me les refuser avec une excuse pareille. Mais il n''avait pas l''air content. Je lui ai même dit, ce n''est pas de ma faute. Il n''a pas répondu. J''ai alors pensé que je n''aurais pas dû lui dire cela. En somme, je n''avais pas à m''excuser. C''était plutôt à lui de me présenter ses condoléances.
| | Except for the weird double quotes instead of the single apostrophe ('), it's close to perfect, and it only uses the "medium" model.
| | This is extremely exciting and fun! Happy to try other texts if you have something specific in mind!
| bambax wrote:
| Last try for tonight, with Baudelaire.
| | Original:
| | Trois mille six cents fois par heure, la Seconde
| Chuchote : Souviens-toi ! - Rapide, avec sa voix
| D'insecte, Maintenant dit : Je suis Autrefois,
| Et j'ai pompé ta vie avec ma trompe immonde !
| | Remember ! Souviens-toi ! prodigue ! Esto memor !
| (Mon gosier de métal parle toutes les langues.)
| Les minutes, mortel folâtre, sont des gangues
| Qu'il ne faut pas lâcher sans en extraire l'or !
| | Transcription:
| | > Trois mille six cents fois par heure, la seconde chuchote « Souviens toi », rapide, avec sa voix d''insecte, maintenant dit « Je suis autrefois », et j''ai pompé ta vie avec ma trompe immonde. « Remember, souviens toi, prodigue, est au mémoire, mon gosier de métal, parle toutes les langues, les minutes, mortelles folâtres, sont des gangs qu''il ne faut pas lâcher sans en extraire l''or. »
| | Not bad! Far from perfect, but it's a difficult text. Interesting that it works better with Baudelaire than Pascal.
| pen2l wrote:
| Interesting. I'm a non-native French speaker, and the original French piece struck me as being entirely normal (but maybe it was just the perfect French accent that swayed me). Can you please point out what he said which wasn't idiomatic or naturally-worded French?
| bambax wrote:
| Little details. The second sentence is really bizarre:
| | > _Nous établissons que l'utilisation de données d'un tel nombre et d'une telle diversité est la raison pour laquelle le système est à même de comprendre de nombreux accents..._
| | It doesn't sound natural at all. An idiomatic formulation would be more along the lines of:
| | _Le recours à un corpus [de données] si riche et varié est ce qui permet au système de comprendre de nombreux accents_ (With 'corpus', 'données' is implied.)
| | Of course this is just an example, and I'm sure other French speakers could come up with a different wording, but "données d'un tel nombre et d'une telle diversité" sounds really wrong.
| | This is also weird and convoluted:
| | > _Nous distribuons en tant que logiciel libre le code source pour nos modèles et pour l'inférence, afin que ceux-ci puissent servir comme un point de départ pour construire des applications utiles_
| | It should at least be "le code source DE nos modèles" and "servir DE point de départ", and "en tant que logiciel libre" should be placed at the end of the clause (after 'inférence').
| | Also, "construire" isn't used for code but for buildings, and "applications utiles" is unusual, because "utiles" (useful) is assumed. "...pour le développement de nouvelles applications" would sound more French.
| [deleted]
| _plg_ wrote:
| At the start, the "Nous établissons" part, for example. You wouldn't write that if you were starting from scratch in French.
| not_math wrote:
| You can see from the transcript where the model made some errors, for example:
| | > We distribute as a free software the source code for our models and for the inference [...]
| | Should be
| | > We are open-sourcing models and inference code [...]
| | Another example
| | > We establish that the use of such a number of data is such a diversity and the reason why our system is able [...]
| | Should be
| | > We show that the use of such a large and diverse dataset leads to improved robustness [...]
| Workaccount2 wrote:
| Can't wait to see twelve new $49.99/mo speech parser services pop up in the next few weeks.
| suyash wrote:
| More of this is welcome; they should live up to their name and original purpose and share other models (code, weights, datasets) with the open source community as well.
| Simorgh wrote:
| I've been experimenting with voice interfaces where typing is replaced by talking, but I find it hard to transition users to voice - we 'seem' to prefer typing to talking.
| | I wonder if this will change.
| ironlake wrote:
| Personally, I would rather type than talk when interacting with a computer. The only time I use voice interfaces is when the physical interface is so poor it's just easier to use voice. Apple TV devices are an example of this.
| shpx wrote:
| We shouldn't call this open source. The model definition + the data is the source code. The model weights are a compilation artifact.
| | > The source code must be the preferred form in which a programmer would modify the program. [...] Intermediate forms such as the output of a preprocessor or translator are not allowed.
| | > https://opensource.org/osd
| | If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model's source code would include the 680,000 hours of audio used to train the model.
| | Yes, that means that there are almost no open source models, and yes, it's awesome that they released this and made the weights available. Just don't call it open source.
| lfmunoz4 wrote:
| sergiotapia wrote:
| Does this work with multiple speakers?
| | I want to build a tool that takes a video and generates subtitles for it; then I want to index the subtitles and let people search for a specific quote to scrub to that part of the video using automatically generated URLs.
| | This is for a specific fandom with a ton of content: lots of dirty audio, mostly recorded in a gym setting with multiple people speaking.
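The subtitle half of that idea maps almost directly onto the Python API shown in the repo's README: `transcribe` returns per-segment timestamps, which can be written out as an SRT file and indexed for search. A minimal sketch with placeholder file names (note that Whisper itself does not separate speakers):

    import whisper

    def srt_time(seconds):
        # Format seconds as the SRT timestamp HH:MM:SS,mmm
        ms = int(seconds * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    model = whisper.load_model("base")
    result = model.transcribe("episode.mp3")  # any ffmpeg-readable file

    # Each segment carries start/end times in seconds plus its text,
    # which is enough both for subtitles and for a search index.
    with open("episode.srt", "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], start=1):
            f.write(f"{i}\n")
            f.write(f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
            f.write(seg["text"].strip() + "\n\n")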
| 867-5309 wrote:
| Pretty sure such a tool made the HN front page a few months ago.
| isoprophlex wrote:
| Really incredible to see that their multilingual audio-to-English approach is viable. I'm super excited about this, and it's great to see that OpenAI actually opens up about something, for once.
| | Skimming the codebase, I can't immediately see code to do additional training.
| | Being able to fine-tune the model to a specific language or case (e.g. teach it specifically about some technical topic that might not be so prevalent in the current training set) would be majorly disruptive to the current SOTA in "callcenter analytics" tech. Especially when combining Whisper with GPT-3.
| samstave wrote:
| AI speech recognition FN scares the heck out of me...
| | for so many reasons.
| | But one that really pisses me off is not being able to turn it off on the iPhone, and the fact that, beyond "hidden cameras in my Airbnb" -- soon we will have to worry about secret listening machines EVERYWHERE.
| jfoster wrote:
| Also, based on their demo, this model seems like it might have comprehension well above the level of a typical human.
| | Anyway, it's out there now. No way to turn back.
| ma2rten wrote:
| We will see an explosion of AI capabilities in the next couple of years. This will have a huge impact on our lives, much of it good but some of it also bad.
| samstave wrote:
| "Good" for ensuring you're a compliant consumer - bad if you're an individual person
| wongarsu wrote:
| "Secret listening machines everywhere" was a pretty big thing in East Germany. It's also the central theme of the movie The Lives of Others.
| | Of course, the ability to scale this more cheaply (throwing more compute at it, instead of more people) is somewhat scary, but it's not really introducing a new capability. Especially since you still have to do something with the transcript. An Airbnb landlord who reads the transcript of what you said could just as well have listened to the recording.
| ALittleLight wrote:
| I think it's a new capability to add good speech-to-text, search, and models that can understand and process text. You have microphones recording speech everywhere, models turning that speech into easily searchable text, and something like GPT-3 reading all the speech and raising red flags for any transgressive idea you please.
| samstave wrote:
| Yes, and if you want AI that is searching for "dissenters", we shall soon have "speech police" or tickets or some format of authoritarian punitive actions powered by this.
| zappy42 wrote:
| "John Spartan, you have been fined one credit for violation of the Verbal Morality Statute."
| jffry wrote:
| I'd argue that cheap, pervasive, always-on surveillance with a backlog of searchable transcriptions is a qualitatively different capability.
| samstave wrote:
| Exactly.
| | We are entering the next era...
| | The Kurzweil podcast appearance on Lex Fridman is nuts, and while I love Kurzweil, holy crap, even with my dystopian outlook he makes it even worse when you listen to even half of it...
| gareth_untether wrote:
| I'm thinking of releasing a plugin for Unity that can be used to match a phrase to an action. Seeing Whisper makes me think I should include a way to use voice and not just text.
| aidenn0 wrote:
| I just threw a random rock MP3 at it, and a first readthrough shows no transcription errors; this is quite good.
| | Now I just want OCR that's even 50% as good as this...
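Reproducing that kind of experiment takes only a few lines once the package is installed via `pip install git+https://github.com/openai/whisper.git`; the file name below is a placeholder:

    import whisper

    # "medium" trades speed for accuracy; the README's model table lists the options
    model = whisper.load_model("medium")
    result = model.transcribe("some_song.mp3")
    print(result["language"])  # detected language
    print(result["text"])      # the full transcription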
| aidenn0 wrote:
| Ran a few other songs through it and found one obvious mistranscription:
| | "He's the bedroom cosmic rocker" (should be "He's the veteran cosmic rocker" in _Veteran Cosmic Rocker_ by The Moody Blues)
| | I also noticed that it's a little on the conservative side for detecting speech; all songs were missing at least part of one line.
| funhighway wrote:
| Would be nice to give more details about the provenance and construction of the training data.
| [deleted]
| StevenWaterman wrote:
| That example at the top of the page (speed talking) blew me away. He started talking, I was stunned for a minute, then realised yes, it really was English, and I just burst out laughing.
| | That's so, so far beyond the previous state-of-the-art, it's absurd.
| londons_explore wrote:
| @dang Can we change the link to the github here[1]?
| | It seems to describe the project better for a technical audience.
| | [1]: https://github.com/openai/whisper
| toss1 wrote:
| Like every model I've seen, there is something like this:
| | >>A decoder is trained to predict the corresponding text...
| | Prediction of expected text in the context of the previous text.
| | While this is valuable in casual transcription, it can be extremely dangerous in serious contexts.
| | From personal experience, having given a deposition with "AI" transcription, I can say it will literally reverse the meanings of sentences.
| | This is because it produces the _EXPECTED_ output in a context, and _NOT THE ACTUAL OUTPUT_.
| | Like a speaker that clips the output, these types of systems 'clip' the really valuable information out of a transcription. Worse yet, this is a completely silent failure, as the transcript _LOOKS_ really good.
| | Basic info theory shows that there is more information contained in 'surprising' chunks of data than in expected ones. These systems actively substitute 'expected' speech for 'surprising' speech.
| | The transcript I got was utter trash, multiple pages of errata I had to submit when the norm is a couple of lines. And as I said, some errors literally reversed the meaning in a consequential way, and yet completely silently.
| | This kind of silent active failure mode is terrifying. Unless it is solved, and I see no way to solve it without removing ALL predictive algos from the system, these types of systems must not be used in any situation of serious consequence, at least not without real redundancy and backup.
| sowbug wrote:
| I knew there was a reason why I kept my MP3 library even after subscribing to Spotify. Now piping everything through Whisper. So far the generated lyrics are reasonable, though it thinks the REM song says "Linnie Bruce is not afraid."
| | No surprise that it appears to have successfully transcribed all the recordings of Harvard Sentences I could find. https://en.wikipedia.org/wiki/Harvard_sentences
| hijp wrote:
| Anyone get it running on an M1 Mac?
| | I keep getting `ModuleNotFoundError: No module named 'setuptools.command.build'`
| kif wrote:
| I got requirements installed, but then when running the Python example, I get:
| | RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
| kif wrote:
| Probably need to pass some kind of options when initializing.
| The command itself works fine, just shows a warning:
| warnings.warn("FP16 is not supported on CPU; using FP32 instead")
| mewse-hn wrote:
| Using this in the sample code worked for me:
| | >>> options = whisper.DecodingOptions(fp16=False)
| dceddia wrote:
| Yep, I had this too. `pip3 install -U pip setuptools` took care of it. (If you get an error about pip3, try `pip` instead.)
| hijp wrote:
| I'm really new to pip, but does this look ok?
| | (after running the command for setuptools)
| Defaulting to user installation because normal site-packages is not writeable
| Requirement already satisfied: pip in /Users/xxx/Library/Python/3.9/lib/python/site-packages (22.2.2)
| Requirement already satisfied: setuptools in /Users/xxx/Library/Python/3.9/lib/python/site-packages (65.3.0)
| | ---- after trying whisper installation:
| x Getting requirements to build wheel did not run successfully.
| exit code: 1
| +-> [20 lines of output]
| Traceback (most recent call last):
|   File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
|     main()
|   File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
|     json_out['return_val'] = hook(*hook_input['kwargs'])
|   File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 130, in get_requires_for_build_wheel
|     return hook(config_settings)
|   File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/setuptools/build_meta.py", line 154, in get_requires_for_build_wheel
|     return self._get_build_requires(
|   File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/setuptools/build_meta.py", line 135, in _get_build_requires
|     self.run_setup()
|   File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/setuptools/build_meta.py", line 150, in run_setup
|     exec(compile(code, __file__, 'exec'), locals())
|   File "setup.py", line 2, in <module>
|     from setuptools_rust import Binding, RustExtension
|   File "/private/var/folders/lj/7x6d3dxd3cbdtt484k6xsmyh0000gn/T/pip-build-env-ieaydl8r/overlay/lib/python3.9/site-packages/setuptools_rust/__init__.py", line 1, in <module>
|     from .build import build_rust
|   File "/private/var/folders/lj/7x6d3dxd3cbdtt484k6xsmyh0000gn/T/pip-build-env-ieaydl8r/overlay/lib/python3.9/site-packages/setuptools_rust/build.py", line 23, in <module>
|     from setuptools.command.build import build as CommandBuild  # type: ignore[import]
| ModuleNotFoundError: No module named 'setuptools.command.build'
| [end of output]
| | note: This error originates from a subprocess, and is likely not a problem with pip.
| | error: subprocess-exited-with-error
| dceddia wrote:
| Nope, that doesn't look good! I honestly just googled the error and installing setuptools fixed it for me, but I barely know anything about the Python ecosystem so I'm really just fumbling around here.
| hijp wrote:
| haha same, thanks
| Smaug123 wrote:
| I'm still not successfully using the GPU, but it's working decently quickly (with the base model - it's incredibly slow to use the Large model) using just the CPU. I'm going to have to check what magic stable-diffusion is doing to enable the GPU :(
| dceddia wrote:
| There's a --device flag you can pass.
I've been trying to get `--device cuda` to work on my Windows machine and it's saying that torch wasn't compiled with CUDA. Trying to figure out what's going on there.
| | And on the M1, supposedly PyTorch has support for hardware acceleration using MPS (Metal Performance Shaders, announced here: https://pytorch.org/blog/introducing-accelerated-pytorch-tra...) but when I tried `--device mps` it blew up with an error "input types 'tensor<1x1280x3000xf16>' and 'tensor<1xf32>' are not broadcast compatible".
| Smaug123 wrote:
| Yep, same for me: on M1, after enabling MPS (with `model.to("mps")`) it just either SIGSEGVs or SIGABRTs every time with that line. The extremely unclean nature of the abort is making it hard to debug :(
| dceddia wrote:
| I noticed the size seems to correspond to the model. With a large model, the error is tensor<1x1280x3000xf16>. With tiny, it's tensor<1x384x3000xf16>, and with medium it's tensor<1x1024x3000xf16>. It also seems like a bad thing that those are f16's but the "expected" data is f32.
| Smaug123 wrote:
| I'm giving up for the night, but https://github.com/Smaug123/whisper/pull/1/files at least contains the setup instructions that may help others get to this point. Got it working on the GPU, but it's... much much slower than the CPU? Presumably due to the 'aten::repeat_interleave.self_int' CPU fallback.
| | Also hitting a nice little PyTorch bug:
| | > File "/Users/patrick/Documents/GitHub/whisper/whisper/decoding.py", line 388, in apply logits[:, self.tokenizer.encode(" ") + [self.tokenizer.eot]] = -np.inf
| | > RuntimeError: dst_.nbytes() >= dst_byte_offset INTERNAL ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Copy.mm":200, please report a bug to PyTorch.
| nik_s wrote:
| I just tested the model [1] using an RTX 3090, trying to translate a French text I found here [2].
| | Some observations:
| | - The full translation of the 6:22 minute video takes about 22 seconds (17x real time)
| | - It recognizes the language by default (and did a good job recognizing that it was French audio)
| | - MIT License [3]!
| | - The quality of the transcription is good, but not perfect.
| | - The quality of the translation (if you don't count transcription errors as translation errors) is generally very good.
| | ---
| | The transcription:
| | > Bonjour à tous, <error>j'suis</error> espère que vous allez bien, c''est ENTI. Et aujourd', <error>aujourd',</error> on se retrouve <error>un peu physique</error> pour parler de la termo dynamique. Vous ne vous inquiétez pas, ça va bien se passer. On va y aller ensemble, <error>être à par exemple,</error> je vous accompagne à travers une série de vidéos pour vous expliquer les principes de base en termo dynamique. Et bah, c''est parti, on va y aller tranquillement. Lidée, c''est vous puissiez comprendre la termo dynamique dans son ensemble. Donc, je vais vraiment prendre mon temps pour <error>couplisser</error> bien comprendre les notions,
| | The translation:
| | > Hello everyone, I hope you're doing well, it's NT and today we find ourselves a little physical to talk about the thermo dynamic. Don't worry, it's going well, we're going to go together and be the same. I'm going to accompany you through a series of videos to explain the basic principles in thermo dynamic. Well, let's go, <error>we're going to go quietly</error>.
The idea is that you can understand the thermo dynamic <error>in sound together</error>. So I'm really going to take my time to understand the notions,
| | ---
| | All in all, very happy that OpenAI is publishing their models. If Stable Diffusion is any guide, people will hack some crazy things with this.
| | [1] https://github.com/openai/whisper
| [2] https://www.youtube.com/watch?v=OFLt-KL0K7Y
| [3] https://github.com/openai/whisper/blob/main/LICENSE
| seszett wrote:
| > _dans son ensemble_
| | > _in sound together_
| | That's hilarious and, honestly, incredibly bad. "Dans son ensemble" is a very common idiom (meaning "as a whole") while "in sound together" has to be pretty rare. "Son" means "his/hers/its" as well as "sound", and the former meaning is probably more common in general, so I have no idea how this result could arise.
| | "Termo" also doesn't exist in French, it's "thermo", so the transcript even makes orthographic errors.
| | And I forgot about "couplisser", which is also a hilarious made-up word that sounds like it could mean something, but doesn't! _Edit_ Google finds exactly one reference to it, in a patent with a typo on the word "coulisser".
| | I'm still impressed by the transcript quality since it covers many languages, but the translation part is quite poor.
| StevenWaterman wrote:
| Was this with the `base` model? `large` is running ok on a P100 in Colab, but is about 4% the speed of `base.en`. Certainly seems like some of these models will be fast enough for real-time.
| joshcryer wrote:
| It also runs well on a CPU and seems to have proper memory management. Wonderful timing, because I was using DeepSpeech for some audio recordings and it required me to script up a splitter to convert the files to .wav and then do snippets of 10 seconds each. Everything about this just works out of the box. On a Core i5 I'm getting about 30 seconds every minute. Transcriptionist jobs just turned into editor jobs. I love how it drops the inflections in the audio as well, because it was trained on transcription work, and that is one of the first things you learn to do (drop the uhs and ums and huhs etc, unless it is a strictly verbatim transcription).
| solarmist wrote:
| Is it translation or transcription? Or both?
| | Both, wow. This is really interesting.
| StevenWaterman wrote:
| Both, the blog covers it in detail. Pass in audio in any language, and get an English transcription out.
| nik_s wrote:
| It can do both - I've edited my original post to show the translation task.
| gok wrote:
| Comparing this model's word error rates to the state of the art [1] on a few common test sets:
| 
|                                Whisper    SoTA
|     LibriSpeech test-clean       2.7%     1.8%
|     LibriSpeech test-other       5.6%     2.9%
|     Switchboard                 13.1%     4.9%
|     CallHome                    15.8%     9.5%
| 
| The authors do explicitly state that they're trying to do a lot of fancy new stuff here, like be multilingual, rather than pursuing just accuracy.
| | [1] https://github.com/syhw/wer_are_we
| lunixbochs wrote:
| I suspect Whisper is more robust than other "SOTA" models, but this release is likely leaving a fair bit of accuracy on the table considering the amount of resources OpenAI is capable of throwing at training it.
| | Comparing the readily available test sets from the paper to some of my personal robust models (for the Talon models, this is greedy decoding, no language model):
| 
|                           Talon   Talon   Talon   Whisper   wav2vec 2.0
|                           28M     300M    1B      Large     960h
|     librispeech clean     3.21    2.52    2.40    2.7       2.7
|     librispeech other     8.21    6.56    5.63    5.6       6.2
|     common voice          13.88   11.65   8.86    9.5       29.9
|     tedlium               7.51    6.55    5.47    4.0       10.5
| 
| I have a battery of more difficult tests on hand (including adversarial tests and diverse accent-specific metrics). I'll look at running these tests on each of the Whisper model sizes and following up with a larger comparison.
| allanrbo wrote:
| Talon was the first thing that came to my mind when I saw this news. Would be nice if it could benefit from Whisper. (Big fan of your work on Talon!)
| ma2rten wrote:
| I'm looking forward to your comparison. It's really hard to make sense of how good this model actually is without being an expert in the area.
| nshm wrote:
| It is interesting how they compare with wav2vec2 instead of NeMo Conformer (which is more accurate) in Table 2.
| StevenWaterman wrote:
| One of the things they point out is that the SoTA on e.g. LibriSpeech is _only_ good at LibriSpeech, and doesn't generalise as well.
| | > Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper's zero-shot performance across many diverse datasets we find it is much more robust and makes 50% fewer errors than those models.
| lunixbochs wrote:
| My own experience agrees: the generally available "SOTA" models are not especially robust, and can be _extremely_ bad (>50% absolute error rate) at some tasks. I'll post some preliminary numbers in a sibling comment and look into running my full set of tests on Whisper.
| | It looks like Whisper is probably leaving a lot of accuracy on the table, but initially it does seem to be a lot more robust than general "SOTA" models.
| | For a quick comparison, Silero's accuracy charts are kind of nice because they post results for a large variety of datasets. Scroll down to the EN V6 xlarge EE model (not the xlarge CE) [1]
| | [1] https://github.com/snakers4/silero-models/wiki/Quality-Bench...
| jawadch93 wrote:
| LanternLight83 wrote:
| Hoping to see this put to use in open source voice assistants, e.g. Mycroft.
| liminalsunset wrote:
| I really wish I had this about half a year ago when I was building a tool to automatically turn online school lectures into searchable, clickable transcripts (kind of like YouTube or EdX transcripts).
| | I was originally using Adobe Premiere Pro's speech to text to do it, and wrote Python to convert its output to the Hyperaudio format on GitHub. With this, I can totally skip that whole step, and this is fully open source, too.
| | App idea:
| | Build an app that takes a video and uses Hyperaudio or a similar project to add a clickable and searchable transcript (clicking in the transcript seeks the video)
| resoluteteeth wrote:
| You could already do the speech recognition in a fully open source way with Vosk easily, although Whisper may be more accurate.
| throwamon wrote:
| Is it feasible to use this for Talon-like voice-driven computer usage?
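The word error rates compared in the tables above are word-level edit distance (substitutions, insertions, and deletions) divided by the length of the reference transcript. A minimal sketch of the metric, for anyone who wants to score Whisper's output against a known-good transcript:

    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate: word-level edit distance / number of reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 substitution in 6 words, ~0.167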
| FloatArtifact wrote:
| Maybe; a number of speech recognition engines have been integrated into https://github.com/dictation-toolbox/dragonfly
| dubeye wrote:
| I know a manual transcription company which is still seeing modest growth from existing clients who also use ASR, so it's not quite there yet.
| londons_explore wrote:
| I wonder how much the 30 second window is impacting performance?
| | Anecdotally, I feel like there are plenty of times that I need context from more than 30 seconds ago to understand some technical jargon that's being discussed.
| chrisstanchak wrote:
| Hold on to your papers
| smusamashah wrote:
| How well does it do for technical and domain-oriented speech? For example, I have audio recordings of a senior explaining some very technical aspects of our software. Will it understand the technical terms in that speech?
| | I guess I will need to download it and run it on them to see how correct it is.
| emcq wrote:
| Be wary of using this model - the licensing seems sketchy. Several of the datasets used for training, like WSJ and TED-LIUM, have clear non-commercial clauses. I'm not a lawyer, but releasing a model as "MIT" seems dubious, and hopefully OpenAI has paid for the appropriate licenses during training, as they are no longer a research-only non-profit.
| nshm wrote:
| I think they didn't use WSJ for training, only for evaluation. The paper includes WSJ under "Evaluation datasets".
| jefftk wrote:
| This is a big dispute right now: OpenAI and other AI companies generally take the position that models learning from data does not make the output of the models a derivative work of that data. For example, GitHub Co-pilot uses all publicly available GitHub code regardless of license, and DALLE-2/StableDiffusion/etc use lots of non-free images. I don't think this has been challenged in court yet, and I'm very curious to see what happens when it is.
| petercooper wrote:
| I think it might be even less problematic with something like Whisper than with DALLE/SD? Merely consuming data to train a system or create an index is not usually contrary to the law (otherwise Google wouldn't exist) - it's the _publication_ of copyrighted content that's thorny (and is something you can begin to achieve with results from visual models that include the Getty Images logo, etc.)
| | I think it'd be a lot harder to make a case for an accurate audio-to-text transcription being seen to violate the copyright of any of the training material in the way a visual could.
| emcq wrote:
| This is even slightly more direct: access to WSJ data requires paying LDC for the download, and the pricing varies depending on what institution / license you're from. The cost may be a drop in the bucket compared to compute, but I don't know that these licenses are transferable to the end product. We might be a couple court cases away from finding out, but I wouldn't want to be inviting one of those cases :)
| zeagle wrote:
| It would be exceptional to get a healthy competitor to Microsoft/Nuance's Dragon monopoly on voice recognition in healthcare. At a couple thousand bucks a license and the more recent SaaS subscription trend, there is a lot of money to be made in that space.
| darkpicnic wrote:
| I just wrote a script with Hazel to automatically transcribe my voice notes to txt. It handles punctuation extremely well. What a wonderful contribution!
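That Hazel rule translates naturally into a small watch-folder script. A rough Python equivalent, with an invented folder layout; only the `load_model`/`transcribe` calls are the library's actual API:

    import pathlib
    import whisper

    notes_dir = pathlib.Path("~/VoiceNotes").expanduser()  # hypothetical folder
    model = whisper.load_model("base")

    # Transcribe any voice note that doesn't have a .txt next to it yet
    for audio in sorted(notes_dir.glob("*.m4a")):
        target = audio.with_suffix(".txt")
        if target.exists():
            continue
        result = model.transcribe(str(audio))
        target.write_text(result["text"].strip() + "\n", encoding="utf-8")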
| abidlabs wrote: | Here [1] is a video tutorial on building a web UI that accepts | microphone input and runs it through Whisper for speech | transcription | | [1] | https://www.youtube.com/watch?v=ywIyc8l1K1Q&ab_channel=1litt... | amrrs wrote: | Thank you for sharing! ___________________________________________________________________ (page generated 2022-09-21 23:00 UTC)