[HN Gopher] Google open-sources the Lyra audio codec ___________________________________________________________________ Google open-sources the Lyra audio codec Author : chmaynard Score : 137 points Date : 2021-04-06 16:20 UTC (6 hours ago) (HTM) web link (opensource.googleblog.com) (TXT) w3m dump (opensource.googleblog.com) | Thaxll wrote: | Can't wait to try that in ffmpeg! | unixhero wrote: | I used to be fond of Google products. | chintan wrote: | Now what could go wrong? 4 dots becomes 3 dots...[1] | | 1). Silicon Valley - Finale S6E7 | https://www.youtube.com/watch?v=48Y77jSSHGU | ncmncm wrote: | A more useful system would take Opus-compressed data as input and | feature-extract that, presumably faster than this thing. Bonus | for not requiring a proprietary library like | libsparse_inference.so. | | Also, instead of encoding independent 40ms segments, it should be | much better to encode 10ms segments given the previous 30ms. | ProAm wrote: | Isn't Lyra the name of Facebook's cryptocurrency as well? I | cannot remember if that project was shelved. | mgraczyk wrote: | Are you thinking of Libra, now called Diem? | | https://en.wikipedia.org/wiki/Diem_(digital_currency) | ProAm wrote: | Yes I was. Thanks, I had them mixed up. | rektide wrote: | There's huge wins but the grandiosity of "enabling voice calls" | is grating. I don't think this will open many users to voice | communication. It will reduce data-costs in a way that has an | impact on a significant amount of people's bottom line. But I | feel manipulated with the current headline, and by the long | extended lack of ability to mix the very real hope with some | measure of humility. | [deleted] | p1mrx wrote: | This seems kind of unnecessary, compared to Opus at ~10 kbps. If | you're sending IPv6+UDP in 40 ms chunks, that's 9.6 kbps just | from the packet headers (25 Hz * 40+8 bytes). | | When the voice payload is smaller than the packet headers, you're | well into diminishing returns territory. 
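p1mrx's header arithmetic can be checked with a short sketch. It assumes one packet per 40 ms frame, a bare 40-byte IPv6 header plus an 8-byte UDP header, and a 3 kbps codec payload; the function names are hypothetical, and RTP or link-layer framing is ignored, so real overhead would be somewhat higher.

```python
IPV6_HEADER_BYTES = 40  # fixed IPv6 header, no extension headers
UDP_HEADER_BYTES = 8


def header_overhead_bps(frame_ms, header_bytes=IPV6_HEADER_BYTES + UDP_HEADER_BYTES):
    """Bits per second spent on packet headers when one packet is sent per frame."""
    packets_per_second = 1000 / frame_ms
    return packets_per_second * header_bytes * 8


def payload_bytes_per_packet(codec_bps, frame_ms):
    """Codec payload carried by each packet, in bytes."""
    return codec_bps * frame_ms / 1000 / 8


if __name__ == "__main__":
    print("header overhead (bps):", header_overhead_bps(40))
    print("payload per packet (bytes):", payload_bytes_per_packet(3000, 40))
```

Under these assumptions each packet carries 48 bytes of headers around 15 bytes of speech, which is the imbalance the comment is pointing at.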
| posguy wrote: | Opus at 8Kbps sounds better, and commodity, inexpensive | hardware like the Grandstream HT802 Analog Telephone Adapter | supports this codec today (along with any cheap Android phone). | | Lyra as it stands today will not support anything outside of | x86-64 and ARM64 without rewriting the proprietary kernel it | relies on. | pjc50 wrote: | One thing I'm slightly worried about "machine learning" in | compression rather than conventional everything-is-sines | mathematical approaches is the possibility of odd nonlinear | errors. Remember the photocopier that worked by OCR and would | occasionally mis-transcribe numbers? | | I don't mind compressing a phoneme to <unintelligible> as much as | I would mind it compressing it to a clearly audible _different_ | phoneme. | 1-6 wrote: | One day, voice cloning may become so powerful that only word | data and intonations will become part of the datastream. There | could be various 'layers' in which encodes/decodes can occur. | Voice Cloning would be at the very top of the stack. | minikites wrote: | >Remember the photocopier that worked by OCR and would | occasionally mis-transcribe numbers? | | For those who don't remember: | http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_... | smnrchrds wrote: | This one too: | | https://petapixel.com/2020/08/17/gigapixel-ai-accidentally-a... | tyingq wrote: | >photocopier that worked by OCR | | The interesting bit was that it wasn't supposed to work by | OCR...that had been deliberately turned off. The compression | was too clever. | cbhl wrote: | [disclaimer: Personal opinion, not that of my employer.] | | I had a coworker play me a before/after of an early version of | the codec "babbling" and it was definitely uncanny valley. It | looks like some work has been done on the problem since then. | | The second paper linked in the README.md of the repo talks | about a few strategies to reduce 'babbling' or | 'babble'. 
For your reference, here's the citation and the link | to the PDF. | | Denton, T., Luebs, A., Lim, F. S., Storus, A., Yeh, H., Kleijn, | W. B., & Skoglund, J. (2021). Handling Background Noise in | Neural Speech Generation. arXiv preprint arXiv:2102.11906. | | https://arxiv.org/pdf/2102.11906.pdf | rexreed wrote: | The OCR issue was the first thing I thought about. Machine | learning is probabilistic, not deterministic, so in the case of | S being converted to 5 (or 6 to 8, etc.), which definitely | impacts numerical data in the case of the OCR stuff, we can | expect similar voice mis-classifications. Perhaps "You're fine" | might get misclassified as "you're fired". | thaumasiotes wrote: | > Remember the photocopier that worked by OCR and would | occasionally mis-transcribe numbers? | | That was perfectly ordinary compression? | | The phenomenon is all over the place, most visible in | autocorrect. | bayindirh wrote: | It was ordinary compression, something called JBIG2. It did | not mistranscribe, but marked slightly different number or | character blocks as the same, resulting in replaced parts in images. | | In other words, its match tolerance is a bit too lax, so it | gets poisoned by blocks in its own dictionary, thinking it | already has the blocks for things it had just scanned. | | More details can be found in [0] and [1]. | | [0]: https://www.theregister.com/2013/08/06/xerox_copier_flaw_mea... | | [1]: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...? | Wowfunhappy wrote: | Yes! This is why I always turn off autocorrect! It's true | that I absolutely make more typos without it, but at least | they're obvious as typos, and not different words that | potentially change the meaning of the sentence. | beagle3 wrote: | Are you aware that the same exact uncompressed recording sounds | different depending on context? This is known as the McGurk | effect. 
| | Very worth your two minutes if you're not yet familiar with the | effect: https://www.youtube.com/watch?v=2k8fHR9jKVM | dqv wrote: | This already happens with existing compression algorithms. | Certain vowel sounds get collapsed, so someone will say, for | example, "66" and it will come out on the other side as "6". | Very annoying because you can't exactly coach a layperson on | how to talk "the right way" to not trigger this vowel collapse. | bobthechef wrote: | > you can't exactly coach a layperson on how to talk "the | right way" to not trigger this vowel collapse | | I've never noticed. At any rate, we should not coach people | to adapt to technology in this way. It is Procrustean and | anti-human and unnecessarily places a burden on people that | belongs to the software and the developer. | nyanpasu64 wrote: | For what it's worth, amateur radio operators already have | specialized rules and techniques for speech, to improve | clarity over a muffled noisy analog radio channel. | tyingq wrote: | > how to talk "the right way" | | Not suggesting it as a fix, but this did remind me of the | military phonetic alphabet, which includes numbers too. | | 3 is "tree", 4 is "fow er", 5 is "fife", 9 is "niner". The | rest of the numbers are mostly as-is, but you'll hear very | deliberate enunciation, like "Zee Row" for 0. | WaitWaitWha wrote: | whiskey hotel yankee delta oscar india hotel alpha victor | echo tango oscar sierra papa echo alpha kilo tango hotel | echo lima alpha november golf uniform alpha golf echo oscar | foxtrot tango hotel echo mike alpha charlie hotel india | november echo ? tango hotel alpha tango india sierra india | november sierra alpha november echo! | tyingq wrote: | | perl -pe 's/(\w)\w+/\1/g' | toast0 wrote: | Humans adapt a whole hell of a lot easier than machines. | | Sure, it would be nice to have clean high bandwidth, low | latency voice channels to everywhere so you could drop | pins and expect the other side to hear it. 
Unfortunately, | high bandwidth never really happened, and some places | never ran land lines to everyone's home, and nobody wants | to pay the high price of circuit switched voice when | packet switched voice mostly works good enough and is | enormously cheaper. | posguy wrote: | But is Lyra a significant improvement over modern Opus at | 8Kbps? You can buy a Grandstream HT802 for ~$30 and its | DSP can decode Opus today, whereas Lyra will require | orders of magnitude more power to decode while providing | much worse reproduction accuracy. | est31 wrote: | Back when Lyra was announced [0], I listened to the released | samples and it changed an "m" sound to an "l" sound. | | [0]: https://news.ycombinator.com/item?id=26309553 | jagger27 wrote: | I find that Lyra sounds good at first but it can chop off hard | consonants in certain scenarios. It sort of sounds like slightly | slurred speech. Anyone else getting that impression from their | samples? | ent101 wrote: | Discontinued in 3... 2... 1... | devops000 wrote: | Give them at least 3-4 years ;) | squarefoot wrote: | Doesn't seem that better compared to Codec2 which is already | fully Open Source (LGPL), even taking into account that Codec2's | examples originals are already of much worse quality than the | ones on Lyra's website. I'd be curious to hear both working on | the same set of audio samples. | | https://www.rowetel.com/?page_id=452 | Seirdy wrote: | Agreed; codec2 doesn't alter speech as aggressively, require | proprietary components, or have as strong a connection to | Google. | devops000 wrote: | 404: NOT FOUND | | https://basis-universal-webgl.vercel.app/texture/ | | Where else I can see a demo? 
| dang wrote: | Recent past threads on this: | | _Lyra audio codec enables high-quality voice calls at 3 kbps | bitrate_ - https://news.ycombinator.com/item?id=26300229 - March | 2021 (198 comments) | | _Lyra: A New Very Low-Bitrate Codec for Speech Compression_ - | https://news.ycombinator.com/item?id=26279891 - Feb 2021 (25 | comments) | | Is there significant new information here? | https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so... | | Edit: it seems the SNI is the open-sourcing. I've changed the | title to say that now. Corporate press releases are generally an | exception to HN's rules about titles and original sources: | https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor.... | miohtama wrote: | If I remember correctly the original landline audio was 64 kbps, | 8000 Hz. So Lyra is about 1/20 of this. And probably still sounds | better. | posguy wrote: | PCMU/PCMA (G.711 mu-law and A-law) are not original landline | quality audio, but rather what the Bell System felt they could | get away with passing off as a toll quality call in 1972. | | Lyra will likely sound better, but the reproduction accuracy | is apt to be quite a bit poorer as many others have | commented. G.711 was created to require nearly no processing | (it's nearly raw PCM data from a sound card after all) while | operating at reasonable bitrates; Lyra looks much more | computationally intensive and will likely only run on | smartphones in the next few years. | | Edit: Is Lyra a significant improvement over modern Opus at | 8Kbps? You can buy a Grandstream HT802 analog telephone | adapter for ~$30 and its DSP can decode Opus today, whereas | Lyra will require orders of magnitude more power to decode | while providing much worse reproduction accuracy. | YarickR2 wrote: | Original landline audio was/is analog | [deleted] | colanderman wrote: | For reference, analog landlines specified 24 dB SNR [1] and | 300-3300 Hz passband [2], giving ~24 kbps information rate by the | Shannon-Hartley theorem [3]. 
| | [1] https://www.tschmidt.com/writings/POTS_Modem_Impairments.htm | | [2] https://en.wikipedia.org/wiki/Plain_old_telephone_service | | [3] https://en.wikipedia.org/wiki/Shannon%E2%80%93Hartley_theore... | sandGorgon wrote: | Is the training code open source? | levosmetalo wrote: | I hope this never takes off. | | There's this whole machine learning, optimization, etc. story, but the end | goal is that Google can easily transcribe your voice calls and | store them as text. Then it can apply all the shady practices that | were previously too expensive because storing voice and | extracting information from it required huge storage costs and | actual human labour. | | Or worse, just imagine what some government you don't trust could | do with all those voice call transcripts. | rektide wrote: | This will make voices radically more correlatable, most likely. | It's a more effective model for voice, it has run endless | regressions & found better patterns to model human sounds upon. | That could well make processing & comparing pieces of speech | data less computationally expensive. | | I don't see much relation to surveillance & transcription | issues. This technology does not, would not change the field of | battle significantly, if such a battle were about. Which it | probably is, in some countries, perhaps even applying to | Google-touched, -relayed, or Google-held data. | sreekotay wrote: | I mean...this is them open sourcing it? | mgraczyk wrote: | This codec has nothing to do with what you're worried about. | There's no current technical limitation preventing what you're | describing. Google doesn't do it because it makes no sense for | their business and because your phone calls aren't routed | through Google's servers. Governments outside the US are | already doing it. 
| cbdumas wrote: | Anyone listening to the sample audio linked to in the article | should read this note from the last time this was discussed on | HN: https://news.ycombinator.com/item?id=26309787 | | Summary: the Lyra audio samples are louder which muddies the | comparison | pessimizer wrote: | I've been waiting for an audio codec that could actually silently | change the words I've said. | | https://www.zdnet.com/article/xerox-scanners-alter-numbers-i... | plzbo wrote: | Link to the website of the person who found the error in the | first place: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_... | sreekotay wrote: | In practical terms, very impressive. Anyone know what latency is | like? Feels like a domain where people who have not experienced low | latency full duplex cannot fully appreciate why voice has faded | in everyday life... | Ajedi32 wrote: | Sounds like at least +40ms of latency: | | > features, are extracted in chunks of 40ms, then compressed | and sent over the network | markstos wrote: | Notable that the two post authors sign it with " - Chrome", | indicating I presume they are Chrome team members. | londons_explore wrote: | Google misses the mark here... | | Bad internet connectivity in the developing world isn't "only | 56kbps" as some people think. | | It's "random bursts of fast with random 30 second gaps of no | connectivity at all". It's routed through 3 layers of proxies and | firewalls which block random stuff and not others, while | disconnecting long running connections. | | Oh, and it'll be expensive per MB. | | To that end, Lyra helps with the expense of a data connection, | but is unusable for long voice calls. What would help more is a | text chat system like WhatsApp. | | Oh right - WhatsApp is already wildly popular in most of the | developing world for mostly this reason. | herodoturtle wrote: | Heya, please could you unpack your reasoning a little bit more? 
| | You said: | | > WhatsApp is already wildly popular in most of the developing | world for mostly this reason. | | I can't speak for the majority of the developing world, but | here in South Africa, WhatsApp is indeed the predominant | communications app. | | That being said, WhatsApp voice calls are also used here quite | a bit. | | So with that in mind, and reading from the article: | | > Lyra compresses raw audio down to 3kbps for quality that | compares favourably to other codecs | | To me 3kbps sounds pretty great, and might actually work out | cheaper / better than one might imagine. | | So I'm just wondering, how does WhatsApp voice call data usage | compare to Lyra? | | Also whilst South Africa is indeed a developing country (where, | among other things, the price of data is proportionately high | relative to average household income), the cellular network | infrastructure is excellent. | | So I don't think the random bursts of connectivity you describe | are as big of an issue here, whereas the price of data most | certainly is. | | In which case, I can definitely see a market for Lyra (assuming | the 3kbps is indeed vastly superior to WhatsApp's data usage | for a voice call). | | Hope that makes sense but I'd be happy to extrapolate a little | further :-) | villasv wrote: | > Oh right - WhatsApp is already wildly popular in most of the | developing world for mostly this reason. | | Not only that, but carriers will often advertise plans with | "unlimited Internet for Facebook and WhatsApp" (a punch in the | face of net neutrality). | | So not only WhatsApp has more impact with audio messages when | audio calls are too unstable, audio calls already substitute | the bulk of phone calls even for people who have shitty data | plans. | | This is what my carrier says on their most basic offering: | | > What does WhatsApp Unlimited mean? | | > The benefit is granted automatically, without the need for | activation. 
And the use of the app is unlimited to send | messages, audios, photos, videos, in addition to making voice | calls. Only video calls that are discounted from the internet | package, as well as access to external links. | setr wrote: | In the Middle East I noticed a baffling-to-me usage of | WhatsApp: people were simply exchanging voice messages back and | forth instead of calling. [0] | | Presumably for exactly the reason you've stated. | | [0] I later tried it myself with a friend, but you end up | losing the benefits of both worlds -- you can't search or | review old messages effectively (as you would text), and it's | significantly slower than calling. | throwaway81523 wrote: | Another reason for end-to-end speech encryption: to keep your | cleartext voice signal away from these overaggressive codecs | changing the words. I can understand the need for a super low | bandwidth codec on top of Mt. Everest, but 64 kbit PCM was good | enough for our grandparents' landlines (or 13 kbit GSM for their | mobiles) and it's good enough for us. | LeoPanthera wrote: | What a spectacular failure of imagination. Why change anything | ever, right? I suppose dial-up modems were good enough for you | too. | | Everyone is imagining that codecs like this will "change your | words" but no-one has provided examples of that _actually | happening_. I don't believe it. | robert_foss wrote: | Encoding takes >40ms? Opus takes 5-26.5ms. Apparently 150ms[1] is | the generally accepted upper bound for call latency. | | I think the article could do with some | bandwidth/quality/latency/power comparisons to other codecs. | | [1] https://en.wikipedia.org/wiki/Latency_(audio) | Bedon292 wrote: | I don't think it is discussing encoding time in the article, it | says "features are extracted in chunks of 40ms". My reading is | that it's breaking down the speech into 40ms chunks, compressing | them, and sending them. 
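A minimal sketch of the chunking Bedon292 describes, assuming 16 kHz mono input (the sample rate is an assumption, not something stated in this thread) and hypothetical helper names. It does not use Lyra's actual API; it only shows that an encoder working on 40 ms frames has to hold back audio until a full frame is available.

```python
def frame_samples(sample_rate_hz, frame_ms):
    """Number of samples in one frame of the given duration."""
    return sample_rate_hz * frame_ms // 1000


def split_into_frames(samples, frame_len):
    """Yield complete frames only; a trailing partial frame stays buffered."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        yield samples[start:start + frame_len]


if __name__ == "__main__":
    frame_len = frame_samples(16_000, 40)  # assumed 16 kHz input
    audio = [0] * 1500                     # roughly 94 ms of silence
    frames = list(split_into_frames(audio, frame_len))
    print(frame_len, "samples per frame;", len(frames), "complete frames")
```

Because nothing can be emitted until the first frame is full, the frame duration is a floor on the delay the codec adds, independent of how fast the encoder itself runs.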
| 0b01 wrote: | But since the buffer size has to be 40ms, the minimum | latency is 40ms | ksec wrote: | >These speech attributes, also called features, are extracted | in chunks of 40ms, then compressed and sent over the network. | | So while Encoding doesn't take 40ms, the latency + encoding | will indeed be 40ms+. | | 150ms is the End to End Latency, which is basically everything | from Encoding + Network + Decoding. We can't beat the speed of | light on our fibre network. We can certainly do something with | Encoding and Decoding. And Lyra doesn't seem to help with that | case here. Something I pointed out last time Lyra was on HN. | | I think Opus defaults to 20ms with the option of a 10ms slot | (excluding Encoding speed) at the expense of quality. What we | really need is a higher bitrate, lower latency and higher quality | codec. Which is sort of the exact opposite of what Lyra is | offering. | RL_Quine wrote: | > _We can't beat the speed of light on our fibre network._ | | Speed of light in what? We can absolutely be faster than | fibre optics, which are quite slow relatively speaking | (2/3rds that of light in a vacuum). | ksec wrote: | We won't be replacing Glass Fibre with Vacuum Fibre anytime | soon. And I have been following this tech for a long time, but I | do wish I were very wrong. | BlueTemplar wrote: | Starlink? | tymekpavel wrote: | Satellite links are orders of magnitude slower than | fiber. | blendergeek wrote: | > Satellite links are orders of magnitude slower than | fiber. | | Minimum end-to-end latency for communications from | opposite points of the earth is much lower for Starlink | style LEO satellites than for fiber. | ksec wrote: | Which is only in the case of "opposite points of the | earth", otherwise you are just adding ~700 km of distance | between two points. The point is even if we have perfect | Speed of light Data Transfer over a direct line, we are | fundamentally limited by it and nothing can be done. 
But | Encoding, Decoding, Time Slots and quality are everything | that we have control of and should be looked into more | seriously. | BlueTemplar wrote: | Aren't they still heavily expected to feature in | connecting that "next billion"? | tymekpavel wrote: | Yes, because they are convenient for other reasons (don't | require infrastructure over land) which makes them | suitable for connecting rural areas where it doesn't make | sense to run fiber. But fiber will always be the fastest | you can get, and if you get fiber in a vacuum, you could | theoretically achieve near-speed-of-light communication. | Satellites won't get you anywhere close to that, even if | you use lasers, because there are always atmospheric | disturbances that introduce latency. | wmf wrote: | Internet latency is much higher than it could be, even | using fiber: https://arxiv.org/abs/1811.10737 | | And adding an HFT-style microwave backbone could reduce | Internet latency even more: | https://arxiv.org/abs/1809.10897 | azinman2 wrote: | Ya I was just coming here to say the same thing. 40ms _just in | the codec_ feels like a lot. Because that's not even including | time to pull in audio from the hardware (could be 20ms or more | in Android devices), time to upload, and time to get it across | the Internet, and then time to decode + play on the receiver. | That adds up pretty quickly. I'm guessing 40ms was chosen | because it is some sweet spot of having enough data to get a | worthwhile compression on, but it's one of these things where | technology, however impressive it might be, is slowly giving us | a worse experience over time in the pursuit of digitization. | robert_foss wrote: | From my understanding the 40ms is just the feature extraction | part. The encoding also does quantization, which surely adds | to this number. | stefan_ wrote: | The favorite way to cheat compression contests. Buffer more | data, get more compression. 
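The latency budget argued over in this subthread can be put in rough numbers. The sketch below takes the 150 ms end-to-end bound, the 40 ms frame, and the "2/3 of c" figure for fibre directly from the comments above; the function names and the 10,000 km example distance are hypothetical, and capture, encode, decode, jitter-buffer, and routing delays are all left out.

```python
C_VACUUM_KM_PER_S = 299_792.458
FIBER_VELOCITY_FACTOR = 2 / 3  # figure quoted in the thread for light in fibre


def propagation_ms(distance_km, velocity_factor=1.0):
    """One-way propagation delay over a straight path, in milliseconds."""
    return distance_km / (C_VACUUM_KM_PER_S * velocity_factor) * 1000


def remaining_budget_ms(distance_km, frame_ms, budget_ms=150.0,
                        velocity_factor=FIBER_VELOCITY_FACTOR):
    """What is left of the end-to-end budget after framing and propagation."""
    return budget_ms - frame_ms - propagation_ms(distance_km, velocity_factor)


if __name__ == "__main__":
    for frame_ms in (10, 20, 40):
        left = remaining_budget_ms(10_000, frame_ms)
        print(f"{frame_ms} ms frames over 10,000 km of fibre: {left:.1f} ms left")
```

On these assumptions, propagation is fixed by distance, so frame size is one of the few terms in the budget a codec designer can actually change.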
| te0006 wrote: | "Please note that there is a closed-source kernel used for math | operations that is linked via a shared object called | libsparse_inference.so. We provide the libsparse_inference.so | library to be linked, but are unable to provide source for it. | This is the reason that a specific toolchain/compiler is | required." - README | ncmncm wrote: | [update: proprietary .so] | | They should re-implement the needed bits of libsparse_inference | before releasing this thing. Otherwise it's just a distraction. | | Probably they should get it building with something other than | Bazel, too. | danaliv wrote: | It's not a kernel module, it's a compute kernel. Nothing to | do with operating systems. They provide versions for android-arm64 | and linux-x86_64. | | The fine README says it builds and runs on Ubuntu 20.04. | posguy wrote: | Ah, so Lyra today will not work on RISC-V, i386, Power, | MIPS, lower end or older ARM chips like the Allwinner H3 | (very popular in Single Board Computers) and any other new | architecture that comes out? | skybrian wrote: | Yes, that will have to be removed as part of the effort of | porting it to new platforms. | jacobn wrote: | Sounds like NVIDIA's Maxine [1], but for voice? | | 1: https://developer.nvidia.com/maxine | flohofwoe wrote: | Why is the demo link towards the bottom of the post pointing to | the Basis Universal repository (which is a texture compressor)? | | https://github.com/BinomialLLC/basis_universal/tree/master/w... | | Copy-pasta error, or did they run the post through Lyra? ;) | astlouis44 wrote: | This is going to be VERY useful for WebXR social platforms. ___________________________________________________________________ (page generated 2021-04-06 23:00 UTC)