[HN Gopher] The Seamless Communication models ___________________________________________________________________ The Seamless Communication models Author : skadamat Score : 537 points Date : 2023-12-01 14:53 UTC (8 hours ago) (HTM) web link (ai.meta.com) (TXT) w3m dump (ai.meta.com) | infotainment wrote: | It's amazing how far text to speech has come in the past few | years, but what I'm wondering is when this tech will finally make | it into local TTS engines baked into the OS (eg for screen | readers, etc) | PartiallyTyped wrote: | The accessibility nerd in me is excited! | callalex wrote: | This is already built into recent iOS devices and it's called | Live Captions. | freedomben wrote: | Same with Android (Pixel phones at least). | | I'm the most excited for an open source one though, and it | would be incredible if this could become it. I do 95% of my | compute on desktop linux and it sucks being behind. | coffeebeqn wrote: | We can't be that far off from almost perfect real-time | translation. There is some latency of course to hear and process | mrob wrote: | Differences in verb-subject-object word order will always add | latency. If you want to translate from German, with the verb at | the end, to Welsh, where the verb goes at the start, you'll | have to wait for the complete sentence before you can begin. | tralarpa wrote: | It's very impressive what simultaneous interpreters can do. | They don't wait for the end of the sentence. | numpad0 wrote: | Yeah they backtrack on branch prediction failures. | dylan604 wrote: | What kind of heartbleed that must introduce. | Vecr wrote: | You mean meltdown/Spectre? | dylan604 wrote: | probably, but you got the gist anyways | MrsPeaches wrote: | Even they struggle with jokes though. | | This may be apocryphal but I've heard that in formal | settings (e.g. UN) they won't translate it and will instead | give instruction on when to laugh. 
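The branch-prediction analogy above can be sketched in code: a streaming translator that speculatively emits output before the sentence-final verb arrives must be able to retract and re-emit when its guess turns out wrong. A minimal toy sketch, where the `predict` and `translate` callbacks are hypothetical stand-ins for real models:

```python
def speculative_stream(tokens, predict, translate):
    """Yield ('emit', text) / ('retract', text) events for a stream of
    source tokens, retracting speculative output on mispredictions."""
    seen = []
    committed = None
    for tok in tokens:
        seen.append(tok)
        guess = predict(seen)            # e.g. guess the still-unheard verb
        if guess is None:
            continue                     # not enough context to speculate yet
        out = translate(seen, guess)
        if committed is None:
            committed = out
            yield ("emit", out)
        elif out != committed:           # branch misprediction: backtrack
            yield ("retract", committed)
            committed = out
            yield ("emit", out)
    final = translate(seen, None)        # sentence complete, no guessing needed
    if final != committed:
        if committed is not None:
            yield ("retract", committed)
        yield ("emit", final)
```

A real system would weigh the cost of each retraction (distracting in live subtitles, impossible once synthesized audio has already been spoken) against the latency saved, which is exactly the trade-off the comments above are debating.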
| d3m0t3p wrote: | Not necessarily true: for the first few sentences you won't | be able to do it. But afterwards, once the context is | established, you don't really need to wait for the verb; you | can predict it. For example, if you are speaking about | cleaning the house and you detail that you have cleaned the | kitchen, the stove and so on, you can predict the verb with | only the start of the sentence. I don't have any source to | back this up, but it sounds plausible | gberger wrote: | What if the predicted verb was incorrect, but the model has | already translated the incorrect prediction? How does it | tell you about a mistake? | mrandish wrote: | A good approach might be to start with how top-notch, | ultra-experienced human translators handle corrections | for real-time scenarios, for example, the expert | translators that do the ear monitors at the United | Nations. I've worked with a few such real-time | translators when preparing keynote speeches and they seem | to have rigorous processes that appear quite deep. | Probably a ton of domain expertise to be captured there. | | That said, I suspect that real-time language translation | is always going to be somewhat imperfect due to its | nature. Non-real-time translation of literature is still | a subjective art form even at the very high end of human | expertise. | shkkmo wrote: | Once you start predicting what someone is going to say, you | are no longer translating their speech | Teever wrote: | Yeah, but then you're just introducing branch mispredictions | which will cause latency and potential confusion down the | line. | | It's all a trade-off. | | Either way it's extremely exciting that we get to even | discuss this stuff as real possibilities. | Innervisio wrote: | Although true and considering what "mrob" had also replied, | this will never mean full translation every time, all the | time. This will work with specific environments and | linguistic expectations. 
| | I've been learning german since 8 years, and the amount of | expressions and different ways to say things around the country | is impressive. There'll be a "interpretative" real-time | translation, but it won't guarantee fully understanding in so | many cases, maybe ever. | | Other thing, and we have this in common with all languages, is | the context and this is difficult to address i believe. | | Nevertheless, it's impressive how far we've reached and i | acknowledge the usability of these tools. However, human | knowledge will be always crucial and primordial if we want to | guarantee full understanding. | InCityDreams wrote: | >I've been learning german since 8 years, | | "Since", as used here, would lead me to guess you are not a | native English speaker? | WhatsName wrote: | Did anyone compare this to nllb (also meta) yet? | trovas wrote: | in the paper, the results reported show very similar level of | quality | jkw wrote: | We're the same team! We have some comparisons in the paper. | ukuina wrote: | Next step is combining the output with few-sample speech | synthesis so the output is in the original speaker's voice! | modeless wrote: | This does that already. At least, to a first approximation. | Voice cloning is not that great in general right now. | blovescoffee wrote: | The voice cloning worked pretty well for me. From english to | spanish I noticed that the first few words sounded more like | me than the last few words. Also it doesn't sound like how I | speak in spanish but that's expected. | coffeebeqn wrote: | Voice cloning works pretty well already but not necessarily | on one 10 sec sample as the source data. If you can give it | some hours of data it'll work much better | modeless wrote: | Do you have examples of it working well? I haven't heard | anything that really impressed me. Nothing close to a good | human impersonator. We're a long, long way from replacing | voice actors, even considering the rapid rate of progress. 
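The model-quality comparisons mentioned above are typically reported as BLEU scores, which measure n-gram overlap between a system translation and a reference. A toy stdlib-only sentence-level sketch, just to show the shape of the metric; real evaluations use a standard tool such as sacrebleu, which adds tokenization and corpus-level statistics:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # clip each candidate n-gram count by its count in the reference
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A perfect match scores 1.0; a correct but truncated translation is pulled down by the brevity penalty, which is why BLEU deltas like the ones quoted in this thread summarize both adequacy and length behavior.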
| kaycebasques wrote: | Besides the obvious good news about making it easier for people | to communicate with each other across languages, it's also | exciting to me that we're trending towards a world where I can | tap into all the knowledge that only exists on the non-English | web. I'm sure there are vast troves of programming knowledge in | the Japanese-only web for example. The Chinese-only and Russian- | only web are obvious candidates too but presumably those are | harder to access for other reasons. | nickreese wrote: | My wife was training to be a professional voice actor to do | dubbing in several languages when we met. | | I told her then that the industry would be disrupted by AI before | she retired. | | Glad she pivoted. Really impressive results. | 0_____0 wrote: | It won't replace high-end talent, I don't think models can | replicate the nuance for a long time, however the entire low- | to-mid end of the market is going to get nuked from low earth | orbit | Shish2k wrote: | I wonder which will happen first - AI evolves to work well at | the high-end, or high-end humans retire and there's nobody | left in the low-to-mid end to fill their shoes... | callalex wrote: | Given the modern trend of on-screen actors doing voice | work, I think there will be a supply of talent for at least | a few more generations. | crakenzak wrote: | It will absolutely replace high-end talent. Anything that a | human can do will be able to be done 10x better by a model -- | especially in such a narrow and well defined domain. | sushisource wrote: | Did you hear the output examples? Yeah, I think not. I | mean, definitely on the way, but there's no way if you need | quality acting in your dub that you're going with this. | ygjb wrote: | These are models specially tuned and sized for near real- | time, instant translation. 
It would be naive to think | that there aren't technical creatives building and | training models tuned for expressiveness and nuance in a | more controlled environment. | crakenzak wrote: | Maybe not in the current state of the model, but judging | by the rate of improvement we're all seeing it's just a | matter of time (and data+compute+research obv). | dvngnt_ wrote: | i think the key word is will. | | a few more years of improvements if they happen could be | disruptive | dontupvoteme wrote: | That's what they gave us plebs. To think they don't have | a superior one they can sell... | chrismorgan wrote: | It won't _replace_ it, but it's very likely to _supplant_ it, | just about destroying the segment by reducing demand by being | _good enough_ and so much cheaper, especially as people get | more used to it. | | Typesetting. Music engraving. Bookbinding. The quality of all | these fields have been materially harmed by advancements. | | Computer typesetting has, by and large, been a significant | regression, though the gap has largely been made up now if | you make the right choices. | | Published music scores used to be set by experts. Now they're | set by novices using software that is mechanical in method | and generally quite insipid. Most are _atrocious_ compared to | the old masters, and mediocre at best compared to the typical | published scores from a hundred years ago; and very few | popular scores are really good (... and if they are, there's | a reasonably high chance they've used GNU LilyPond, which has | focused on this problem). But the barrier for entry is _so | much lower_ , and people have got used to the inferior | results, so I don't know if _anyone_ engraves music the old | way, and even people that know better largely just shrug and | make do with the new. Like with computer typesetting, there | is hope because things _have_ slowly improved. But most will | continue to be mediocre. | | Books used to be bound with cold glue. 
It takes time to set, | but the results are very good, supple and long-lasting. Then | along came hot-melt glue, and it's just _so_ much friendlier | for cheap manufacturing because books are finished within a | few minutes instead of a day or two, that I don't think | _anyone_ produces books the old way any more, even though the | results are _abysmal_ in comparison (compare the binding _and | reading experience_ of a paperback from the '40s or '50s with | one from the turn of the century; no one after tasting the | old will desire the new; for he says, the old is good). But | they're just (barely) good enough. Unlike the other two, I | don't think there's any hope here--the regressive advancement | crowded out the superior but dearer option so that no place | was found for it. | pclmulqdq wrote: | You can still get relatively good published music scores | from a few of the old German shops (Schirmer, Henle, etc.), | but they are very expensive. They are a joy to use when | playing, though, since the music is very clearly laid out | and page turns are in the perfect place, etc. Finale and | Sibelius are controllable enough that you can use them to | do fantastic layout, but many people either do not | understand how to make a score readable or don't care | enough. | TeMPOraL wrote: | That, and what GP describes, is what I see as the overall | trend of the market to hollow out the middle. It's not | just about technology (though it plays a big role), as | all optimization coming from competitive pressure - | materials, processes, business models, marketing. | | What seems to universally happen is, the market | bifurcates - one part is in a race to the bottom, the | other (much smaller) aims for super premium tier | (overpriced quality), because only those two positions | are sustainable, once the race-to-the-bottom side drags | all the economies of scale with it. 
So as a consumer, you | get to choose between cheap low-quality garbage that's | barely fit for purpose, and rare, super-expensive, | professional/elite high-end products. There is no option | for "good value for reasonable price". | | This has been happening to everything - software, | furniture, construction, electronics, vehicles, _food_, | you name it. | RowanH wrote: | I'm using AI for training videos for my startup. Never going | back to voice actors outside of primary marketing videos. | The sheer convenience of the write/listen/tweak cycle on scripts | is insane. In minutes you can do a voiceover which would | have taken hours plus days of delay before. | | Sure, the final result sounds slightly robotic. 99% of people | wouldn't care, and you can get more training videos done, | faster, for a fraction of the cost. | | [Edit] And I'll add that the difference between 6 months ago | and today is noticeable. I imagine every 6 months we can just | re-download updated voiceovers, and each time they will sound | just slightly more polished. | ggregoire wrote: | > I told her then that the industry would be disrupted by AI | before she retired. | | Yes. I just discovered there is a text-to-speech addon [1] (now | a few months old) for World of Warcraft that adds voices for | every NPC in the game... It is so impressive and such a game | changer (pun intended) that I naively asked in the chat of the | Twitch stream I was watching "when did Blizzard add voices to the | NPCs??". For an instant I really thought Blizzard contracted | actors, but no, someone like you and me just used AI to | generate realistic voices for every character in the game. I | don't think it's ready yet to completely replace actors in | video games (surely it will in the near future tho) but voice | acting is something so expensive to do that I can see studios | and developers in 2024 already using this tech for all the | optional dialogues and secondary characters' voices. 
| | [1] https://www.curseforge.com/wow/addons/voiceover | lyu07282 wrote: | Another recent example: The Finals uses AI voice generation | for real-time game announcements | | https://youtu.be/kZ87wiHps9s | freedomben wrote: | I've wondered at what point this would happen. I think it | could now, but from what I've read the voice actor unions are | able to prevent it currently (at least for AAA games or | non-indie devs). Many of them have agreements/contracts in place | for the foreseeable future, and being the first big company | to replace them is a heap of terrible press that nobody is | going to want to touch. I think it's the same reason | Hollywood reached the AI agreement recently too. | Halong wrote: | My wife is paying our mortgage teaching English on Preply. I'm | extremely worried about where we'll be in 10 years. | ilaksh wrote: | What did she pivot to? I don't think any currently existing job | is really safe in the medium-to-long term. | Jayakumark wrote: | How does this compare to whisper-large-v3 on STT? | trovas wrote: | I work on Seamless. You can see the results in the paper. M4Tv2 | is significantly ahead (Whisper Large v3 - 16.9 BLEU vs. M4Tv2 | 26.6). These are averages over 81 directions X->English. | 999900000999 wrote: | Can't wait for someone to roll a language tutor out with this | tech. | | Everyone gets a personal tutor for hours a day. | | I would absolutely love a VR game where I just need to work in | China or Mexico all day and pick up the language that way. | modeless wrote: | This is what I'd like to build (the tutor part at least, not | the VR game part yet). I'm planning to extend my current | English-only rough prototype[1] to support Mandarin. (I happen | to be learning Mandarin myself at the moment, and there are a | bunch of open-source bilingual Mandarin LLMs and speech | synthesizers from China to choose from.) | | I think a lot of people are working on similar things right | now. 
I know of one called http://yourteacher.ai | | [1] https://apps.microsoft.com/detail/9NC624PBFGB7 | siraben wrote: | Is there a high quality speech synthesizer (ideally local) | for Mandarin you have found? There are some subtleties with | tone sandhi rules and how they interact with prosody that I | feel are lacking with current TTS voices I've tried. | modeless wrote: | The first one I plan to try is https://github.com/netease- | youdao/EmotiVoice | | I don't have the expertise to judge the quality of Mandarin | pronunciation myself, being a beginner. But it sounds OK in | English and it's made by native Mandarin speakers in China | so I expect that it sounds better in Mandarin than English. | siraben wrote: | Sounds pretty good, although still lacking in natural- | sounding tone sandhi (e.g. try Yi Xia , it should be | yi2xia4 instead of yi1xia4). | gattr wrote: | I love the idea of LLMs being super-efficient language | tutors. And you have a good point; coming soon: "We've been | getting a lot of these tourists here lately, they're eerily | fluent, but all seem to have the same minor speech | impediment" (read: messed-up weights in a commonly used | speech model). | siraben wrote: | I've been using ChatGPT 4 to translate and explain | various texts in Mandarin and it's been very on point | (checking with native speakers from time to time, or | internet searches). As expected, it has trouble with | slang and cross-language loanwords from time to time. | However for languages with much lower information online, | it hallucinates like crazy. 
| | > coming soon: "We've been getting a lot of these | tourists here lately, they're eerily fluent, but all seem | to have the same minor speech impediment" | | Haha, if that were to pass, that would still be a far | better outcome than our current situation of completely | blind machine translation (this is especially for various | Asian languages that are very sensitive to phrasing) and | mispronunciation by non-native speakers. | bityard wrote: | > all seem to have the same minor speech impediment | | Ah, that is called an accent. | dontupvoteme wrote: | Kind of, Accents are typically derived from the | intersection of natural languages, specifically which | ones you learned the phonetics of first. (With the | exception of the Mid-Atlantic accent...) | | This would be something quite novel as the speech | irregularities would not have their origin in people | | I don't know what you would call it but it needs at least | some adjective before accent to differentiate it IMO | rnjesus wrote: | the azure neural tts voices in chinese are the best i've | heard, specifically the "xiaochen" voice. i use it in anki | daily to generate sentences for my mandarin decks with an | api key/plugin. it's not something you run locally of | course, but they have a decent enough free tier. | | i'm hoping a voice as realistic as this becomes a local app | soon, but i've not found anything that's nearly as natural | sounding yet. (also, honorable mention to chatgpt's "sky." | she pronounces mandarin with a funnily american accent, but | it sounds natural and not as robotic as the open-source | alternatives i've tried) | meowtimemania wrote: | There's already a few of them. Checkout https://hallo.ai | 999900000999 wrote: | I wouldn't feel good about anything that's not focused on a | single language. | | You end up with the Duolingo problem where you know to say | the names of 20 different fruits but not how to introduce | yourself. 
| apwell23 wrote: | > You end up with the Duolingo problem where you know to | say the names of 20 different fruits but not how to | introduce yourself. | | Not sure if this is a Duolingo problem. There are | modules in Duolingo specifically for saying your name. I | think it's the travel module. | coldtea wrote: | Never seen that in Duolingo. It starts with the basics and | phrases, not random useless vocabulary. | cptskippy wrote: | I was going to Italy and started using Duolingo to try | and help. I learned such useful phrases as "the children | have bread". | gs17 wrote: | Duo has a different problem for me. The lack of focus means | some languages don't get features. Chinese still doesn't | have Stories (there's an unofficial version of it, but | we've been waiting _years_). | numpad0 wrote: | (The Duolingo problem, AIUI: Duolingo is designed around the | premise that, by exposing your subconsciousness to a | small set of words and phrases in target languages, your | brain should be able to trivially construct output shims | from Universal Grammar, which must exist, to desired | languages; but that doesn't work in practice and you end up | with the small set of words and phrases your subconsciousness | has recorded) | massimokris wrote: | Duolingo's problem is not that they have a bunch | of languages; it's that achieving fluency in a target | language is about being able to produce/generate phrases, | and they just have you consume and sort words and | phrases. In the case of any AI language tutor, the student | must produce phrases in order to practice, and that advances | them on the path to fluency | jahewson wrote: | Isn't having the AI do it for you better than having the AI | teach humans to do it? | dylan604 wrote: | Sure, if you're not into personal growth. Not everyone wants | to become the useless bit of lard sitting in a chair while a | computer does everything for them. Yet. 
Some of us still like | to do the actual things, but just need some assistance along | the way. We still have a bit of time before we're all the | humanoids from Wall-E | ericmcer wrote: | Yeah, that's why I mill my own grain and am getting into | textiles. | djvdq wrote: | I love when people use these pathetic extreme examples | when they don't have any meaningful arguments. | ericmcer wrote: | That isn't an extreme example at all; people used to mill | grain and make clothing by hand, now we don't. We somehow | are not sitting around getting fat even though technology | takes care of those tasks. | | The parent's suggestion is that if we don't have to learn | languages, that will lead to us all lying down drinking | Big Gulps while robot slaves take care of us. Their take | is the extreme example. People have literally made this | same suggestion about every technological advance and it | never comes true. | TeMPOraL wrote: | > _We still have a bit of time before we're all the | humanoids from Wall-E_ | | Obligatory reminder that the movie itself explains that | people are what they are _not_ because of their lifestyle, | but because of the time spent in a low-gravity environment. | dylan604 wrote: | not sure that really matters to the point | modeless wrote: | Even a perfect human translator following you around wouldn't | be anywhere near as good as knowing the language yourself. | whoisburbansky wrote: | It depends on what your goal is; for some tasks it's possible | that getting the AI to do it is best, but, e.g., the existence | of autopilot doesn't mean that hobbyist pilots wouldn't | benefit from/enjoy exercising the same skills manually. | swatcoder wrote: | _Maybe_ prior to fluency, for something like an odd business | or tourist trip. | | But there's a point in language learning where you can come | to express yourself directly in a new language without | intermediary "thinking" in your first tongue. 
The | communicative and expressive potential of that mode is much | higher than trying to squeeze one's intent through any kind | of translation, machine or internal. | | Plus, you know, it's fun. | j33zusjuice wrote: | Not necessarily. It depends on the use case. For taking a | vacation, having an AI that can instantly translate to your | native language would be amazing. That'd solve a lot of | real-world problems, no doubt. | | However, translation has a great deal of subjectivity | embedded in it, particularly when there aren't 1:1 | translations. Case in point: there are many English | translations of the Christian Bible, all similar enough, but | there are enormous variations in some cases. And there are at | least as many branches of Christianity as there are English | translations of the Bible. Some of them strictly recommend | the same translation, and they still disagree on the meaning | of various passages. | | Besides the problems inherent to translation, learning | another language gives you another paradigm of thinking. The | words we use, the way we construct sentences, etc., all | impact our view of the world. Here's a paper that discusses | the impact of the over-reliance on English in the cognitive | sciences, and how this has downstream effects: | https://www.sciencedirect.com/science/article/pii/S136466132... | | Learning languages as an adult also has protective benefits. | It reduces the probability of Alzheimer's (maybe dementia, | overall?). | coldtea wrote: | In the way that watching porn is better than having sex. | advaith08 wrote: | Seen a lot of these, but none for Indian languages. Would love | to try an Indian language one! | 999900000999 wrote: | Are Indian languages hard for English speakers? | thinkingtoilet wrote: | I'm learning Hindi and there are some things that are easy | (phonetic alphabet, nothing like 7 different sounds for | 'ough') but the sentence structure is very different and | can be hard to get right. 
Pronunciation isn't too bad for | the most part, but there are a few tricky things, for example | four different 't' sounds and four different 'd' sounds. | The hardest part is that there really aren't that many | resources. Even though Hindi is the third most spoken | language in the world, you will find far more resources for | many of the less spoken European languages. | tmountain wrote: | Started a project to do this a while back. It's pretty fleshed | out: | | https://www.parcero.ai/ | | I could integrate this instead of Polly pretty easily. | bilsbie wrote: | I think it would be so ironic if advanced AI ended up simply | teaching us new languages quickly instead of translating for | us. | toomuchtodo wrote: | Might be able to generate a better language than what we | have. | bilsbie wrote: | Good point. Maybe they invent a better language and easily | teach it to everyone. | dontupvoteme wrote: | Finally Esperanto has a use case! | spaceywilly wrote: | To me the key functionality for any language learning app is | giving you feedback on your pronunciation and general | understanding. I've been using Duolingo to learn Mandarin and | when I try to speak to anyone it's difficult for them to | understand me, because my pronunciation is all wrong. The app | is just feeding info to me one way, and I can try my best to | recreate what I'm hearing, but there's no way to know if I'm | messing it up. They do have a speaking feature but it doesn't | work very well, certainly not to the same level as speaking | with a real person who is fluent in the language and having | them correct you. | throwaway4aday wrote: | As a quick solution, you should try recording yourself | speaking and then listening to it to check your pronunciation | against some reference. 
So for example, find a YouTube video | in the language you're learning that also has good subtitles | (use https://filmot.com/ ) and listen to how they say the | phrase and then record yourself saying the same phrase and | play it back and compare. | dog321 wrote: | I practiced for a long time using the below pronunciation | trainer and I get a ton of compliments from native speakers | on how accurate my pronunciation is. | | https://fluent-forever.com/product/fluent-forever- | pronunciat... | inbread wrote: | I built just this a month ago with the Azure AI speech API, | which is already pretty good at multilingual speech. | | https://github.com/adrianmfi/gpt-tutor | | I look forward to testing if switching to Seamless can improve | it further, Seamless supporting nearly 100 languages is a nice | improvement. | jbird11 wrote: | Absolutely, what I've noticed is that the current apps are | great for beginners but after a certain point the only way to | improve your ability to speak a new language is to well... | speak it. I built Proseable to help people move beyond the | generic how to order a coffee or ask to go to the bathroom, and | have more meaningful conversations in the real world. Check it | out! | | https://www.proseable.com/ | Jeff_Brown wrote: | > game | | Yes! Better yet, you're a spy, or a hostage negotiator, or the | leader of any kind of enterprise (army, business, aid | organization) ... | | Programming games like that will resemble directing improv | theater. You can't program every response; you'll have to | instead fit each character with beliefs and motivations. | | I can hardly wait. | dontupvoteme wrote: | For Language Acquisition, Input Is All You Need. (Mostly) | | What would be really cool is something that can autodub videos | or audio into your target language. The hardest problem | learning languages that aren't English is often finding content | to consume in them. 
| | Disclaimer: I am a Krashenist, so this take is biased | massimokris wrote: | I built one for people in Latam to practice languages in a | conversational way through a WhatsApp chat | https://wa.me/+5491162951713?text=hola%20Speakeasy | flanbiscuit wrote: | I would love a game that helped you learn a language (not | necessarily VR though, as I don't have that equipment). The game | drops you into a world (a country of the language the game is | meant to teach you) where no one speaks your language and you | have to figure out what people are saying in order to fulfill | quests. You get some hints, like maybe you have a simple | translation guide in your inventory, or sometimes you meet | people who can speak a few words of your language. That would | motivate me to learn faster than self-taught tutorials. | | I'd love to learn French, and the game would take place in | locations all around modern France. | | It would have to have a good story. Maybe something in the style | of the Professor Layton series could be interesting, or something | more open world. | dwighttk wrote: | and the language tutor company could have you pilot around a | menial labor droid while you are learning... | zbyforgotp wrote: | But will people use them? | pnut wrote: | I was hoping to find out that the actor's voice in the demo | video was generated, or that he had recorded the video speaking | in another language or something. | | That would have been the knockout punch. | polygamous_bat wrote: | "The Babel fish is small, yellow, leech-like, and probably the | oddest thing in the Universe. It feeds on brainwave energy | received not from its own carrier, but from those around it. It | absorbs all unconscious mental frequencies from this brainwave | energy to nourish itself with. It then excretes into the mind of | its carrier a telepathic matrix formed by combining the conscious | thought frequencies with nerve signals picked up from the speech | centres of the brain which has supplied them. 
The practical | upshot of all this is that if you stick a Babel fish in your ear | you can instantly understand anything said to you in any form of | language. The speech patterns you actually hear decode the | brainwave matrix which has been fed into your mind by your Babel | fish. "Now it is such a bizarrely improbable coincidence | that something so mind-bogglingly useful could have evolved | purely by chance that some thinkers have chosen to see it as a | final and clinching proof of the non-existence of God. | "The argument goes something like this: 'I refuse to prove that I | exist,' says God, 'for proof denies faith, and without faith, I | am nothing.' 'But, says Man, the Babel fish is a dead giveaway, | isn't it? It could not have evolved by chance. It proves you | exist, and, by your own arguments, you don't. QED.' 'Oh dear,' | says God, 'I hadn't thought of that,' and vanishes in a puff of | logic." | fassssst wrote: | Try the demo here, you record a video of yourself and it does | voice cloning and a comparison: | | https://seamless.metademolab.com/expressive/?utm_source=meta... | ceejayoz wrote: | > This research demo is not open to residents of, or those | accessing the demo from, the States of Illinois or Texas. | | Interesting mix. | solardev wrote: | Illinois has a facial recognition / cloud biometrics ban. | Familiar face detection for doorbells etc. isn't allowed | there. Wonder if Texas has something similar? | ceejayoz wrote: | Ah, that makes sense. | | In Texas it seems to be part of AG Paxton's culture war | stuff. https://www.texastribune.org/2022/05/12/texas-face- | filters-i... | aschla wrote: | Likely related to biometrics laws. I know Illinois has | restrictions on the collection of biometrics, not sure about | Texas. Facebook in particular paid out a significant amount | of money in a class action in Illinois, I know because I got | a chunk of change from it. 
| dylan604 wrote:
| by which you mean someone took a dime and carved off a piece
| of it, and then sent you a piece of paper with postage that
| cost more than the value of that chunk? yeah, we all got
| hosed by that one too, i'd imagine
| ceejayoz wrote:
| https://www.nbcchicago.com/news/local/illinois-facebook-user...
|
| > According to the Settlement Administrator, payments to
| class members between $200 to $400 started going in the
| mail May 9.
|
| I got a $0.19 check from an iTunes settlement once, but
| this wasn't one of those cases.
| jlund-molfese wrote:
| It's because of
| https://www.ilga.gov/legislation/ilcs/ilcs3.asp?ActID=3004&C...
|
| Facebook has had to pay out hundreds of millions of dollars
| in settlements for related class-action lawsuits, and rather
| than trying to get informed consent, they're deciding not to
| collect biometrics from residents of those states.
| SillyUsername wrote:
| And that demo is now overloaded and fails to translate the
| input :D
| teacpde wrote:
| As someone working in tech and following along the progression
| of AI, I believe I have the right expectation. But it still feels
| surreal seeing myself speaking a foreign language in my own
| speech style.
| wedn3sday wrote:
| Well that was spectacularly bad. Failed to translate a single
| word from English->Spanish. Admittedly I was using George
| Carlin's favorites, but if you're trying to have an expressive
| language translator that refuses to translate "fuck" then what
| you've got is bullshit.
| StrangeDoctor wrote:
| Any more info about the watermarking? Only Meta can make the
| determination?
|
| Edit: I can't find the weights, but if I'm reading the paper
| right anyone could train their own detector.
| hadyelsahar wrote:
| Hey! An RS from the Meta Seamless team here.
|
| Yes, we chose not to release the watermark detector to
| safeguard against adversarial attacks. This decision helps
| prevent any attempts by malicious users to erase the watermark.
| | The watermark generator and detector are trained together; one
| can use the information in our paper to train their own
| generator and detector model, however in this case the
| watermark signature created will be distinct from the one we
| use to protect our Seamless translation models. This approach
| ensures each model maintains its unique security features.
| StrangeDoctor wrote:
| Thanks for clarifying, and it seems like a completely reasonable
| approach. Thanks for the great work.
| gagabity wrote:
| I had pretty terrible results when I tried English -> Swahili.
| I'm using the Hugging Face M4T V2 Spaces; it pretty much doesn't
| work most of the time, and I just get English back with a
| different voice. Expressive, on the other hand, only has a few
| languages it seems.
|
| It would be nice if they could lay out what exactly is missing in
| terms of data to make a language work better; while the actual AI
| bit is out of reach for most of us, maybe we could provide more
| data.
|
| There is also a 60 sec limit, and I wonder if this is a Hugging
| Face limitation or a Seamless one.
| yorwba wrote:
| > maybe we could provide more data.
|
| If you want to contribute by recording yourself speaking
| Swahili, https://commonvoice.mozilla.org/sw is the place to go.
| Although Meta has access to much larger data sets, they
| nonetheless use Common Voice as a "known good" source. E.g. the
| paper on their SONAR speech encoder reports experiments on
| Common Voice data, coincidentally involving Swahili:
| https://ai.meta.com/research/publications/sonar-sentence-lev...
| whbrown wrote:
| Can anyone help demystify the licensing?
|
| Besides the ACCEPTABLE_USE_POLICY, there's a CC BY-NC 4.0
| (NonCommercial) license, a 'SEAMLESS_LICENSE' (NonCommercial),
| but also an MIT license? It would seem these other licenses
| contradict the MIT license; could somebody help clarify how these
| all interact in practice?
| dankle wrote:
| MIT for the code, NonCommercial for the trained models I bet.
| disattention wrote:
| The license details are listed on the project GitHub
|
| https://github.com/facebookresearch/seamless_communication#l...
| jeffbee wrote:
| How will Meta put these models into practice? I understand why
| Google and Apple have models for their mobile OS users, but I
| don't understand where users for Meta speech models come from.
| Are they planning to show Instagram videos with English narration
| in French or what?
| solardev wrote:
| Ads in any language!
| polygamous_bat wrote:
| Ads and Reels (their TikTok competitor) I imagine would be the
| primary use-case. Imagine spreading the "wonders" of TikTok-
| like videos to the non-$native_language-speaking world.
| dylan604 wrote:
| but isn't it a TikTok shtick to use the obviously fake
| voice in your video?
| crakenzak wrote:
| They have arguably the most diverse userbase of any company,
| with users from pretty much every single country + language
| across all their services & apps. I could easily imagine a
| handful of use cases where having a high-performing universal
| translation model would be incredibly useful.
| spacemanspiff01 wrote:
| The metaverse will not have any language barriers...
| beders wrote:
| I'm thrilled to see the progress made in the last 30 years.
|
| As a student in the mid-90s I worked on a system called Verbmobil
| at the German Research Center for AI, and it did speech-to-speech
| for English, German and Japanese in a very limited domain.
|
| This was done via "classical" NLP: you had to model the domain
| with concepts, you needed sentence parsers, semantic engines,
| speech-to-text hand-crafted for 3 languages, etc.
|
| As it turns out, this approach is/was a dead end.
| kapp_in_life wrote:
| Neat. How translatable are tones of voice for intent across
| languages? Like does a person trying to do a "nerdy"
| voice (nasally, whiny, etc.)
in English translate to the "nerdy"
| stereotype for a French speaker. Seems to do very well on
| whispers, which made me wonder what could be next.
| jeffbee wrote:
| If you don't speak the language into which these models
| translate your inputs, how do you know if or why the model has
| generated, without being commanded to do so, a campy American
| gay male sociolect, or an African American regional accent, or
| some other thing that may convey unintended meaning to native
| listeners?
| apwell23 wrote:
| .
| jvolkman wrote:
| The Google Translate app has a conversation mode.
| wg0 wrote:
| And just the other day, StyleTTS[0].
|
| Just text to speech has gone too far. Audio books would be mainly
| generated on the fly like this?
|
| I think some RPGs in some 5 years' time might have something like
| this:
|
| - A text file that outlines characters and a loose plot/story
| line. Human written.
|
| - 3D mesh generation based on character descriptions via
| Transformer-based models. Auto generated.
|
| - Dialogues for each NPC via LLM.
|
| - This TTS engine, again based on such models.
|
| Result: almost unlimited replayability. Or even edit the text
| file and have a new world based on a new story line, with
| characters having different personas.
|
| [0]. https://news.ycombinator.com/item?id=38335255
| mpalmer wrote:
| How has TTS gone too far?
| wg0 wrote:
| Came a long way, that is. From the days of, if I recall
| correctly, the Windows 98 screen reader.
| TheCaptain4815 wrote:
| The demo is so much fun to use. I can't wait for all these
| technologies to start integrating into filmmaking / games.
| anonzzzies wrote:
| How far from a real-time Star Trek translator? Whisper is fast
| enough and light enough, LLMs are getting there, so it's close,
| isn't it?
| Sol- wrote:
| Seems like there will always be latency, because it's not
| possible to easily stream across languages that have different
| structure.
You need to wait a bit before you can start
| faithfully translating the meaning.
|
| They also mention it in one of the videos about the streaming
| variant of their translator. But I guess the ~2s delay they
| mention is close enough for practical purposes.
|
| I feel like for personal relationships where true real-time is
| required, having a computer intermediary would be weird anyway
| and you have to learn the language, at least for the time being
| and as long as personal relationships are still relevant (in
| the post-AI world they might not be).
| forgot_old_user wrote:
| > You need to wait a bit before you can start faithfully
| translating the meaning
|
| I guess it's possible that the AI learns about a specific
| person over time? That way it can be confident about what's
| being said as soon as the person starts saying it
| ziptron wrote:
| If you are multilingual but have young children and plan to
| continue residing in your current English-speaking country for
| the foreseeable future, are you opting to teach your children
| those additional languages, or are you adhering to the idea that
| they can always learn those languages later if necessary,
| considering it might not be essential (esp. with models like
| this)?
| esafak wrote:
| It is easier to learn multiple languages when you are young.
| robga wrote:
| There isn't a lot of good evidence behind this popular
| conception.
|
| If anything, the evidence is that it isn't true; see
| https://journals.plos.org/plosone/article?id=10.1371/journal...
|
| Any apparent causality of age of acquisition seems to be a
| proxy for hours of exposure. It may well be that it is easier
| for young people to rack up a lot of exposure to a second
| language, but there's not much evidence that age plays much of a
| factor for people of different ages who had the same degree
| of exposure.
| debugnik wrote:
| > we argue that the late learners resort to computationally
| less efficient processing strategies when confronted with
| (lexically determined) syntactic constructions different
| from the L1.
|
| > we show that the ERP signal in response to grammatical
| violations depends on the AoA of an L2 learner, as well as
| on the regularity of the structure under investigation. In
| (lexically determined) syntactic constructions different
| from the L1, we found a gradual change in processing
| strategies that varies by AoA, with a native-like effect
| for early learners and a less efficient neural processing
| strategy for later starters.
|
| Although they do clarify that these effects _could_ be
| confounded with age of acquisition instead of it being the
| cause.
| navbaker wrote:
| Seamless Streaming looks really promising! We just had a new
| employee start a few months back with profound hearing loss, and
| our company had no idea what to do with him from an accessibility
| standpoint. They threw out solutions like Dragon, not realizing
| those solutions are not real-time.
|
| He ended up rolling his own solution by standing up Whisper in
| one of our clusters and writing a basic front end and API to take
| his laptop's mic input and chunk it every few seconds to send to
| the model and get back text in pseudo-realtime. We got him a
| pretty beefy Alienware so he wouldn't be tied to the cluster
| GPUs. I can't wait to see what he does with these new models!
| cgb223 wrote:
| Just wanted to say you're a great employer to be so incredibly
| accommodating to the point you get them an Alienware and let
| them roll an accessibility solution
|
| We need more support for employees like this!
| cced wrote:
| Second this!
|
| Also, what about Apple's latest M3 series chips? Are these in
| the same realm as Alienware in terms of AI compute?
| jackson1442 wrote:
| I think generally the consensus on Apple Silicon is that
| they're great _for a laptop_, but still aren't going to
| beat a dedicated graphics card + high-end CPU like an i9/Ryzen
| 9. The biggest thing going for Apple is the performance/watt,
| though, which is critical for a laptop.
| cjbprime wrote:
| I think this is missing the main reason to use Apple
| Silicon, which is that your dedicated graphics card
| probably has 24GB or less of RAM, whereas e.g. an M2
| Ultra Mac Studio can have 192GB of RAM with far
| superior memory bandwidth to anything on x86. This is
| important because even a "small" LLM like Llama2 13B
| would require quantization to fit in the 24GB RAM that
| the dedicated graphics card will give you, whereas the
| Mac could run Llama2 70B without quantization (at FP16).
| aftbit wrote:
| Whisper doesn't need that much RAM though.
| willy_k wrote:
| They definitely are in terms of energy efficiency
| nodja wrote:
| They're better than most consumer x86 CPUs but worse than
| using a GPU. Where they shine is when the ML model can't
| fit in the GPU's VRAM, since you have better options for RAM
| size with Macs.
| romwell wrote:
| >Just wanted to say you're a great employer to be so
| incredibly accommodating to the point you get them an
| Alienware
|
| So gracious, to give a software developer some hardware to
| run the software they _need to work_, that costs a whopping
| _nothing_ more than what other people in the industry get on
| average.
|
| >and let them roll an accessibility solution
|
| "You're such a good employer! You let your employee build
| _their own_ accessibility ramp to the back entrance _in their
| own time_, and _even_ got them a mortar spatula to do so!"
| We need more support for employees like this!
|
| >We need more support for employees like this!
|
| And less support for _employers_ like this.
| Solvency wrote:
| Not sure why you're being downvoted. Literally the
| equivalent of building your own ramp.
| freedomben wrote:
| I didn't downvote, but I considered doing so because
| nowhere that I saw in GP does it say _in his own time_,
| and that's a critical piece of the equation.
| Hallucinating that datum means they got the argument
| wrong, and worse, they were harshly critical of the
| company based on that _wrongly assumed_ information.
|
| It reminds me of the Homer Simpson quote, "I don't mind
| being called a liar when I'm lying, or about to lie, or
| just finished lying, but NOT WHEN I'M TELLING THE TRUTH!"
| I would be equally critical if it was warranted, but when
| it isn't, it's deeply unfair to the accused.
|
| If the person _wanted_ to build their own ramp, and the
| employer let them do it on the clock, that's a
| completely different scenario than the employee having to
| come in during their off-hours to build the ramp just so
| they can go to work.
| qkeast wrote:
| Awesome! I love hearing about places making the effort to be
| inclusive.
|
| As someone who's profoundly deaf myself, another less technical
| approach is to install Rogue Amoeba's Loopback, and use it to
| pipe audio from a given app into a tool like Google Meet or
| Otter.ai using the Loopback device as the audio source. This
| effectively provides real-time captions for anything running on
| your existing machine.
| tuukkah wrote:
| Clever use of Google Meet as a tool! Also, Google Pixel
| phones now provide realtime captions for any speech playing on
| the phone (Accessibility > Live Caption). You can also choose
| a "preferred language" and the captions will be automatically
| translated to that language from other languages.
| jallmann wrote:
| Google Chrome [1] also has captioning built-in [2], so this
| could also work from a plain page that hooks into the
| loopback device. Pretty sure it's using the same speech-to-text
| backend that Google Meet uses.
| | The nice thing about the Chrome feature is you can move the
| caption box around and keep it in the foreground while doing
| other things, although styling options seem limited (the text
| might be a little small for some).
|
| [1] on desktop, not sure about mobile
|
| [2] via chrome://settings/accessibility -> Live Caption
| romwell wrote:
| >Awesome! I love hearing about places making the effort to be
| inclusive.
|
| The extent of the effort being getting their employee a
| slightly-more-expensive-than-average tool that would enable
| them to do their job better _regardless_ of the disability?
|
| Such inclusive, much pat-yourself-on-the-back, wow.
|
| "We gave our woodworking shop employee a quality saw so that
| they'd make _their own_ accessibility ramps!"
| callalex wrote:
| What would you have them do instead?
| qkeast wrote:
| I have literally been told in job interviews that the
| company would not be "allowed" to hire me because I'm
| hearing impaired, so yes, making an effort to support an
| employee's disability and their needs is worth recognizing.
| RogerL wrote:
| So what? Okay, in the case of a ramp, if you need one you
| probably are going to have difficulty building one. So pay
| employee Sally to build it instead, absolutely.
|
| But hearing loss does not impair standing up servers and
| software. They can pay the employee who probably is the
| expert at this, the guy with the hearing loss, or go task
| Emil to do it to ... avoid 'appearances'?
| pawelduda wrote:
| That's very nice of you
| romwell wrote:
| >He ended up rolling his own solution
|
| >That's very nice of you
|
| ...doesn't compute.
|
| What exactly was nice here?
| diab0lic wrote:
| > We got him a pretty beefy Alienware so he wouldn't be
| tied to the cluster GPUs.
|
| Probably this.
| lovich wrote:
| Y'all should turn that into a product, or at least open source
| it and get the positive PR + helping others
| FloatArtifact wrote:
| > Y'all should turn that into a product, or at least open
| source it and get the positive PR + helping others
|
| There you go. https://github.com/dictation-toolbox/dragonfly
| kylixz wrote:
| I recommend checking out: https://talonvoice.com/
| FloatArtifact wrote:
| It's not open source, nor does the author intend to open the
| stack.
| aftbit wrote:
| Check out Willow! It does essentially this, using WebRTC. It
| doesn't handle the near-real-time response yet, but it does
| stream the audio to the server, and the change would be pretty
| minor.
| FloatArtifact wrote:
| > Check out Willow! It does essentially this, using WebRTC.
| It doesn't handle the near-real-time response yet, but it
| does stream the audio to the server and the change would be
| pretty minor.
|
| Simple voice-to-text is not what's needed for dictating
| commands. Unless I can load commands on the fly and decode
| utterances against them, it may not be useful.
|
| The client would need to be able to send its commands to the
| server on the fly.
| FloatArtifact wrote:
| The problem with Whisper is it's not really optimized for
| command recognition versus general dictation.
|
| - Whisper processes 30-second audio chunks. So if you process 5
| seconds of audio, you have to pad it out with 25 seconds of
| silence. Hence a loss of efficiency, with CPU / GPU cycles
| wasted on 25 seconds per chunk in the case above.
|
| - Whisper most likely can't handle hundreds of commands, much
| less a thousand, performantly.
|
| - Whisper doesn't handle short commands very well, nor does it
| accurately post-process commands out of free-dictation
| utterances.
|
| Command dictation should be weighted higher than general
| dictation when decoding.
|
| I work with a little under 1,500 commands in Dragon
| NaturallySpeaking.
DNS is hot garbage as a program, despite having the
| best accuracy to date with the feature of commands and
| dictation in one utterance. You get to pay $750 for the
| privilege.
|
| I've yet to see a free and open source speech recognition
| engine that can handle both dictation and commands with a high
| degree of accuracy.
|
| Please, please let me know if there are alternatives out there. I
| would definitely pay to support an open source project like
| this that focuses on commands and dictation.
|
| Most solutions out there that are open source nowadays focus so
| much on IoT command recognition with intents. That's not well
| suited for controlling your computer with grammars containing
| voice commands.
| novok wrote:
| Is 30s the input size set by the model, or by programs that wrap
| the model? Is it how it's trained?
| bakkoting wrote:
| It's a property of the model itself.
|
| > Input audio is split into 30-second chunks, converted
| into a log-Mel spectrogram, and then passed into an
| encoder.
|
| https://openai.com/research/whisper
| sagz wrote:
| Do they need realtime transcription?
|
| Computer: webcaptioner.com; Android: Live Transcribe
| (g.co/livetranscribe); iOS: Live Caption with the 'mic' icon
| enabled.
|
| Web conferencing: Meet, Zoom, and Teams all support realtime CC,
| which is pretty good.
| londons_explore wrote:
| Does "reduce toxic words" and "promoting safer communication"
| mean that if you say something wrong about LGBTQIA+ people it
| will 'correct' what you say?
|
| I'm not sure I want the latest twitter trend to be involved in
| the design of my translator...
| jwineinger wrote:
| Their video said it was to reduce toxic word hallucinations,
| which does seem admirable/useful. I'm testing real-time
| translation in a church setting, and I've witnessed Whisper
| hallucinating profanity, which is quite undesirable.
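The fixed 30-second window quoted above from the Whisper write-up is easy to see in code. The sketch below is a simplified NumPy reimplementation of the padding step (the real openai-whisper library exposes it as `whisper.pad_or_trim`); it is an illustration of the behavior, not the library's actual code:

```python
import numpy as np

SAMPLE_RATE = 16_000           # Whisper expects 16 kHz mono audio
CHUNK_SECONDS = 30             # fixed model window
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples

def pad_or_trim(audio: np.ndarray) -> np.ndarray:
    """Zero-pad (silence) or truncate audio to exactly one 30 s window,
    mirroring what Whisper does before computing the log-Mel spectrogram."""
    if audio.shape[0] >= N_SAMPLES:
        return audio[:N_SAMPLES]
    return np.pad(audio, (0, N_SAMPLES - audio.shape[0]))

# A 5-second command still costs a full 30-second window:
command = np.zeros(SAMPLE_RATE * 5, dtype=np.float32)
print(pad_or_trim(command).shape[0] / SAMPLE_RATE)  # 30.0
```

This is why the 5-second example above wastes 25 seconds' worth of compute on silence unless the caller batches or crops the window.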
| cgb223 wrote:
| "Toxic word hallucination" would be a great punk rock band
| name
| kelseyfrog wrote:
| It also happens to be quite hilarious.
| mortimerp9 wrote:
| Hi, I work on Seamless. What this refers to is added toxicity
| mitigation. We try to detect the level of toxicity in the input
| and make sure that the output toxicity level is not higher.
| This protects the model from making egregious errors in the
| translation.
|
| There are more details in the paper, and the mitigation code is
| all open source if you want to check what it actually does.
| Domenic_S wrote:
| > _What this refers to is added toxicity mitigation._
|
| Oh, well _that_ clears it up! </snark>
|
| I don't see any definition of 'toxicity' on the landing page
| - it seems to be one of those 'I know it when I (hear) it'
| kind of words... unless there's some widely-accepted
| definition in this area of study?
| mortimerp9 wrote:
| Sorry if I wasn't clear; internally we've been talking
| about it a lot, but I forgot that it doesn't have such a
| solid definition outside of our work. Thankfully, we try to
| define it in section 7.3 of the NLLB paper:
| https://arxiv.org/pdf/2207.04672.pdf
|
| The tldr is that if you say "Thank you for this job
| offer.", you wouldn't want it to be (mis)translated as "Go
| F*k yourself.". But if you do say "Go F yourself", you
| still want it to be translated as that.
| Reubend wrote:
| That's an awesome feature. I think one of the worst possible
| outcomes of machine translation is something that ends up
| being accidentally offensive, and this is a smart way to
| mitigate that.
| fl7305 wrote:
| > one of the worst possible outcomes of machine translation
| is something that ends up being accidentally offensive
|
| The Hitchhiker's Guide To The Galaxy claims the opposite:
|
| "Meanwhile, the poor Babel fish, by effectively removing
| all barriers to communication between different races and
| cultures, has caused more and bloodier wars than anything
| else in the history of creation."
| SoftTalker wrote:
| Or maybe we'll finally come around to the idea that being
| offended by _words_ doesn't make a lot of sense.
| dontupvoteme wrote:
| How do you account for colloquial (non-English) language
| which could be naively misconstrued as toxic?
|
| e.g. "geil" (either cool or horny depending on usage) in
| German
|
| It's not fundamentally different from e.g. "wicked" in
| English, but the biggest bias that potentially all these ML
| models exhibit is a predisposition towards Anglophoneism
| mortimerp9 wrote:
| Our goal is to have good recall, sometimes to the
| detriment of precision, so for words with multiple
| meanings, it might consider them toxic even when, in the actual
| context they are used in, they are not. The toxicity
| mitigation algorithm will search for alternative
| translations that have the correct meaning but not the
| potentially toxic word, so that there is no added toxicity
| in the output. This means that sometimes the model might
| prefer a less colloquial phrasing than what a human would.
|
| You can find details on how the multi-language creation of
| the toxicity lists was done in section 7.3 of the NLLB
| paper: https://arxiv.org/pdf/2207.04672.pdf. TLDR: it's not
| just a translation of a base English list; even if we
| started from that, each language has a curated list that
| was built by professional translators.
| dontupvoteme wrote:
| That's significantly less myopic than I pessimistically
| assumed. Thanks!
| novok wrote:
| Is there an ability to turn it off?
If you're translating an
| R-rated movie with criminals who swear a lot, is it possible
| to get the unfiltered output, to make sure it's being
| translated properly?
| mortimerp9 wrote:
| It only kicks in if the output is more "toxic" than the
| input. If the input has a lot of swear words and the output
| has the same amount, then it will be left alone.
| beardicus wrote:
| the site makes it pretty clear in multiple places that they're
| talking about "added" or "hallucinated" toxicity. maybe your
| culture war outrage is misplaced?
| Domenic_S wrote:
| Ok, so I know nothing about how this works. It seems like if
| the model was able to properly detect words in the first
| place, it would never hallucinate 'toxicity'; if it _can't_
| recognize the word with high probability, how will it know
| whether the speaker actually said $toxicWord or whether it
| should print something else?
|
| Perhaps it's taking a Big List of Naughty Words and weighting
| them so that the system must be "extra sure" that's what the
| speaker said, or else fall back to a G-rated word?
| numpad0 wrote:
| Maybe it's for preventing unwarranted fucks[1]? Translation
| is more than just concatenating dictionary definitions, and
| machine translations routinely make these kinds of out-of-place
| and technically correct lookups.
|
| 1: https://www.google.com/search?q=engrish+fucking+sign&tbm=isc...
| mortimerp9 wrote:
| Meta employee here. The system is not perfect, or it would
| not "hallucinate"; while it's pretty good, it does sometimes
| make errors (not just hallucinations, maybe just some
| mistranslations due to noise in the training data). What we
| want is to prevent these errors from introducing toxicity
| (think swear words) that wasn't in the input, as this could be
| very bad for the user. There is a separate system that
| double-checks the output (compared to the input) and tells
| the translation model to try again if it's too bad.
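The double-check-and-retry behavior described above can be sketched in a few lines. Everything below is a hypothetical stand-in (the word-list scorer and the `translate` callable are illustrative, not the actual Seamless APIs); the point is only the control flow: accept a translation hypothesis if it adds no toxicity relative to the input, otherwise ask for another one.

```python
# Hedged sketch of "added toxicity" mitigation as described in the thread.
# toxicity_score() and the translate callable are illustrative stand-ins.
from typing import Callable

TOXIC_WORDS = {"dang", "heck"}  # stand-in for a curated per-language word list

def toxicity_score(text: str) -> float:
    """Fraction of words appearing on the toxic-word list."""
    words = text.lower().split()
    return sum(w in TOXIC_WORDS for w in words) / max(len(words), 1)

def mitigate(source: str,
             translate: Callable[[str, int], str],
             max_retries: int = 3) -> str:
    """Accept the first hypothesis whose toxicity does not exceed the
    source's; otherwise request an alternative hypothesis and retry."""
    budget = toxicity_score(source)          # swearing in the input may pass through
    output = translate(source, 0)
    for attempt in range(1, max_retries + 1):
        if toxicity_score(output) <= budget:
            break                            # no *added* toxicity
        output = translate(source, attempt)  # e.g. next beam-search hypothesis
    return output

# Dummy translator whose first hypothesis hallucinates a toxic word:
hypotheses = ["dang translation", "clean translation"]
print(mitigate("a clean input", lambda s, i: hypotheses[min(i, 1)]))
# clean translation
```

Note that a toxic input keeps its "budget", so swearing in the source is still translated rather than censored, matching the behavior described above.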
| madeofpalk wrote:
| Your framing of basic respect as being a "twitter trend" is...
| bizarre.
| jadbox wrote:
| Your comment seems to imply LGBTQIA+ is just a Twitter trend,
| versus people's lived experience and lifelong identity. This is
| as unnecessarily judgmental as small identities claiming that
| straight people must self-identify as cis.
|
| There is no moral superiority in denying or force-labeling other
| people's identities. You're an attack helicopter? Great, roger
| dodger, let's go get coffee, Seahawk.
|
| No one is seriously asking for litter boxes in school bathrooms
| or helicopter refueling stations.
| mpalmer wrote:
| > No one is seriously asking for litter boxes in school
| bathrooms or helicopter refueling stations.
|
| This feels a bit out-of-nowhere.
|
| My read on the parent comment was that "Twitter trends" are fast-
| changing norms about what language is (un)acceptable. They
| were not saying that LGBTQIA+ identity itself is a trend.
| jadbox wrote:
| Perhaps so. In light of yesterday's Russia announcement
| labeling the "international LGBT public movement" as terror
| extremists, I think we should be careful what we label as
| fads or (worse) insidious activity. Source:
| https://www.themoscowtimes.com/2023/11/30/russia-bans-intern...
| mpalmer wrote:
| You seem to me to be arguing against points no one is
| making. You're taking the word "trend" and extrapolating
| it to "fad" and "insidious activity" - both of which have
| very different meanings and connotations to the phrase
| "Twitter trend".
|
| The original comment you replied to made the point that
| they don't want their own personal expression curtailed
| or modified according to someone else's opinion of
| acceptable speech.
|
| As someone who repudiates Russia's policies, I support
| and agree with their point.
| sjbase wrote:
| > Please don't use Hacker News for political or ideological
| battle. That tramples curiosity.
| | From the Hacker News guidelines
| zengid wrote:
| If "toxic word hallucinations" isn't a cyberpunk phrase, I don't
| know what is.
|
| (quote from the video presentation in the link)
| spacephysics wrote:
| Oh god, they're gonna censor the output. Time for Musk to make a
| non-censored version lol...
| drexlspivey wrote:
| I am sorry Dave, "merde" is not in the pre-approved word list
| dontupvoteme wrote:
| I wonder if it doesn't understand the common colloquial usage
| of "geil" in German. This sounds like it is going to mess up
| natural language
| troseph wrote:
| I feel like naming something "seamless" is not dissimilar to
| calling the Titanic unsinkable.
| bsza wrote:
| "We need access to your microphone and camera to record your
| voice and translate it with your expressions."
|
| None of the videos shows any modified/lip-synced footage. There
| doesn't seem to be a reason for this thing to need access to my
| camera.
|
| Also, using it with tape over the camera doesn't seem to work
| either. (Perhaps it needs to see facial expressions in order to
| work?)
| Havoc wrote:
| Can this also do straight TTS, or is it translation only? It
| isn't quite clear to me from the site
| tambourine_man wrote:
| Every video on this page is a bit out of sync with the audio.
| Combined with the blandness of facial expressions and the whole
| mood in general, I kept waiting for the moment when the video
| would disclose that everything on it was created by AI.
| nextworddev wrote:
| RIP ElevenLabs?
| Reubend wrote:
| Wow, after trying out the demo, I'm floored by how high quality
| this is. The translations worked perfectly, the voice cloning was
| "good enough", and the emotions conveyed in my voice were
| retained pretty accurately.
|
| I don't think this would fool anyone that I was a real native
| speaker of the target language, but for casual conversation this
| would work pretty much perfectly.
It basically avoids all of the
| traditional pitfalls of machine translation, like the unnatural
| robotic voice it outputs, the slow translation speed and
| huge latency for realtime conversation, and the loss of emotion.
| stephc_int13 wrote:
| As a native French speaker, I am surprised by the low-quality
| (frankly ridiculous) voice of the French translation example.
|
| Especially because the head of AI at Meta is a French guy, AFAIK
| (Yann LeCun).
| sangnoir wrote:
| They are optimizing for speed (low latency)
| yread wrote:
| Does the Spanish expressive sample sound muffled for others too?
| And the French sounds super mechanical. Hopefully, it's more
| impressive the other way.
|
| Also: "This research demo is not open to residents of, or those
| accessing the demo from, the States of Illinois or Texas"
| dentalperson wrote:
| Yes, they all have significant 'ghosting' artifacts where the
| harmonics are a bit fuzzy if you listen closely. AFAIK all of
| the recent neural speech engines have this, from SoundStream to
| EnCodec, especially in low-latency causal setups. WaveNet was a
| bit better in that regard but has fallen out of style due to
| complexity and the lack of a bottleneck. It seems like
| something diffusion post-processing would be able to clean up.
| TacticalCoder wrote:
| The "expressive" example in French exhibits a _thick_ accent,
| which bothers me more than the mechanical aspect of the non-
| expressive French example.
|
| It's not dissimilar to some kind of "ch'ti" / "chtimi" accent
| or a Belgian-French accent (which is not dissimilar to the
| French ch'ti accent, heard in some parts of the north of
| France): "Ne partez pooooo" (with a longer "a" which sounds
| nearly like an 'o'; that's not proper French at all) instead of
| "Ne partez pas".
|
| That said, I'll take the non-expressive accent any day over
| subtitles when watching video in a language I don't
| understand: it's clearly good enough.
| grogenaut wrote: | Illinois is possibly because they don't allow storage of | biometric data without express permission and I believe | explicit usage restrictions. So I bet they're keeping all of | your utterances, which would violate that law. | iFire wrote: | LICENSE | | Attribution-NonCommercial 4.0 International | | https://github.com/facebookresearch/seamless_communication/b... | iFire wrote: | Took me 2 minutes to find the Github. | nathanfig wrote: | Impressive work, really excited for this. | | I will note though that I feel safer getting an occasional bad | word than I do having a translator straight up deceive me. | | For example, "what the fuck" in English->Spanish is giving "que | diablos" output. Definitely toning down the meaning there. | | If someone says something mean to me, I want to know it. | jonathanlb wrote: | This may be an intentional decision given that there are | several ways to say "what the fuck" in Spanish, such as "que | mierda" or "que carajos". And that's not including regional | expressions like "que cono" or "que chingados". So, saying "que | diablos" may be the most common expression across dialects | conveying the same meaning. | nathanfig wrote: | Yeah could be, I still need to read the paper to better | understand the safety tuning. | | Would be interesting to see some work stress-testing the | ability to convey ill-intent across multiple languages. | Accurately conveying ill-intent is safety-critical for the | person being threatened. | trinovantes wrote: | Currently Steam bans games from using AI-generated assets (for | good reason). I wonder if they'll back track on this or carve | exceptions because this tech seems really useful for indie devs | to add voice work to their otherwise silent games. 
| yjftsjthsd-h wrote:
| Very speculative amateur opinion: My understanding is that
| Valve didn't exactly ban AI, they banned AI that was fed
| copyrighted works that could possibly make the results
| copyright infringement (
| https://www.theverge.com/2023/7/1/23781339/valve-steam-ai-ar...
| ). (Side note: Regardless of individual views on whether AIs
| are just copyright regurgitators or not, I can understand Valve
| being cautious until courts have actually decided.) So _if_
| speech models can be made purely from assets that their
| creators can prove they have the rights to use, it would
| probably be easy enough to get it approved.
| ChuckMcM wrote:
| I look forward to the day when I'm wearing my headphones in a
| foreign land and hearing all of the discussions in my own
| language.
|
| The "universal translator" which was part of Star Trek and a
| lot of other sci-fi I was exposed to as a kid was something I
| was really fascinated with. My Dad worked as a simultaneous
| French->English translator and sadly spent long hours away from
| home and, as a kid, I started trying to build a translator so
| that it could do his work and he could be home more.
|
| Translation is important work and one that could help a lot of
| people. It's my hope that we get to the point where these
| models work entirely on locally carried resources.
| sacvnsune wrote:
| If I am not wrong, Google Pixel Buds offer a live translate
| feature.
| echelon wrote:
| Not in the voice of the original speaker.
| stevenicr wrote:
| now if I could just get the pixel buds tech to remove the
| voice of the original speaker and translate some youtube
| videos from thick accent english into no accent am-english.
| ChuckMcM wrote:
| This is a really interesting use case. I could definitely
| see this as a service for content providers to get more
| reach, and I think you could justify a subscription price
| for the service based on this.
|
| By creating speaker-specific tonal ranges and profiles, you
| maintain better cohesion in the final product.
| keerthiko wrote:
| Obligatory, not directed at you in particular since I'm sure
| you mean no offense, but just voicing a pet peeve:
|
| I grew up bilingual outside the US, and speak English with a
| hybrid British/Indian/Middle Eastern accent (with some of my
| personal quirks, mixing in increasing amounts of various
| American accents over time). I can understand English in
| nearly any accent (Singaporean, Chinese, Vietnamese, Indian,
| Nigerian, eastern European) as long as the words involved are
| globally used and the grammar is passably queen's, especially
| after hearing it for about an hour. And people who natively
| speak English with these various accents usually can
| understand my English better than they can an average American
| accent. Yet in this country, my accent is belittled, despite
| being perfectly understood and more versatile. Even by others
| who don't speak with the American accent!
|
| This is the problem of the "default accent" anywhere being
| referred to as "no accent", and therefore anything deviating
| is considered "having an accent". This makes "accent" a
| negative trait, scaling from 0-bad to heavy-bad. But if the
| vernacular were such that we said "American accent" instead of
| "no accent", then no one's accent is bad, just one you're not
| used to.
|
| Most of my non-American peers who were raised on English have
| a better command of the language than my American ones, yet
| they are mocked for their accents as if they don't know the
| language, when in reality it's the Americans' lack of
| familiarity with the language (as it's used globally)
| preventing them from comprehending it.
|
| So yes, put in more work, the world is shrinking and English
| is the global language (for better or worse).
| What you're saying is spoken from a position of privilege,
| because the culture allows you to mock others' accents and
| imply your version of it is the correct one that everyone
| else should put in work to provide you with, rather than the
| other way around.
|
| Every time you hear English with an accent other than
| British, American or Australian, remember that it usually
| means the speaker knows at least one entire other language as
| well, probably one that you would sound like an idiot if you
| tried to speak it. Don't be rude or dismissive of their
| command of English.
|
| In fact, you were so close -- you called it a "no accent
| am-english", when you could have just called it what it is --
| "an american accent".
| freedomben wrote:
| I'm not OP, but doing what you did is a pet peeve of _mine_:
|
| > _What you're saying is spoken from a position of privilege
| because the culture allows you to mock others' accents and
| imply your version of it is the correct one that everyone
| else should put in work to provide you with, rather than the
| other way around._
|
| > _Every time you hear English with an accent other than
| British, American or Australian, remember that it usually
| means the speaker knows at least one entire other language as
| well, probably one that you would sound like an idiot if you
| tried to speak it. Don't be rude or dismissive of their
| command of English._
|
| This is such an uncharitable interpretation of GP that it
| makes me wonder if it's Poe's Law at play and you're actually
| trolling. Nevertheless, I will assume you are being serious
| and address your comments as such.
|
| You clearly have some deeply held frustrations (at a
| minimum), but unless you have a history with GP and therefore
| a _lot_ more context on them than I do from just reading
| these comments, or unless GP edited their post in between my
| writing this and reading yours, then you are majorly
| projecting onto them based purely upon negative stereotypes
| that you harbor against Americans. If I've missed the mocking
| or rude dismissiveness you refer to, then please point it out
| with a direct quote so I can further examine what you are
| referring to.
|
| There definitely are people (and definitely some Americans,
| though it's certainly not monopolized by them; I was once
| ridiculed by locals in Mexico City for my terrible Spanish)
| who "mock" accents and are generally assholes who don't
| appreciate the difficulty of speaking a non-native language,
| and many of them would deserve the criticism you've levelled
| at GP. But in unloading those accusations and chastisement at
| a person without cause, I don't think you're behaving any
| better than the people you would criticize.
| archagon wrote:
| I don't think it's unreasonable to remind people that a
| "default" accent does not exist, and that AI-editing an
| accent out starts to feel a bit like dystopian identity
| erasure and homogenization. Even if we scope ourselves to
| Americans speaking English as a first language, there are
| dozens of diverse accents across the country.
| ChuckMcM wrote:
| I think this is one of those times when my Mom, understanding
| my desire to be understood and to ask questions about motives
| and related understanding, would observe the, oblivious to
| me, effect of inflaming the conversation and say, "Charles,
| this is not the time." :-)
| archagon wrote:
| I don't like seeing a comment that's relatively reasonable
| get greyed out just because it grinds somebody's gears. Alas,
| I only have one counter-downvote to give, so I feel obliged
| to comment.
| stevenicr wrote:
| My original statement was wanting a translator device,
| hardware or software, so I could understand and learn better.
|
| There was no desire for identity erasure or homogenization;
| leave whoever's voice the way it is online, just give me an
| option to translate it. I added more about my issue
| downthread.
|
| Diverse accents across the country - absolutely! Which is why
| I said 'no accent am-english' (for me, as I can't learn well
| outside that) - and assuming this tech exists it could help
| me, and perhaps be tweaked to change to other accents for
| other people, as also mentioned in my downthread reply.
| stevenicr wrote:
| I appreciate your sharing, and stating that you assume I
| meant no offense, and that your thoughts are not directed at
| me specifically.
|
| I could have been more specific, but my request for the tech
| to vary, I think, would lead to specific options for
| different people.
|
| And actually, to be even more... not sure the word... I want
| 'the Chicago accent' I think it's called, or midwest / no
| accent. Personally, as much as I enjoy some entertainment
| from Jersey / NY accents, I would not volunteer to watch
| tutorials on tech taught by the Sopranos cast - as funny as
| that might be (and I get that if you are from the NE, you may
| be learning just fine being taught with such a language
| style).
|
| As annoying as some of the Cali style of language is, I can
| understand the words and meanings without squinting my ears
| and spending double the brain cycles trying to understand the
| words, while then interpreting the meaning, and then trying
| to put together concepts for understanding new ways of coding
| or using tech.
|
| I've run into folks in Louisiana that I could not understand
| at all and had to ask for an interpreter at a gas station.
| From Florida to Chicago to Seattle down to Miss and Ala, I
| can hear what people are saying and learn without spending
| lots of extra energy trying to understand.
|
| With that being said, I understand there are parts around
| Miami where accents may be thicker (or not) - and with some
| folks, even if they use the right words and grammar, I may
| need to slow down the speech to actually learn if they were
| teaching a class.
|
| The slow down and speed up options already exist with
| youtube.
|
| "So yes, put in more work"
|
| - I do try a bit. I don't mind accents with some folks and
| media. For example, I can listen to and enjoy Shankar sharing
| via the 'hidden brain' series, partially because his accent
| is limited but also because the media requires less thought
| intensity.
|
| I have tried many youtubes, and bought a few courses taught
| by folks in India and other places, where I just could not
| muster the energy. I literally squint with my ears and feel
| like my head gets hot trying to decipher what is being said,
| translate it into what is meant, and work out how it should
| create new patterns of understanding in my brain.
|
| I can only do that for so long and then I am done. Now I just
| skip any learning video that has non-am English speakers.
| When I consider courses to sign up for or buy, I have to
| research the authors / speakers and find video of them to
| hear the audio, because I just can't learn well that way.
|
| "other than British," - True story: a few years ago I had to
| call an ISP in Britain(?), and I could not understand the
| person I got to file an issue with. I had to ask 'what did
| you just say' many times. I laughed at myself for even
| thinking of saying 'can you slow down and speak clearer
| English please' - I mean, crazy... I was paying by the minute
| for the long distance at the time, and it ended up being a 25
| minute call that could have been 10 if I had a magic
| translate-without-accent device.
|
| "a position of privilege because the culture allows you to
| mock others' accents"
|
| - This is truly not about mocking accents, this is truly
| about my lack of ability to learn well.
|
| Yes, I would definitely sound like an idiot trying to speak
| another language. Like I said, I do not learn as well as some
| others.
|
| Truly not my intent to be rude. I apologize if the shortness
| came off that way; I was trying to be brief in the hope that
| there's a chance that some tech like this exists and someone
| here could point me to it. Before I posted, I DDG'ed it and
| found a couple of things attempting to be in that space, with
| a 'speak to sales' type of 'you'll never afford this' button
| for info.
|
| I will never be dismissive of anyone's command of English, or
| other spoken language, or computer language or anything like
| that. There is no way for me to know what situation and
| circumstances led someone to their current command of
| whatever language. If someone is trying to learn more at any
| age, I applaud and encourage them; being rude or dismissive
| does not encourage more learning.
|
| "no accent am-english", when you could have just called it
| what it is -- "an american accent". - Well maybe, but
| actually I meant to be more specific, as mentioned a bit
| above - I mean '"no accent" American accent' - because there
| are plenty of 'American accent' types that I would want
| removed by a magic earpiece to make it easier for me to
| understand and learn.
| keerthiko wrote:
| I appreciate the thoughtful reply. I don't think you're rude,
| and I get what you're saying as someone who thinks a lot
| about accents and languages. However, I still think you
| missed my point.
|
| There is no "no accent". An accent is a baseline feature of
| intelligible human speech, like a voice, or a volume, or a
| language. You can't say stuff without those features. When
| you say "the Chicago accent", or the "Midwest accent", that's
| an accent! Not "no accent".
| | I understand it's common usage to refer to the default | "radio accent" as "no accent", but in a country like | America, all kinds of people with all kinds of accents | speak English. Reinforcing an expectation that a certain | (usu. majority-white-spoken) one is the "default" by | referring to it as "no accent", implicitly suggests all | others are erroneous affectations, even if I trust that | is not your personal intent. | | All that said, I think your idea for a translation device | capable of revocalizing what is said with an unfamiliar | accent into one you are used to is not a bad one, and | likely easier than translating between languages while | retaining expressiveness. | TheHumanist wrote: | Babel Fish | dimitrios1 wrote: | Another lesson we can learn from Sci-Fi is very often different | species on a planet would have their tribal / local languages | and dialects but all spoke a common tongue. I think this is the | more humanizing approach, rather than delegate even more of our | fleshly processing power to machines. | somewhereoutth wrote: | This seems to be what is happening in Europe (and perhaps | more generally across the globe), with English being the | common tongue. | | Question is, what will happen to the tribal / local | languages? Will they survive? | Cthulhu_ wrote: | It varies. A lot of local languages have gone extinct | already. There's linguists hard at work to try and document | / record dying languages, but it won't be the same as | living the language from childhood. | micromacrofoot wrote: | then of course, there's always Darmok and Jalad at Tanagra | rangestransform wrote: | how am i supposed to talk shit with my friends about other | people in public then | flanbiscuit wrote: | I'm curious to know how well these models can pick up slang. | Maybe if you talk shit in as thick a slang as you can it | won't be able to give a good enough translation. 
| kredd wrote:
| With my bi/trilingual friends who speak the same languages,
| we intermix them to make our point more clear. I don't think
| models will be good enough for mixes for a few more years, so
| we're safe!
| smcin wrote:
| Can you show us an example of such a sentence?
| kredd wrote:
| Hm, think of things like "On va bruncher" (we're going to
| brunch). The word "brunch" doesn't exist in French, but we
| add suffixes to fit it into the sentence. Very common in
| Montreal. My French isn't good enough to do that on the fly,
| but my francophone friends do it all the time.
|
| In my other languages, the ones I am actually fluent in, it's
| kinda the same -- you use specific suffixes to soften or
| embolden your point and so on. Maybe add exclamations and
| filler sounds from a specific language too. Eventually your
| nouns and verbs end up in different languages, with different
| suffixes where it "makes sense", yet the person whom you're
| talking to will "get it".
|
| Would be curious to try the new Seamless model on such
| speech.
| bertil wrote:
| This is extremely common for every new technology: "upload,"
| "download," "stream," "google," "FaceTime," most code
| patterns, all the new ML apps, "venmo" or whatever the name
| of the app you use for payment, etc. All of those are taken
| as is, slapped with a verb ending, and it's good enough.
| That's true in German, Danish, Dutch, French, Italian, and
| Spanish.
|
| The only thing that doesn't work is if you talk to people too
| young to remember Skype. Then you feel old.
| dontupvoteme wrote:
| I'd love to see a map of how it matches up to regional
| English/British accents and their slang.
| fasquoika wrote:
| Reinventing Polari is certainly one way to make yourself less
| understood...
| ugh123 wrote:
| learn Klingon?
| bertil wrote:
| Klingon is definitely going to be in the top 50 languages
| covered...
| csa wrote:
| Speak in metaphor and/or code.
|
| I've been in mixed language communities in which I wasn't
| sure who spoke what, and I have found this to be quite
| effective when done right.
|
| Good time to reference the ST:TNG "Darmok" episode and quotes
| like "Darmok and Jalad at Tanagra".
| buryat wrote:
| get better at doublespeak
| https://en.wikipedia.org/wiki/Doublespeak
| baby wrote:
| I'm wearing the Ray-Ban Meta right now and they are already
| mind-blowing; I can already talk to that Meta AI assistant
| seamlessly. I bet one of the future iterations will have
| exactly this.
| figers wrote:
| Curious, what do you ask it besides take a picture / video or
| what's the weather?
|
| I have a pair and have only asked it that so far...
| diob wrote:
| The problem is you need a full sentence, plus surrounding
| sentences, to properly translate a lot of things (aka context
| matters).
|
| So no matter what, the translation would have to lag behind
| the original speech.
| ChuckMcM wrote:
| I think I could adapt to that. But it would be an interesting
| experiment.
| ItsMattyG wrote:
| My understanding is that they trained a separate model to
| specifically estimate when they have enough context to begin
| translating, as a skilled translator would.
| DigiDigiorno wrote:
| Even the native original version needs the proper context.
| Sometimes you need the entire sentence to figure out what the
| sentence was really about.
|
| I'm reminded of Mark Twain complaining about verbs arriving
| at the very end of sentences in German (among a myriad of
| other complaints):
|
| "The Awful German Language" - Mark Twain
| https://faculty.georgetown.edu/jod/texts/twain.german.html
| scotty79 wrote:
| Sometimes you even need a second sentence or even a few to
| understand what the first sentence was about.
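[Editor's note: the "wait for enough context, then emit" behavior discussed in this subthread is what the simultaneous-translation literature calls a read/write policy. SeamlessStreaming learns its policy from data; the fixed "wait-k" schedule below is only a toy sketch of the idea, and all names in it are illustrative, not Meta's API.]

```python
# Toy sketch of a fixed "wait-k" read/write schedule for
# simultaneous translation: read k source tokens before emitting
# the first target token, then alternate reads and writes.
# (SeamlessStreaming uses a *learned* policy rather than a fixed
# k; this is purely for intuition.)

def wait_k_steps(n_source, n_target, k):
    """Return a list of ('read', i) / ('write', j) actions for a
    wait-k schedule over n_source input and n_target output
    tokens."""
    actions = []
    read = 0
    for j in range(n_target):
        # Target token j may be emitted once min(j + k, n_source)
        # source tokens have been consumed.
        while read < min(j + k, n_source):
            actions.append(("read", read))
            read += 1
        actions.append(("write", j))
    return actions

# With k=2 the output lags the speaker by two tokens at first,
# then proceeds roughly token-for-token.
schedule = wait_k_steps(n_source=5, n_target=5, k=2)
```

A larger k trades latency for context, which is exactly the tension the comments above describe for verb-final source languages.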
| sexy_seedbox wrote:
| So then we need something like Neuralink to get the whole
| thought from one's brain first; then the sentences are
| processed properly for the context, then translated before
| the speech is delivered.
| freetanga wrote:
| What most people have to say is not that interesting, and
| tech won't change that
| btbuildem wrote:
| The near-realtime aspect of this is so promising -- we're
| getting closer and closer to an IRL babelfish!
|
| What I would love to see is an ability to add my own voice
| (yes, at the risk of deepfakes) so that the model could
| "speak" in any language and sound more like me, not some
| random voice actor it was trained on.
| gagabity wrote:
| Can this do speech-to-text English -> English? I get strange
| results if I do a translation to the same language. It would
| be an interesting alternative to Whisper if it could.
| I_am_tiberius wrote:
| I hope all these AI products will have privacy-focused
| alternatives quicker than when web2 happened.
| mkagenius wrote:
| Yet again, Hindi (the major language in India) is not even in
| the samples. India is the largest user base of facebook (and
| probably 1/3rd of the engineers working there are Indians),
| but facebook never puts enough effort into contributing back.
| It only uses the DAU from India in investor calls.
| cafed00d wrote:
| By "samples" do you mean examples on the marketing/landing
| page? It sure looks like the model supports many major Indian
| languages like Telugu, Tamil & Kannada.
| https://huggingface.co/facebook/seamless-m4t-v2-large
|
| Yeah, I kinda agree with the spirit of your comment; it sure
| would be nice to see a major Indian language like Telugu on
| their landing page for sure. But that's just my Indian-person
| bias speaking.
| mkagenius wrote:
| The lack of focus shows up in the results. The models never
| perform as well on Indian languages as they do on French or
| Spanish. This goes for Google, too.
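[Editor's note: for reference, the facebook/seamless-m4t-v2-large checkpoint linked above can also be driven through the Hugging Face transformers library. The sketch below shows text-to-text use only; the class and parameter names follow the transformers documentation for SeamlessM4T v2, the language codes are Seamless's three-letter ones, and the imports are deferred inside the function because the dependency and its multi-gigabyte checkpoint are heavyweight.]

```python
def translate_text(text, src_lang="eng", tgt_lang="fra",
                   model_id="facebook/seamless-m4t-v2-large"):
    """Text-to-text translation with a SeamlessM4T v2 checkpoint.

    Imports are deferred so that transformers/torch and the large
    model download only happen when the function is called.
    """
    from transformers import AutoProcessor, SeamlessM4Tv2ForTextToText

    processor = AutoProcessor.from_pretrained(model_id)
    model = SeamlessM4Tv2ForTextToText.from_pretrained(model_id)

    inputs = processor(text=text, src_lang=src_lang, return_tensors="pt")
    tokens = model.generate(**inputs, tgt_lang=tgt_lang)
    return processor.decode(tokens[0], skip_special_tokens=True)
```

Feeding Whisper transcripts into a function like this is one way to approximate the subtitle pipeline described elsewhere in the thread, at the cost of losing the speaker's voice and prosody.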
| gorbypark wrote:
| I've been trying (and mostly failing) at setting up a
| pipeline to get system audio into Whisper and feed that
| transcription into a Seamless M4T text-to-text translation
| model. It seems like Seamless streaming is going to solve
| most of my issues, and should significantly reduce latency!
|
| My ultimate goal is to have realtime translations of video
| conferences. I've moved to a new country, and while I'm super
| privileged that most of my colleagues speak English, we still
| have a number of "all hands" meetings that I get lost in
| pretty easily.
| xnx wrote:
| This tech from Google seems similar, but doesn't have a fancy
| demo: https://blog.research.google/2023/12/unsupervised-speech-
| to-...
| jwineinger wrote:
| Any ideas on what kind of hardware this would require to run
| S2ST?
| gloyoyo wrote:
| This is so world changing! Exactly how I wanted to speak so
| confidently!
|
| Thank you Meta!
| mightytravels wrote:
| I like how easy it is to get going, but you need to download
| about 20GB, and S2ST needs 40GB of GPU RAM!
|
| It runs, but for any audio input I tried (20s/40s/300s clips;
| you need to provide wav, not mp3), I get just one short
| sentence back in the target language that seems not related
| at all to my audio input (i.e. "Tous les humains sont crees
| egaux" - "All humans are created equal"). Seems like some
| default text, but it runs on full GPU for 10 minutes. Tons of
| bug reports on GitHub as well.
|
| Text translation works, but I'm not sure what the context
| length of the model is. It seems short at first glance
| (I haven't looked into it).
|
| Oh, and why is Whisper a dependency? Seems unnecessary if FB
| has their own model.
| novok wrote:
| I wonder how well this will perform for automatic comics
| translation. Current local models are pretty bad.
| MagicMoonlight wrote:
| > Automatically filters out toxic speech
| > Watermarking
|
| So it can't be trusted at all then
| quickthrower2 wrote:
| How did that page get camera access without my permission?
| | Edit: by the upvote I guess it wasn't just me? | rammer wrote: | Marketing has been heavily involved in this page...there's at | least one coloured person for every white photo.. | asylteltine wrote: | It really sucks that a company so irresponsible with all your | data is one of the leading AI companies now. | bozhark wrote: | I want this as a channel in our discord. | | Would allow more interactions of people that don't speak the same | language ___________________________________________________________________ (page generated 2023-12-01 23:00 UTC)