[HN Gopher] Common Voice
___________________________________________________________________

Common Voice

Author : oblib
Score  : 174 points
Date   : 2023-12-05 16:13 UTC (6 hours ago)

(HTM) web link (commonvoice.mozilla.org)
(TXT) w3m dump (commonvoice.mozilla.org)

| rwmj wrote:
| I wish they'd concentrate on the browser.

| dingnuts wrote:
| voice integration in a browser for control and feedback would
| be great if you were blind

| culi wrote:
| And text-to-speech. Which is already a standard:
| https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...
|
| The web is in a hilarious state where it's harder to style an
| option in a drop down than it is to generate speech from some
| text

| dragonwriter wrote:
| Better in the DE than an app, even the browser, unless it's
| like ChromeOS and the browser is the DE.

| joomooru wrote:
| Accessibility is an important part of the browser :)

| OfSanguineFire wrote:
| Mycroft users really wished that Mozilla had kept up efforts in
| this direction, because otherwise the only option for reliable
| speech-to-text is uploading every command you give your agent
| to Google or Baidu. The browser is important, and I don't
| support Mozilla's vacuous projects for social-justice cred, but
| there are a handful of areas where we need some non-profit to
| provide a privacy-respecting solution.

| rwmj wrote:
| That is indeed important, so I take it back (can't edit
| original post now).

| user_7832 wrote:
| Didn't Mozilla also have related speech-to-text software that
| got canned/moved to a different company? Or was that different?

| salynchnew wrote:
| DeepSpeech? https://github.com/mozilla/DeepSpeech

| posguy wrote:
| Mozilla didn't want to fund further development; most of the
| team ended up at Coqui.ai

| rasz wrote:
| Mozilla shut that project down the same day (Apr 12, 2021) as:
| "Mozilla is partnering with NVIDIA, which is investing $1.5
| million in Mozilla Common Voice,".
| Aka they got paid off by Nvidia to not compete.

| bitvoid wrote:
| This is an open dataset of voice samples to train models, so
| not really STT/TTS software.

| sxp wrote:
| FF's TTS is an important project for anyone who wants a
| trivial-to-use text-to-speech system. It's built into the browser
| so you can just run
|
|     {
|       // Block scope keeps the snippet re-runnable in the console.
|       const wss = window.speechSynthesis;
|       // getVoices() can return an empty list until voices load.
|       const voices = wss.getVoices();
|       for (let i = 0; i < voices.length; ++i) {
|         const str = `Voice ${i} is ${voices[i].name}`;
|         const s = new SpeechSynthesisUtterance(str);
|         s.voice = voices[i];
|         wss.speak(s);
|         console.log(str);
|       }
|     }
|
| in the console to get various TTS examples. For some browsers,
| this can be done offline while others use a cloud based TTS
| system.

| j45 wrote:
| This is handy to know, thanks. I was just trying out Common
| Voice a few days ago.
|
| They have a good example of a community page for folks wanting
| to help with a particular language.
|
| I was just thinking today that Firefox is worthy of switching
| back to because it was so fast, except I hadn't had a chance to
| do it.
|
| If anyone else thinks it's important for there to be an
| independent browser dedicated to privacy and security (and
| independence), they could as many casual browser switchers. I'm
| happy to be back on a few FF extensions that didn't work quite
| the same on any Chrome-based browser.

| vlod wrote:
| This also works in Chrome (My version is: 119.0.6045.199)
|
| FF has 8611 voices, Chrome has 19.

| joshstrange wrote:
| That's odd, my Chrome (119.0.6045.199) has 176 voices. Not
| all are English though.

| vlod wrote:
| Maybe it's because I'm on Linux? (Pop!_OS 22.04 LTS)
|
| Also I have 3 English only.

| rollcat wrote:
| On macOS, it's
|
|     say "enter text here"
|
| To pick a different voice:
|
|     say -v Fred "enter text here"
|
| To list voices:
|
|     say -v "?"
|
| (The quoting is necessary to prevent ZSH from interpreting the
| question mark as a glob.)
|
| I hear Firefox's TTS is important, yet prior to your comment I
| didn't even know it existed.
| This sort of stuff should be more discoverable, and have a more
| accessible (ahem) API.

| fzzzy wrote:
| It's part of the web APIs, it's not just Firefox. Chrome and
| Safari have supported it since 2013/2014.

| marcellus23 wrote:
| It looks like speechSynthesis is supported in all the major
| browsers, not just FF.
| https://developer.mozilla.org/en-US/docs/Web/API/Window/spee...

| dan-robertson wrote:
| Do you know if it's been extracted into a standalone library?
| The state of open source TTS seems to not be great.
| Presumably the data for a voice is harder to put together than
| training a speech-to-text system like Whisper.

| miki123211 wrote:
| The voices don't come from the browsers themselves, but from
| operating systems and their underlying TTS APIs: SAPI on
| Windows, Speech Dispatcher on Linux and AVSpeechSynthesizer
| on Apple devices. If you install a third-party voice
| compatible with one of these, the browsers will pick that up.

| amelius wrote:
| Is there a handy demo website somewhere to access that?

| imjonse wrote:
| While this dataset is orders of magnitude smaller than what
| recent speech models like Whisper and Seamless got trained on,
| and while it is meant for supervised as opposed to
| self-supervised learning where data is more abundant, it can
| still be useful for finetuning an existing model to improve its
| score on a specific language.

| skrebbel wrote:
| I'm sad that this is English only. I'd love to contribute lots
| of voice for a Dutch TTS from a nonprofit org like Mozilla

| meepmorp wrote:
| They do collect other languages - there's a setting for it in
| the annotation section, and the dataset downloads let you
| choose other languages.
|
| e.g.: https://commonvoice.mozilla.org/nl/listen

| skrebbel wrote:
| Woops! Thanks :-)

| meepmorp wrote:
| Don't feel bad - it's not especially obvious. I only
| thought about it because I'm already familiar with the
| project.

| dabinat wrote:
| Although English is the most-contributed language, one of the
| goals of Common Voice is to support languages that wouldn't
| normally receive attention from commercial providers.

| yorwba wrote:
| The most-contributed language is Catalan with 3678 hours
| recorded vs. 3395 hours in English
| https://commonvoice.mozilla.org/en/languages (The language
| list sorts your browser's UI languages ahead of all others,
| which is why English may appear on top for you.)

| zerotolerance wrote:
| https://commonvoice.mozilla.org/en/about?tab=how-add-languag...

| dang wrote:
| Related. Others?
|
| _Mozilla Common Voice Adds 16 New Languages and 4,600 New Hours
| of Speech_ - https://news.ycombinator.com/item?id=28073016 - Aug
| 2021 (170 comments)
|
| _Firefox Voice_ - https://news.ycombinator.com/item?id=24096082
| - Aug 2020 (154 comments)
|
| _Firefox Voice: Browse the web with your voice_ -
| https://news.ycombinator.com/item?id=23902560 - July 2020 (2
| comments)
|
| _Mozilla Common Voice Dataset: More data, more languages_ -
| https://news.ycombinator.com/item?id=23695377 - June 2020 (41
| comments)
|
| _The Common Voice Project by Mozilla reached its first goal: 1k
| hours in englisch_ -
| https://news.ycombinator.com/item?id=23051756 - May 2020 (1
| comment)
|
| _Common Voice: A Massively-Multilingual Speech Corpus_ -
| https://news.ycombinator.com/item?id=21887693 - Dec 2019 (9
| comments)
|
| _Common Voice - Mozilla's initiative to help teach machines how
| real people speak_ -
| https://news.ycombinator.com/item?id=21268579 - Oct 2019 (49
| comments)
|
| _Mozilla releases the largest to-date public domain transcribed
| voice dataset_ - https://news.ycombinator.com/item?id=19270646 -
| Feb 2019 (61 comments)
|
| _Mozilla Overhauls Speech-To-Text Contribution Interface_ -
| https://news.ycombinator.com/item?id=17436958 - July 2018 (42
| comments)
|
| _Initial Release of Mozilla's Open Source Speech Recognition
| Model and Voice Data_ -
| https://news.ycombinator.com/item?id=15808124 - Nov 2017 (88
| comments)
|
| _Project Common Voice_ -
| https://news.ycombinator.com/item?id=14794654 - July 2017 (57
| comments)
|
| _Mozilla: Project Common Voice_ -
| https://news.ycombinator.com/item?id=14786881 - July 2017 (1
| comment)

| vidarh wrote:
| I submitted a request for Norwegian Bokmål, and realised a
| complication which I'm sure must affect other languages too:
|
| Norway has two separate official languages. They are unusually
| close - one is relatively close to Danish, and the other started
| as a collection of dialects, but technically they are written
| languages, _especially Bokmål_ which basically means "book
| language".
|
| I'm _unusual_ in that I speak close to "pure" Bokmål. Thanks to
| expectations at school etc., a lot of speakers who write Bokmål
| will adjust or tone down their dialect if asked to read a text
| that is written in grammatically and orthographically correct
| Bokmål, but will otherwise speak in a manner that can deviate
| fairly significantly from the written language.
|
| As such, depending on whether your goal is text to speech or
| speech recognition, the pronunciation you will need is very
| different.
|
| E.g. people I know who write Bokmål might _say_ something like
| "hva erredu ser på a?" ("what are you looking at?") with hardly
| any gaps between words, while I would stick close to the written
| "hva er det du ser på?" with clear gaps. In recognition you need
| to handle both (and many other variations), while for generation
| you'd at least by default usually want the latter unless there
| are indications the text is written in dialect.
|
| It strikes me you'd _really_ want people to write more detail
| about what it is they are speaking and/or let people tag/label
| data with additional info about accents. Not just for this, but
| for other multi-lingual speakers as well. E.g.
| it'd be helpful to have many foreign accents in the English (and
| other languages) dataset for recognition, but as much as I want
| speech recognition to understand me, I'm not particularly
| interested in teaching it to speak English with a strong
| Norwegian accent.
|
| That is _less_ of an issue than the dialects in some languages
| that can involve much more than just speaking the same words
| differently.
|
| To take another example, "Jeg åpnet døren og gikk ut i solen" and
| "Jeg åpna døra og gikk ut i sola" are both valid Bokmål.
| Depending on _context_ a reader may stick strictly to the text or
| swap åpnet<->åpna, døren<->døra, solen<->sola, and _every
| permutation is valid_. Which exact set you use differs and some
| speakers will write one but use the other when speaking. E.g. I
| would _say_ åpna, døra, sola, but write åpnet, døren, solen. The
| latter is more formal and/or old-fashioned in some parts of the
| country, but the perception of that also varies by region. And
| this totally leaves out all the dialect variations used by people
| who'd say their language is Bokmål, and would be recognized as
| such by Norwegian speakers, but who use variants of words or
| conjugations that aren't technically recognized as valid Bokmål.
|
| The former is more "modern" (several of the forms are only valid
| Bokmål as a result of successive language reforms), more common
| in the Eastern part of Norway outside of the posher parts of Oslo
| and other wealthy regions, and (weirdly) more common in 1970's
| radical left-wing academics (especially people involved with the
| Maoist Workers Communist Party/AKP-ML) as an
| affectation/sociolect, with each of these groups also deviating
| in other aspects....
|
| If you want to maximize the utility of a dataset like this, you
| _really_ would want to let each speaker at least assign a lot of
| tags/labels to their profile; even if you don't want to deal
| with the hornet's nest of trying to figure out all the
| distinctions, even unstructured labels would be a start, and
| ideally allowing people to tag individual recordings as well,
| because there are a _lot_ more variations than just "language"
| and "accent" here.

| indigo945 wrote:
| This is a great argument.
|
| I particularly agree with your point regarding English - my
| German accent sounds jarring to probably most native English
| speakers, but it should still be understood. To add to your
| argument, I have sometimes tried to turn on subtitles for
| Youtube videos in some accent of English that I haven't had
| much contact with (such as Nigerian English), but the
| auto-generated closed captions turned out to be even more
| useless than my own comprehension.
|
| However, one should keep in mind that Mozilla's main goal here
| is accessibility, with the implication that they mean
| accessibility for blind and deaf people in particular - as
| opposed to accessibility for stunted multilinguals like us. For
| these purposes, being able to transcribe mainly mainstream uses
| of the language is fine, and so is being able to generate
| speech in a hodge-podge averaged dialect. I highly doubt most
| blind people care about whether their TTS engine speaks The
| Queen's English or not, as long as it is clear and
| understandable.

| vidarh wrote:
| What is "clear and understandable" varies greatly, though.
| E.g. Nigerian English is often subtitled in the UK, but
| fairly often so is Scottish English... Both often to the
| great dismay of speakers of the two who sometimes are very
| annoyed at the expectation that people might not understand
| them.
|
| Nigerian English is actually fascinating in that there's a
| whole spectrum from Nigerian Pidgin, which ranges from nearly
| unintelligible to English speakers, to "mostly British
| English" in terms of orthography and grammar, but which still
| tends to incorporate words from several different Nigerian
| languages and pidgin. (e.g. abeg, don't give me any wahala;
| Please, don't give me any trouble)
|
| Now consider Nigeria is about to become the country with the
| second largest number of English speakers worldwide (it's
| close to tied with India, depending which sources and level
| of proficiency you consider, and Nigeria's population is
| growing far faster than India's), and while it's still quite
| far behind the UK for people speaking it as their _first_
| language, current population growth and increasing use
| of English (e.g. my ex-wife's first language is English
| because her parents' first languages were Igbo and Yoruba, and
| that kind of situation is driving adoption) are likely to cause
| Nigeria to become the second largest on that measure as well.
|
| So handling a broader range of dialects will matter, at least
| in terms of recognition - I do agree that there's _more_
| flexibility for generation, though even there if you try to feed
| a broader Nigerian English pidgin to a TTS engine and it
| doesn't know what to do with the words it might well end up
| being unintelligible both to e.g. American or British English
| speakers and Nigerian English speakers.

| OfSanguineFire wrote:
| Are you autistic? I ask because this is HN where lots of people
| are, and choosing to speak the literary norm in countries with
| diglossia is often associated with autism. For example,
| foreigners in Finland are urged to quickly get to grips with
| _puhekieli_ (spoken Finnish) because speaking _kirjakieli_ (the
| literary norm) in everyday contexts, or writing it in chats, is
| "something only autistic people do".

| vidarh wrote:
| Not to my knowledge, though I may have some traits.
|
| That said, in Norway the literary form is/was spoken on e.g.
| TV and radio similar to how RP (received pronunciation)
| is/was spoken on the BBC, more so (in both cases) before than
| now where dialects are more broadly tolerated. On top of
| that, in affluent areas of Western Oslo and adjoining
| affluent areas the dialect sits mostly within what is
| "allowed" in Bokmål, and actually mostly towards a more
| conservative end of the allowed range than where I sit, and
| it's somewhat political, in that more conservative forms of
| Bokmål historically tended to be associated with social
| status (or aspirations...).
|
| It's unusual more in that the pockets and social groups where
| dialects overlap fully or almost entirely with Bokmål
| are fairly small.
|
| My spoken dialect is within that spectrum, exacerbated by
| reading _a lot_ of older literature at an early age that used
| quite old-fashioned forms of Bokmål, and picking up more
| formal language than many of my peers spoke through that, but
| I tend to be closer to the more affluent dialect in writing
| than spoken.
|
| (EDIT: My spoken dialect would probably fit as a somewhat
| "posh" version of Urban East Norwegian[1] today, with
| somewhat more conservative word choices in places where
| contemporary Urban East Norwegian would have deviated from
| Bokmål in minor ways in the 70's and 80's by being somewhat
| more "relaxed" in ways that have since been accepted in
| subsequent adjustments of the rules)
|
| If you heard me alongside my dad there'd be relatively minor
| differences between our dialects, and I'd probably sound
| marginally less formal as I adopted some spoken patterns from
| the more working class area I grew up in outside Oslo, while
| he at least when younger would be recognisable as having
| grown up on the Western edges of Oslo.
|
| Beyond that, language has always fascinated me, and I tended
| to take a certain level of delight in torturing my Norwegian
| teacher who favoured the other official language - Nynorsk.
| Nynorsk and Bokmål overlap very significantly, and more so
| after recent language reforms which have tended towards
| allowing more Nynorsk forms of words, or ones closer to them,
| in Bokmål. Our Norwegian teacher very much wanted us to use
| those forms (that'd be favouring "sola" over "solen" etc.),
| and I used to express my distaste for Nynorsk by instead
| exaggerating my preference for the more conservative Bokmål
| forms.
|
| [1] https://en.wikipedia.org/wiki/Urban_East_Norwegian

| CoBE10 wrote:
| I'd like to give a shout-out to Common Voice Android:
| https://github.com/Sav22999/common-voice-android
|
| It's a handy app for those interested in contributing to the
| project. You can record voices for the languages you speak and
| validate other user contributions. I used to be a frequent
| contributor about two years ago, and this app had a much more
| user-friendly design compared to the official website version.
|
| Additionally, check out the official Common Voice Matrix channel:
| https://chat.mozilla.org/#/room/#common-voice:mozilla.org

| jeena wrote:
| Why then is the text2speech in reader mode (which other than that
| is excellent) on a Linux Firefox so extremely bad? Much worse
| than Stephen Hawking's text2speech.

| spadufed wrote:
| Crowdsourced datasets like this and the ones produced by the
| OpenAssistant project could easily become the ONLY way to build
| foundational models if the courts decide that what OpenAI and co
| are doing is not Fair-Use. I don't think I would call this
| scenario unlikely, either.

| pimlottc wrote:
| With recent events in AI and deepfake technology, I would need to
| see some assurances before I agreed to "donate my voice" to
| something like this.
| It seems like the project is for voice recognition, not
| generation, but it's not immediately clear.

| thih9 wrote:
| What assurances would you like to see?

| moron4hire wrote:
| > Voice datasets also underrepresent: non-English speakers,
| people of colour, disabled people, women and LGBTQIA+ people.
|
| How does being gay change your voice?

| pseudalopex wrote:
| https://en.wikipedia.org/wiki/LGBT_linguistics#Accents_of_En...

| moron4hire wrote:
| I'm aware of the trope. I've yet to meet anyone that adheres
| to it, though. Always thought it was just one of those things
| that Hollywood overemphasizes to "other" gay people.

| pseudalopex wrote:
| > Always thought it was just one of those things that
| Hollywood overemphasizes to "other" gay people.
|
| Did you think they overemphasized it or did you think they
| made it up?
___________________________________________________________________
(page generated 2023-12-05 23:00 UTC)