[HN Gopher] Common Voice
       ___________________________________________________________________
        
       Common Voice
        
       Author : oblib
       Score  : 174 points
       Date   : 2023-12-05 16:13 UTC (6 hours ago)
        
 (HTM) web link (commonvoice.mozilla.org)
 (TXT) w3m dump (commonvoice.mozilla.org)
        
       | rwmj wrote:
       | I wish they'd concentrate on the browser.
        
         | dingnuts wrote:
         | voice integration in a browser for control and feedback would
         | be great if you were blind
        
           | culi wrote:
           | And text-to-speech. Which is already a standard:
           | https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...
           | 
           | The web is in a hilarious state where it's harder to style an
           | option in a drop down than it is to generate speech from some
           | text
        
           | dragonwriter wrote:
           | Better in the DE than an app, even the browser, unless it's
           | like ChromeOS and the browser is the DE.
        
         | joomooru wrote:
         | Accessibility is an important part of the browser :)
        
         | OfSanguineFire wrote:
         | Mycroft users really wished that Mozilla had kept up efforts in
         | this direction, because otherwise the only option for reliable
         | speech-to-text is uploading every command you give your agent
         | to Google or Baidu. The browser is important, and I don't
         | support Mozilla's vacuous projects for social-justice cred, but
         | there are a handful of areas where we need some non-profit to
         | provide a privacy-respecting solution.
        
           | rwmj wrote:
           | That is indeed important, so I take it back (can't edit
           | original post now).
        
       | user_7832 wrote:
       | Didn't Mozilla also have related speech-to-text software that
       | got canned/moved to a different company? Or was that different?
        
         | salynchnew wrote:
         | DeepSpeech? https://github.com/mozilla/DeepSpeech
        
           | posguy wrote:
           | Mozilla didn't want to fund further development, most of the
           | team ended up at Coqui.ai
        
             | rasz wrote:
             | Mozilla shut that project down same day (Apr 12, 2021) as:
             | "Mozilla is partnering with NVIDIA, which is investing $1.5
             | million in Mozilla Common Voice,". Aka they got paid off by
             | Nvidia to not compete.
        
         | bitvoid wrote:
         | This is an open dataset of voice samples to train models, so
         | not really STT/TTS software.
        
       | sxp wrote:
       | FF's TTS is an important project for anyone who wants a trivial
       | to use text-to-speech system. It's built into the browser so you
       | can just run
       |
       |     wss = window.speechSynthesis;
       |     for (let i = 0; i < wss.getVoices().length; ++i) {
       |       str = `Voice ${i} is ${wss.getVoices()[i].name}`;
       |       s = new SpeechSynthesisUtterance(str);
       |       s.voice = wss.getVoices()[i];
       |       wss.speak(s);
       |       console.log(str);
       |     }
       |
       | in the console to get various TTS examples. For some browsers,
       | this can be done offline while others use a cloud based TTS
       | system.
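
The offline versus cloud distinction mentioned above is visible per voice: each `SpeechSynthesisVoice` object carries a boolean `localService` property. A minimal sketch, assuming the voice list has already loaded; the helper name `splitByLocality` is made up for illustration and is not part of any browser API:

```javascript
// Splits a voice list into on-device and remote voices using the
// `localService` flag found on SpeechSynthesisVoice objects.
// `splitByLocality` is a hypothetical helper name.
function splitByLocality(voices) {
  const local = [];
  const remote = [];
  for (const voice of voices) {
    (voice.localService ? local : remote).push(voice.name);
  }
  return { local, remote };
}

// In a browser console (this line needs a browser, so it is left commented):
// console.log(splitByLocality(window.speechSynthesis.getVoices()));
```

Taking the list as a parameter keeps the helper independent of the browser, so it can be tried against plain objects as well.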
        
         | j45 wrote:
         | This is handy to know, thanks. I was just trying out Common
         | Voice a few days ago.
         | 
         | They have a good example of a community page for folks wanting
         | to help with a particular language.
         | 
         | I was just thinking today that Firefox is worthy of switching
         | back to because it was so fast, except I hadn't had a chance to
         | do it.
         | 
         | If anyone else thinks it's important for there to be an
         | independent browser dedicated to privacy and security (and
         | independence), they could as many casual browser switchers. I'm
         | happy to be back on a few FF extensions that didn't work quite
         | the same on any Chrome-based browser.
        
         | vlod wrote:
         | This also works in Chrome (My version is: 119.0.6045.199)
         | 
         | FF has 8611 voices, Chrome has 19.
        
           | joshstrange wrote:
           | That's odd, my Chrome (119.0.6045.199) has 176 voices. Not
           | all are English though.
        
             | vlod wrote:
             | Maybe it's because I'm on Linux? (Pop!_OS 22.04 LTS)
             | 
             | Also I have 3 English only.
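
The differing counts above can be broken down by language, since every voice carries a `lang` tag. A rough sketch; `countVoicesByLanguage` is a hypothetical helper, and splitting on `_` as well as `-` is a defensive assumption rather than something the spec requires:

```javascript
// Counts voices per primary language subtag, e.g. "en" for "en-US".
// `countVoicesByLanguage` is a hypothetical helper name.
function countVoicesByLanguage(voices) {
  const counts = new Map();
  for (const voice of voices) {
    // `lang` is expected to be a BCP 47 tag; "_" is tolerated as a separator
    // and a missing tag is counted under "und" (undetermined).
    const primary = String(voice.lang || "und").split(/[-_]/)[0].toLowerCase();
    counts.set(primary, (counts.get(primary) || 0) + 1);
  }
  return counts;
}

// In a browser console (left commented because it needs a browser):
// console.table([...countVoicesByLanguage(window.speechSynthesis.getVoices())]);
```

Running this in both browsers on the same machine would show whether the gap is in English voices or in everything else.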
        
         | rollcat wrote:
         | On macOS, it's
         |
         |     say "enter text here"
         |
         | To pick a different voice:
         |
         |     say -v Fred "enter text here"
         |
         | To list voices:
         |
         |     say -v "?"
         | 
         | (The quoting is necessary to prevent ZSH from interpreting the
         | question mark as a glob.)
         | 
         | I hear Firefox's TTS is important, yet prior to your comment I
         | didn't even know it existed. This sort of stuff should be more
         | discoverable, and have a more accessible (ahem) API.
        
           | fzzzy wrote:
           | It's part of the web APIs; it's not just Firefox. Chrome and
           | Safari have supported it since 2013/2014.
        
         | marcellus23 wrote:
         | It looks like speechSynthesis is supported in all the major
         | browsers, not just FF.
         | https://developer.mozilla.org/en-US/docs/Web/API/Window/spee...
        
         | dan-robertson wrote:
         | Do you know if it's been extracted into a standalone library?
         | The state of the open source TTS seems to not be great.
         | Presumably the data for a voice is harder to put together than
         | training a speech to text system like whisper.
        
           | miki123211 wrote:
           | The voices don't come from the browsers themselves, but from
           | operating systems and their underlying TTS APIs, SAPI on
           | Windows, Speech Dispatcher on Linux and AVSpeechSynthesizer
           | on Apple Devices. If you install a third-party voice
           | compatible with one of these, the browsers will pick that up.
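
Because the voices come from the operating system, the list may not be ready at the moment a page or console snippet asks for it; browsers that load it asynchronously fire a `voiceschanged` event on the `speechSynthesis` object. A hedged sketch of waiting for that event; `onVoicesReady` is a hypothetical helper, and passing the synthesizer in as a parameter is an assumption made so the code can run against a stub:

```javascript
// Calls `callback` with the voice list once it is non-empty.
// `synth` is expected to behave like window.speechSynthesis.
// `onVoicesReady` is a hypothetical helper name.
function onVoicesReady(synth, callback) {
  const voices = synth.getVoices();
  if (voices.length > 0) {
    callback(voices);
    return;
  }
  // The list can be empty at first; wait for the browser to announce
  // that it has been populated, then read it again.
  synth.addEventListener(
    "voiceschanged",
    () => callback(synth.getVoices()),
    { once: true }
  );
}

// In a browser console (left commented because it needs a browser):
// onVoicesReady(window.speechSynthesis, (v) => console.log(`${v.length} voices`));
```

If a browser never fires the event, the callback is simply never called; a real page would probably want a timeout as well.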
        
         | amelius wrote:
         | Is there a handy demo website somewhere to access that?
        
       | imjonse wrote:
       | While this dataset is orders of magnitude smaller than what
       | recent speech models like Whisper and Seamless got trained on,
       | and while it is meant for supervised as opposed to self-
       | supervised learning where data is more abundant, it can still be
       | useful for finetuning an existing model for improving its score
       | on a specific language.
        
       | skrebbel wrote:
       | I'm sad that this is English only. I'd love to contribute lots
       | of voice for a Dutch TTS from a nonprofit org like Mozilla.
        
         | meepmorp wrote:
         | They do collect other languages - there's a setting for it in
         | the annotation section, and the dataset downloads let you
         | choose other languages.
         | 
         | e.g.: https://commonvoice.mozilla.org/nl/listen
        
           | skrebbel wrote:
           | Woops! Thanks :-)
        
             | meepmorp wrote:
             | Don't feel bad - it's not especially obvious. I only
             | thought about it because I'm already familiar with the
             | project.
        
           | dabinat wrote:
           | Although English is the most-contributed language, one of the
           | goals of Common Voice is to support languages that wouldn't
           | normally receive attention from commercial providers.
        
             | yorwba wrote:
             | The most-contributed language is Catalan with 3678 hours
             | recorded vs. 3395 hours in English
             | https://commonvoice.mozilla.org/en/languages (The language
             | list sorts your browser's UI languages ahead of all others,
             | which is why English may appear on top for you.)
        
         | zerotolerance wrote:
         | https://commonvoice.mozilla.org/en/about?tab=how-add-languag...
        
       | dang wrote:
       | Related. Others?
       | 
       |  _Mozilla Common Voice Adds 16 New Languages and 4,600 New Hours
       | of Speech_ - https://news.ycombinator.com/item?id=28073016 - Aug
       | 2021 (170 comments)
       | 
       |  _Firefox Voice_ - https://news.ycombinator.com/item?id=24096082
       | - Aug 2020 (154 comments)
       | 
       |  _Firefox Voice: Browse the web with your voice_ -
       | https://news.ycombinator.com/item?id=23902560 - July 2020 (2
       | comments)
       | 
       |  _Mozilla Common Voice Dataset: More data, more languages_ -
       | https://news.ycombinator.com/item?id=23695377 - June 2020 (41
       | comments)
       | 
       |  _The Common Voice Project by Mozilla reached its first goal: 1k
       | hours in englisch_ -
       | https://news.ycombinator.com/item?id=23051756 - May 2020 (1
       | comment)
       | 
       |  _Common Voice: A Massively-Multilingual Speech Corpus_ -
       | https://news.ycombinator.com/item?id=21887693 - Dec 2019 (9
       | comments)
       | 
       |  _Common Voice - Mozilla's initiative to help teach machines how
       | real people speak_ -
       | https://news.ycombinator.com/item?id=21268579 - Oct 2019 (49
       | comments)
       | 
       |  _Mozilla releases the largest to-date public domain transcribed
       | voice dataset_ - https://news.ycombinator.com/item?id=19270646 -
       | Feb 2019 (61 comments)
       | 
       |  _Mozilla Overhauls Speech-To-Text Contribution Interface_ -
       | https://news.ycombinator.com/item?id=17436958 - July 2018 (42
       | comments)
       | 
       |  _Initial Release of Mozilla's Open Source Speech Recognition
       | Model and Voice Data_ -
       | https://news.ycombinator.com/item?id=15808124 - Nov 2017 (88
       | comments)
       | 
       |  _Project Common Voice_ -
       | https://news.ycombinator.com/item?id=14794654 - July 2017 (57
       | comments)
       | 
       |  _Mozilla: Project Common Voice_ -
       | https://news.ycombinator.com/item?id=14786881 - July 2017 (1
       | comment)
        
       | vidarh wrote:
       | I submitted a request for Norwegian Bokmal, and realised a
       | complication which I'm sure must affect other languages too:
       | 
       | Norway has two separate official languages. They are unusually
       | close - one is relatively close to Danish, and the other started
       | as a collection of dialects, but technically they are written
       | languages, _especially Bokmal_ which basically means "book
       | language".
       | 
       | I'm _unusual_ in that I speak close to "pure" Bokmal. Thanks to
       | expectations at school etc., a lot of speakers who write Bokmal
       | will adjust or tone down their dialect if asked to read a text
       | that is written in grammatically and orthographically correct
       | Bokmal, but will otherwise speak in a manner that can deviate
       | fairly significantly from the written language.
       | 
       | As such, depending on whether your goal is text to speech or
       | speech recognition, the pronunciation you will need is very
       | different.
       | 
       | E.g. people I know who write Bokmal might _say_ something like
       | "hva erredu ser pa a?" ("what are you looking at?") with hardly
       | any gaps between words, while I would stick close to the written
       | "hva er det du ser pa?" with clear gaps. In recognition you need
       | to handle both (and many other variations), while for generation
       | you'd at least by default usually want the latter unless there
       | are indications the text is written in dialect.
       | 
       | It strikes me you'd _really_ want people to write more detail
       | about what it is they are speaking and/or let people tag/label
       | data with additional info about accents. Not just for this, but
       | for other multi-lingual speakers as well. E.g. it'd be helpful to
       | have many foreign accents in the English (and other languages)
       | dataset for recognition, but as much as I want speech recognition
       | to understand me, I'm not particularly interested in teaching it
       | to speak English with a strong Norwegian accent.
       | 
       | That is _less_ of an issue than the dialects in some languages
       | that can involve much more than just speaking the same words
       | differently.
       | 
       | To take another example, "Jeg apnet doren og gikk ut i solen" and
       | "Jeg apna dora og gikk ut i sola" are both valid Bokmal.
       | Depending on _context_ a reader may stick strictly to the text or
       | swap apnet<->apna, doren<->dora, solen<->sola, and _every
       | permutation is valid_. Which exact set you use differs and some
       | speakers will write one but use the other when speaking. E.g. I
       | would _say_ apna, dora, sola, but write apnet, doren, solen. The
       | latter is more formal and/or old-fashioned in some parts of the
       | country, but the perception of that also varies by region. And
       | this totally leaves out all the dialect variations used by people
       | who'd say their language is Bokmal, and would be recognized as
       | such by Norwegian speakers, but who use variants of words or
       | conjugations that aren't technically recognized as valid Bokmal.
       | 
       | The former is more "modern" (several of the forms are only valid
       | Bokmal as a result of successive language reforms), more common
       | in the Eastern part of Norway outside of the posher parts of Oslo
       | and other wealthy regions, and (weirdly) more common in 1970's
       | radical left-wing academics (especially people involved with the
       | Maoist Workers Communist Party/AKP-ML) as an
       | affectation/sociolect, with each of these groups also deviating
       | in other aspects....
       | 
       | If you want to maximize the utility of a dataset like this, you
       | _really_ would want to let each speaker at least assign a lot of
       | tags/labels to their profile; even if you don't want to deal
       | with the hornet's nest of trying to figure out all the
       | distinctions, even unstructured labels would be a start, and
       | ideally allowing people to tag individual recordings as well,
       | because there are a _lot_ more variations than just "language"
       | and "accent" here.
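
The tagging idea above can be sketched as a small data structure: free-form tags on each speaker profile, extra tags on individual recordings, and a filter that picks a different subset for recognition than for generation. Everything here (field names, tag strings, `selectRecordings`) is hypothetical and is not Common Voice's actual schema:

```javascript
// Hypothetical speaker profiles keyed by id, each with unstructured tags.
const speakers = {
  s1: { tags: ["nb", "urban-east-norwegian"] },
  s2: { tags: ["en", "accent:norwegian"] },
};

// Hypothetical recordings; per-recording tags add to the speaker's tags.
const recordings = [
  { id: "r1", speakerId: "s1", tags: ["read-as-written"] },
  { id: "r2", speakerId: "s1", tags: ["dialect-forms"] },
  { id: "r3", speakerId: "s2", tags: [] },
];

// Keeps recordings whose combined tags contain every `requireAll` tag
// and none of the `excludeAny` tags.
function selectRecordings(recs, profiles, { requireAll = [], excludeAny = [] } = {}) {
  return recs.filter((rec) => {
    const speakerTags = (profiles[rec.speakerId] || {}).tags || [];
    const tags = new Set([...speakerTags, ...(rec.tags || [])]);
    return (
      requireAll.every((t) => tags.has(t)) &&
      !excludeAny.some((t) => tags.has(t))
    );
  });
}

// Recognition wants every variant; generation wants only the written norm.
const forRecognition = selectRecordings(recordings, speakers, { requireAll: ["nb"] });
const forGeneration = selectRecordings(recordings, speakers, {
  requireAll: ["nb", "read-as-written"],
});
console.log(forRecognition.map((r) => r.id), forGeneration.map((r) => r.id));
```

Even unstructured tags like these let a consumer of the dataset exclude, say, Norwegian-accented English from a TTS training set while keeping it for recognition.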
        
         | indigo945 wrote:
         | This is a great argument.
         | 
         | I particularly agree with your point regarding English - my
         | German accent sounds jarring to probably most native English
         | speakers, but it should still be understood. To add to your
         | argument, I have sometimes tried to turn on subtitles for
         | Youtube videos in some accent of English that I haven't had
         | much contact with (such as Nigerian English), but the auto-
         | generated closed captions turned out to be even more useless
         | than my own comprehension.
         | 
         | However, one should keep in mind that Mozilla's main goal here
         | is accessibility, with the implication that they mean
         | accessibility for blind and deaf people in particular - as
         | opposed to accessibility for stunted multilinguals like us. For
         | these purposes, being able to transcribe mainly mainstream uses
         | of the language is fine, and so is being able to generate
         | speech in a hodge-podge averaged dialect. I highly doubt most
         | blind people care about whether their TTS engine speaks The
         | Queen's English or not, as long as it is clear and
         | understandable.
        
           | vidarh wrote:
           | What is "clear and understandable" varies greatly, though.
           | E.g. Nigerian English is often subtitled in the UK, but
           | fairly often so is Scottish English... Both often to the
           | great dismay of speakers of the two who sometimes are very
           | annoyed at the expectation that people might not understand
           | them.
           | 
           | Nigerian English is actually fascinating in that there's a
           | whole spectrum from Nigerian Pidgin, which ranges from nearly
           | unintelligible to English speakers, to "mostly British
           | English" in terms of orthography and grammar, but which still
           | tends to incorporate words from several different Nigerian
           | languages and pidgin. (e.g. abeg, don't give me any wahala;
           | Please, don't give me any trouble)
           | 
           | Now consider that Nigeria is about to become the country with
           | the second largest number of English speakers worldwide (it's
           | close to tied with India, depending on which sources and level
           | of proficiency you consider, and Nigeria's population is
           | growing far faster than India's). While it's still quite far
           | behind the UK for people speaking it as their _first_
           | language, current population growth and increasing use of
           | English (e.g. my ex-wife's first language is English because
           | her parents' first languages were Igbo and Yoruba, and that
           | kind of situation is driving adoption) are likely to cause
           | Nigeria to become the second largest on that measure as well.
           | 
           | So handling a broader range of dialects will matter, at least
           | in terms of recognition - I do agree that there's _more_
           | flexibility for generation, though even there if you try to
           | feed a broader Nigerian English pidgin to a TTS engine and it
           | doesn't know what to do with the words it might well end up
           | being unintelligible both to e.g. American or British English
           | speakers and Nigerian English speakers.
        
         | OfSanguineFire wrote:
         | Are you autistic? I ask because this is HN where lots of people
         | are, and choosing to speak the literary norm in countries with
         | diglossia is often associated with autism. For example,
         | foreigners in Finland are urged to quickly get to grips with
         | _puhekieli_ (spoken Finnish) because speaking _kirjakieli_ (the
         | literary norm) in everyday contexts, or writing it in chats, is
         | "something only autistic people do".
        
           | vidarh wrote:
           | Not to my knowledge, though I may have some traits.
           | 
           | That said, in Norway the literary form is/was spoken on e.g.
           | TV and radio similar to how RP (received pronunciation)
           | is/was spoken on the BBC, more so (in both cases) before than
           | now where dialects are more broadly tolerated. On top of
           | that, in affluent areas of Western Oslo and adjoining
           | affluent areas the dialect sits mostly within what is
           | "allowed" in Bokmal, and actually mostly towards a more
           | conservative end of the allowed range than where I sit, and
           | it's somewhat political, in that more conservative forms of
           | Bokmal historically tended to be associated with social
           | status (or aspirations...).
           | 
           | It's unusual more in that the pockets and social groups where
           | dialects overlap fully or almost entirely with Bokmal are
           | fairly small.
           | 
           | My spoken dialect is within that spectrum, exacerbated by
           | reading _a lot_ of older literature at an early age that used
           | quite old fashioned forms of Bokmal, and picking up more
           | formal language than many of my peers spoke through that, but
           | I tend to be closer to the more affluent dialect in writing
           | than spoken.
           | 
           | (EDIT: My spoken dialect would probably fit as a somewhat
           | "posh" version of Urban East Norwegian[1] today, with
           | somewhat more conservative word choices in places where
           | contemporary Urban East Norwegian would have deviated from
           | Bokmal in minor ways in the 70's and 80's by being somewhat
           | more "relaxed" in ways that have since been accepted in
           | subsequent adjustments of the rules)
           | 
           | If you heard me alongside my dad there'd be relatively minor
           | differences between our dialects, and I'd probably sound
           | marginally less formal as I adopted some spoken patterns from
           | the more working class area I grew up in outside Oslo, while
           | he at least when younger would be recognisable as having
           | grown up on the Western edges of Oslo.
           | 
           | Beyond that, language has always fascinated me, and I tended
           | to take a certain level of delight in torturing my Norwegian
           | teacher who favoured the other official language - Nynorsk.
           | Nynorsk and Bokmal overlap very significantly, and more so
           | after recent language reforms which have tended towards
           | allowing more Nynorsk forms of words, or ones closer to them,
           | in Bokmal. Our Norwegian teacher very much wanted us to use
           | those forms (that'd be favouring "sola" over "solen" etc.),
           | and I used to express my distaste for Nynorsk by instead
           | exaggerating my preference for the more conservative Bokmal
           | forms.
           | 
           | [1] https://en.wikipedia.org/wiki/Urban_East_Norwegian
        
       | CoBE10 wrote:
       | I'd like to give a shout-out to Common Voice Android:
       | https://github.com/Sav22999/common-voice-android
       | 
       | It's a handy app for those interested in contributing to the
       | project. You can record voices for the languages you speak and
       | validate other user contributions. I used to be a frequent
       | contributor about two years ago, and this app had a much more
       | user-friendly design compared to the official website version.
       | 
       | Additionally, check out the official Common Voice Matrix channel:
       | https://chat.mozilla.org/#/room/#common-voice:mozilla.org
        
       | jeena wrote:
       | Why then is the text2speech in reader mode (which other than that
       | is excellent) on a Linux Firefox so extremely bad? Much worse
       | than Stephen Hawking's text2speech.
        
       | spadufed wrote:
       | Crowdsourced datasets like this and the ones produced by the
       | OpenAssistant project could easily become the ONLY way to build
       | foundational models if the courts decide that what OpenAI and co
       | are doing is not Fair-Use. I don't think I would call this
       | scenario unlikely, either.
        
       | pimlottc wrote:
       | With recent events in AI and deepfake technology, I would need to
       | see some assurances before I agreed to "donate my voice" to
       | something like this. It seems like the project is for voice
       | recognition, not generation, but it's not immediately clear.
        
         | thih9 wrote:
         | What assurances would you like to see?
        
       | moron4hire wrote:
       | > Voice datasets also underrepresent: non-English speakers,
       | people of colour, disabled people, women and LGBTQIA+ people.
       | 
       | How does being gay change your voice?
        
         | pseudalopex wrote:
         | https://en.wikipedia.org/wiki/LGBT_linguistics#Accents_of_En...
        
           | moron4hire wrote:
           | I'm aware of the trope. I've yet to meet anyone that adheres
           | to it, though. Always thought it was just one of those things
           | that Hollywood overemphasizes to "other" gay people.
        
             | pseudalopex wrote:
             | > Always thought it was just one of those things that
             | Hollywood overemphasizes to "other" gay people.
             | 
             | Did you think they over emphasized it or did you think they
             | made it up?
        
       ___________________________________________________________________
       (page generated 2023-12-05 23:00 UTC)