[HN Gopher] Mozilla Common Voice Adds 16 New Languages and 4,600...
       ___________________________________________________________________
        
       Mozilla Common Voice Adds 16 New Languages and 4,600 New Hours of
       Speech
        
       Author : heyhillary
       Score  : 554 points
       Date   : 2021-08-05 12:54 UTC (10 hours ago)
        
 (HTM) web link (foundation.mozilla.org)
 (TXT) w3m dump (foundation.mozilla.org)
        
       | pkz wrote:
       | Openly licensed speech data for smaller languages is great! I
       | hope as many as possible contribute in order to get better
       | representation across ages and pronunciation. In the end, this
       | may be what is needed for the hyperscale companies to support
       | speech assistants in more languages?
        
       | dabinat wrote:
       | Common Voice is a great project that I'm glad Mozilla kept alive.
       | 
        | One problem is that data for speech recognition needs to be
        | extremely accurate (i.e. the speech matches the transcript
        | perfectly), but the human review process is fallible, and
        | quite a number of bad clips made it past the review process
        | (to be fair, Mozilla provides no official guidance to
        | reviewers or recorders).
       | 
       | Plus in the early days, they were recording the same small
       | sentence pool over and over again, so the first 700 hours or so
       | are duplicates.
       | 
       | I hope there will be efforts in the future to clean up the
       | existing dataset to improve its quality.
        
         | lunixbochs wrote:
         | I'm an ASR researcher shipping high quality English models
         | trained on limited resources, and while I've needed to include
         | other datasets to make the model more robust to different kinds
         | of text, Common Voice is a substantial part of my training
         | process. I did not do any manual cleanup. Most of my automated
         | cleanup was done with very basic (low quality) models. My
         | latest models trained this way are competitive with e.g. Google
         | or Apple English speech recognition accuracy.
         | 
         | I'm going to disagree that there's ultimately a need for
         | perfect training data in ASR. I'm sure it helps with some model
         | types and training processes, but it simply hasn't been a
         | factor in my use of Common Voice (English). I'll also note my
         | best model can hit around 10% WER on Common Voice Test without
         | any language model, which is better than any public numbers
         | I've seen posted for it so far (I'm not even using a separate
         | transformer decoder or RNN decoder layers for this number, just
         | the raw output of CTC greedy decode).
         | 
         | None of the above even factors in techniques like wav2vec and
         | IPL (iterative pseudo labeling) with noisy student, which
         | suggest you can hit extremely competitive accuracy with very
         | little correctly labeled data. These techniques are the
         | underpinnings of the current state of the art models.
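
        The "raw output of CTC greedy decode" and the WER figure above
        can be made concrete. This is only a generic sketch, not
        lunixbochs's actual pipeline; the blank symbol and toy inputs
        are illustrative assumptions:

        ```python
        # CTC greedy decoding: take the argmax token per frame,
        # collapse consecutive repeats, then drop blanks.
        BLANK = "_"  # assumed blank symbol for this toy example

        def ctc_greedy_decode(frame_tokens):
            """Collapse a per-frame argmax token sequence into a transcript."""
            out = []
            prev = None
            for tok in frame_tokens:
                if tok != prev and tok != BLANK:
                    out.append(tok)
                prev = tok
            return "".join(out)

        def wer(ref_words, hyp_words):
            """Word error rate: word-level Levenshtein distance / reference length."""
            d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
            for i in range(len(ref_words) + 1):
                d[i][0] = i
            for j in range(len(hyp_words) + 1):
                d[0][j] = j
            for i in range(1, len(ref_words) + 1):
                for j in range(1, len(hyp_words) + 1):
                    sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
                    d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
            return d[-1][-1] / max(len(ref_words), 1)
        ```

        A "10% WER" claim then simply means the hypothesis transcripts
        differ from the references by one word in ten, on average.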
        
         | ma2rten wrote:
          | Why does data for speech recognition need to be perfect?
          | That's certainly not the case for other machine learning
          | applications. Can you train on the less clean data and
          | fine-tune on a clean subset?
        
           | dabinat wrote:
           | Well that was kind of my point: you need to manually figure
           | out what's clean and what isn't.
        
         | stegrot wrote:
         | Here are some draft guidelines for validation that have been
         | translated a lot: https://discourse.mozilla.org/t/discussion-
         | of-new-guidelines...
         | 
          | But you are right, the process has some flaws. Maybe we
          | could automatically check the dataset for some common
          | errors once an STT system is ready for a language?
          | 
          | The only other option I can think of is a validation
          | process that includes more people per sentence. Right now,
          | only two people validate a sentence, and if they disagree a
          | third person decides. We could at least double-check
          | sentences with one "no" vote one more time.
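
        The voting flow described above is easy to sketch. This mirrors
        the comment's description only, not Common Voice's actual code;
        the status strings are assumptions:

        ```python
        # Two agreeing votes decide a clip; otherwise a third vote
        # breaks the tie.
        def clip_status(votes):
            """votes: list of booleans, True for a 'yes' vote."""
            yes = sum(votes)
            no = len(votes) - yes
            if yes >= 2:
                return "validated"
            if no >= 2:
                return "rejected"
            return "needs_more_votes"

        def needs_recheck(votes):
            """Flag validated clips that still drew a 'no' vote,
            per the double-checking idea in the comment above."""
            return clip_status(votes) == "validated" and not all(votes)
        ```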
        
           | dabinat wrote:
            | The community guidelines are good, but they're hidden
            | away on the forum. I asked them for years to just make
            | those the official guidelines and link them prominently
            | on the CV site, but they never did.
           | 
           | However, Hillary, the new community manager, seems good and
           | she's making a lot of positive changes so hopefully this will
           | be addressed soon.
           | 
           | Long-term the best approach may be some kind of user
           | onboarding before they can record / validate.
        
       | fareesh wrote:
       | Is voice transcription accessible to mere mortals yet?
       | 
       | I have tried pretty much every API offered by big tech, and also
       | various open source models. All of them seem to have incredibly
       | high word error rates. This is mostly for conversations with
       | various Indian accents.
        
         | nshm wrote:
          | Did you try the Vosk Indian English model? It is
          | specifically built for Indian-accented English:
         | 
         | https://alphacephei.com/vosk/models/vosk-model-en-in-0.4.zip
         | 
          | If you want more accuracy, you can share an example file
          | and we can take a look at how to get the best accuracy.
          | 
          | For Indian ASR it is also worth mentioning the recently
          | introduced Vakyansh project, which builds models for major
          | Indian languages:
         | 
         | https://github.com/Open-Speech-EkStep/vakyansh-models
        
       | Edman274 wrote:
       | I'm guessing that of the 4,600 new hours of speech, maybe 4,100
       | of those hours are of men's voices and 500 hours are of women's
       | voices, yeah?
        
         | LoriP wrote:
          | To be fair, not sure that's the best guess :) There seem to
          | be more female voices than male to me. Anyhow, I'd wager
          | there's at least a 50:50 mix.
        
       | johnnyApplePRNG wrote:
       | Just tried rating some of the English voices and I am conflicted.
       | 
        | Most of them were definitely speaking English, but with an
        | Indian intonation that I, coming from a country where English
        | is a first language, was barely able to understand.
       | 
       | Some of them were reading words syllable by syllable, which is
       | definitely English, but I would hate to have to listen to an
       | ebook or webpage read aloud to me in that manner.
       | 
       | By clicking yes am I training the system to speak English with an
       | Indian intonation?
       | 
       | Should I click no, not English?
       | 
        | Should/does English even have a "proper" intonation?
        
         | jturpin wrote:
          | Wow, you're right. This is conflicting, as many of the
          | words are not pronounced properly at all. Maybe it doesn't
          | matter to the accuracy of the speech-to-text system, but it
          | feels like training it with bad data.
        
           | ohgodplsno wrote:
            | Different accents aren't bad data. Your vision of the world of
           | "english is only spoken with an american accent" is what
           | leads to horrendous speech recognition APIs, like Google's.
           | 
           | If your ML model can't handle multiple accents, it is
           | worthless.
        
             | topspin wrote:
             | "english is only spoken with an american accent"
             | 
             | Which american accent?
        
             | jturpin wrote:
             | There's a difference between an accent and pronouncing
             | words wrong. I would expect an English speech recognition
             | system to handle the various accents there are in the world
             | (the US has several accents of course), but it shouldn't
             | handle incorrect pronunciation of syllables if it comes at
             | the expense of recognizing clean data. If it doesn't come
             | at its expense then I guess it's fine.
        
         | ma2rten wrote:
         | I think this dataset is mainly for speech recognition and not
         | text to speech. Speech recognition should be able to recognize
         | as many different accents as possible.
        
         | marc_abonce wrote:
         | From https://commonvoice.mozilla.org/en/criteria
         | 
         | > Varying Pronunciations
         | 
         | > Be cautious before rejecting a clip on the ground that the
         | reader has mispronounced a word, has put the stress in the
         | wrong place, or has apparently ignored a question mark. There
         | are a wide variety of pronunciations in use around the world,
         | some of which you may not have heard in your local community.
         | Please provide a margin of appreciation for those who may speak
         | differently from you.
         | 
         | > On the other hand, if you think that the reader has probably
         | never come across the word before, and is simply making an
         | incorrect guess at the pronunciation, please reject. If you are
         | unsure, use the skip button.
        
         | magicalhippo wrote:
         | Common Voice is not for generating speech, it's for detecting
         | speech.
         | 
          | So don't worry about weird intonation as long as they
          | correctly pronounce the sentences; that way even more
          | people can enjoy the fruits of this labor.
        
       | fisxoj wrote:
        | If anyone is interested in contributing, I've found this app
        | for Android makes it very easy!
        | https://www.saveriomorelli.com/commonvoice/
        
         | nmstoker wrote:
         | Why on Earth would anyone use an app for this when mobile
         | browsers work perfectly well for adding audio to Common Voice?
         | 
          | We could possibly give the developer the benefit of the
          | doubt that they're not doing anything inappropriate with
          | the data, but frankly, why pass your data through a third
          | party that's not part of the project?
         | 
          | And why install an app requiring access to your shared
          | local storage? The GitHub repo claims the website and
          | animations are slow, which sounds like BS to me. It works
          | fine on the five-year-old phone I use for submitting.
         | 
         | Just contribute here if you're so inclined, much more sensible:
         | 
         | https://commonvoice.mozilla.org/en
        
           | commoner wrote:
           | The unofficial CV Project Android app is entirely open source
           | and available on F-Droid:
           | 
           | https://github.com/Sav22999/common-voice-android
           | 
           | https://f-droid.org/packages/org.commonvoice.saverio/
        
             | nmstoker wrote:
             | Yes, I referenced the GitHub repo comments.
             | 
              | Sure, you can get the source, but as I said it's still
              | a pointless step to go via a third party.
        
           | totetsu wrote:
            | Because Mozilla fired the whole CV team, and the app is
            | under active development?
        
             | nmstoker wrote:
             | You aren't distinguishing the projects correctly. The CV
             | project isn't the same as the DeepSpeech project (even
             | though they were related).
             | 
              | And your point makes little sense, because if the site
              | were not working, how could the app get voice data into
              | the project? I've had some involvement with these
              | projects over the years, so I'm not just firing off
              | armchair comments on this. They wouldn't have been able
              | to add this new voice data if the site were
              | underdeveloped as you imply.
        
           | stegrot wrote:
            | The app has a few nice features the website doesn't have,
            | such as changing the playback speed during validation. It
            | always surprises me as well, but many people hate using
            | web apps on mobile. I don't really know why; they simply
            | ask for an app and refuse to use a browser.
        
       | alpb wrote:
        | This may be off-topic but: what's the relationship between
        | Coqui (an OSS TTS startup) https://coqui.ai/about and
        | Mozilla? I recall that the project at one point was called
        | mozilla/TTS (https://github.com/mozilla/TTS/) and now I see it
        | has a fork in the startup's own repo
        | (https://github.com/coqui-ai/TTS). Presumably Common Voice is
        | used to train mozilla/TTS and other OSS TTS solutions?
        
         | ftyers wrote:
          | Common Voice is mostly used for STT, not TTS. TTS requires
          | single-speaker, clean audio; STT requires multi-speaker,
          | noisy audio.
        
       | arghwhat wrote:
        | People seem to speak extremely mechanically in these samples,
        | which I suspect may lead to a training bias against natural
        | speech if used.
       | 
       | I think it should be explained that one should speak naturally
       | when reading the lines.
        
       | tsjq wrote:
        | Nice!
        | 
        | News from the past about this:
       | 
       | Initial Release of Mozilla's Open Source Speech Recognition Model
       | and Voice Data : https://news.ycombinator.com/item?id=15808124
       | 
       | Mozilla releases the largest to-date public domain transcribed
       | voice dataset https://news.ycombinator.com/item?id=19270646
        
       | junon wrote:
       | Historically not been the biggest fan of Mozilla but I really,
       | really love this project. I'm glad they're keeping it alive.
        
       | ftyers wrote:
       | One of the most noticeable additions in my opinion is Guarani,
       | the first Indigenous language of the Americas to be added.
       | Indigenous languages are extremely poorly supported and forgotten
       | by all of the major platforms and companies, and it's great to
       | see one getting the attention they deserve. (Disclaimer: I was
       | involved)
        
         | runarberg wrote:
         | As an Icelander I am always really impressed with how well my
         | language--a language spoken by a few hundred thousand people
         | worldwide--is supported on various platforms and technologies.
         | This is probably in no small part thanks to active
         | participation by native speakers and even some government
         | funding.
         | 
          | At the same time, however, I'm also deeply disappointed by
          | the lack of support for the language of Iceland's closest
          | neighbour--Greenlandic--an indigenous language and the sole
          | official language of an autonomous country.
        
           | matsemann wrote:
            | I saw the same for Norwegian when I was younger. Bokmal
            | is the most commonly written form of Norwegian, but New
            | Norwegian (Nynorsk) is used by about 15%. Most software
            | included Bokmal support, but you could bet some hardcore
            | user of New Norwegian had made a language pack available
            | as well.
        
             | necovek wrote:
             | Ah, I remember "Nynorsk" (sorry for the bad spelling and
             | ASCIIation) localisation of GNOME from early 2000s!
             | 
             | Generally, it takes only a few dedicated people to get
             | software localised if good enough infrastructure is
             | provided by the community!
             | 
             | I hope that's what we see with Mozilla Common Voice too!
        
               | Sharlin wrote:
               | "Nynorsk" is correct, no non-ASCII shenanigans in that
               | word :)
        
         | neartheplain wrote:
         | Whoah, 6.5 million native speakers! That's several orders of
         | magnitude more than I was expecting. It's also significantly
         | larger than the native-speaking populations of languages like
         | Catalan, Basque, or Romansh, which might be more familiar to
         | North Americans or Europeans.
        
           | victorlf wrote:
           | Catalan has about 10 million speakers.
        
           | andrepd wrote:
           | >It is one of the official languages of Paraguay (along with
           | Spanish), where it is spoken by the majority of the
           | population, and where half of the rural population is
           | monolingual.
           | 
           | Wow, I had no idea
        
           | djoldman wrote:
           | Or, 20x more than Icelandic:
           | 
           | https://en.wikipedia.org/wiki/Icelandic_language
        
             | hkt wrote:
             | Without wishing to get political, is the difference that
             | Iceland is a country but Guarani speakers don't have a
             | nation-state of their own? Or something else?
        
               | moron4hire wrote:
               | Nation-states are political entities, so choosing
               | languages by such a distinction would absolutely be
               | political.
        
               | air7 wrote:
                | Like any feature, perhaps it has to do with the
                | volume of anticipated use vs. the effort to support
                | it.
        
               | arp242 wrote:
               | Note that Icelandic is currently not well supported
               | either ("In progress" with 384/5000 sentences and 86%
               | Localized). Actually, Guarani is better supported at the
               | moment, and quite a number of other common smaller-ish
               | languages aren't well supported yet either such as
               | Hebrew, Danish, and even Korean (which is not small or
               | even small-ish at all). Some other smaller languages are,
               | such as Breton or Irish. Overall, it's a bit
               | inconsistent. I suppose that this is because in the end,
               | these things depend on the number of people contributing;
               | there's a reason Esperanto is near the top, as it has a
               | very active community of enthusiasts who love to promote
               | the language.
        
               | chudi wrote:
               | It's an official language of Paraguay
        
               | rudyfink wrote:
               | In case anyone else wanted to know more, there are,
               | apparently, 2 official languages and the other is
               | Spanish.
               | https://www.servat.unibe.ch/icl/pa00000_.html#A140_
        
               | interactivecode wrote:
               | The difference is completely and inherently political.
        
               | caymanjim wrote:
               | I think this is overly dismissive of other factors.
               | Whether or not a language is supported by something on
               | the Internet has a lot more to do with financial
               | incentives than politics. If there were a huge consumer
               | market clamoring to give their money to a site and the
               | only barrier were language, it'd get exploited pretty
               | quickly.
        
               | runarberg wrote:
                | No, it has a lot to do with politics as well. A
                | sovereign nation may find it important to have its
                | language supported widely on the internet, so it
                | might put some public funds into translation efforts
                | and voice recognition/speech synthesis contributions.
                | 
                | I know the Icelandic government spends some money on
                | this and it shows. This tiny language has way more
                | support than other, far more widely spoken languages.
                | If the Norwegian government wanted, I bet the Sami
                | languages could have just as good support as
                | Icelandic. Or if the Greenlandic government had more
                | funds available, I bet we would see Kalaallisut in
                | more places online.
        
               | ftyers wrote:
                | The Norwegian government and the Sami parliament put
                | a lot of effort into language technology for the Sami
                | languages. A big problem is the lack of openness in
                | platform support. E.g. Google and Apple make it very
                | difficult for external developers to do localisation.
        
               | necovek wrote:
               | What you are saying is that a small, relatively rich
               | country can invest in supporting their own language:
               | that, to me, is not political, but as raised previously,
               | financial. It's also a good incentive for other big
               | players (Google, Microsoft, Apple) to invest in a
               | language that has prospective customers willing to spend
               | more.
               | 
               | Serbian government would certainly support Serbian
               | language voice recognition and synthesis, but probably
               | not with as much money as Iceland would.
        
               | monocasa wrote:
                | > Politics (from Greek politika, 'affairs of the
                | cities') is the set of activities that are associated
                | with making decisions in groups, or other forms of
                | power relations between individuals, such as the
                | distribution of resources or status.
               | 
                | It certainly sounds like a political situation to me,
                | almost to a tautology. The fact that these decisions
                | were made on the basis of financial gain doesn't make
                | them any less political.
        
               | eropple wrote:
               | _> that, to me, is not political, but as raised
               | previously, financial_
               | 
               | The idea that there is a difference between these two
               | things is one of the more pernicious ones of the last
               | hundred years.
               | 
               | Money is power. The exercise of power is politics. They
               | can't be separated.
        
               | singlow wrote:
                | I'm sure having a nation-state is a major factor, but
                | I bet it also has to do with average wealth,
                | geographic location, and historical alliances.
                | However, I'd put my money on skin color as the
                | biggest factor.
        
               | runarberg wrote:
                | As an example in favor of your conclusion, I propose
                | Greenlandic: geographically really close to Iceland,
                | the sole official language of an autonomous country,
                | with significant cultural heritage (even a famous
                | [possible] dwarf planet is named after one of their
                | historic gods). However--unlike Iceland--Greenland is
                | not a wealthy country, and Greenlanders tend to have
                | darker skin than Icelanders.
        
               | puchatek wrote:
               | Autonomous territory, not a country.
        
           | kspacewalk2 wrote:
            | There are a number of Native American languages that have
            | numerous speakers but until recently have been
            | marginalized, repressed and ignored (and some to this
            | day). Guarani is the most widely spoken, but there are
            | also Quechua, Nahuatl, and the various Mayan languages
            | (spoken by around half of Guatemalans and another 2.5
            | million Mexicans).
        
       | olejorgenb wrote:
        | I find the recording UI a bit annoying. They make it
        | unnecessarily hard to re-record a clip. Re-recording the
        | previous clip is likely to be a common thing to do, yet
        | instead of providing a shortcut for this, they have shortcuts
        | for re-recording each of the 5 individual clips.
        | 
        | It's also impossible (?) to undo a clip. E.g., if I've
        | already recorded 3 clips and mistakenly begin a clip I simply
        | can't pronounce correctly, there's no way of removing that
        | clip without discarding the whole set. (EDIT: it is possible
        | by re-recording that clip and pressing skip)
        
         | Vinnl wrote:
         | Re-recording a clip is very rare for me. Keep in mind that it's
         | supposed to emulate real-world conditions, with all its
         | messiness.
        
       | jalopy wrote:
       | Going along with this: What are the latest and greatest open
       | source speech-to-text models and/or tools out there?
       | 
       | Would love to hear from experienced practitioners and a bit of
       | detail on the experience.
       | 
       | Thanks HN community!
        
         | thom wrote:
         | Same question for text-to-speech!
        
         | orra wrote:
         | Mozilla announced Deep Speech[1] around the same time as Common
         | Voice.
         | 
         | Mozilla Deep Speech is an open source speech recognition
         | engine, based upon Baidu's Deep Speech research paper[2].
         | 
         | Unsurprisingly, Deep Speech requires a corpus such as... Common
         | Voice.
         | 
         | [1] https://github.com/mozilla/DeepSpeech
         | 
         | [2] https://arxiv.org/abs/1412.5567
        
           | rasz wrote:
            | They killed this after the Nvidia grant.
        
             | orra wrote:
             | Ah, damn. Didn't realise.
             | 
             | It also looks like Baidu are now developing their Deep
             | Speech as open source?
             | https://github.com/PaddlePaddle/DeepSpeech
        
         | mazoza wrote:
         | https://github.com/coqui-ai/STT
        
         | zerop wrote:
          | Vosk is my favourite. I have used Deep Speech too. Vosk
          | works better.
        
           | nshm wrote:
            | Thank you. I deeply appreciate you mentioning our
            | efforts. We spent quite some time and knowledge building
            | accurate speech recognition. It's not that easy to get as
            | many mentions as Mozilla, so we are thankful for every
            | single one!
        
         | kcorbitt wrote:
          | I've had good results with https://github.com/flashlight/flashl
          | ight/blob/master/flashli.... It seems to work well with
          | spoken English in a variety of accents. The biggest
          | limitation is that the architecture they have pretrained
          | models for doesn't really work well with clips longer than
          | ~15 seconds, so you have to segment your input files.
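
        The ~15-second limit above means long recordings must be split
        before transcription. A naive fixed-length splitter over raw
        samples (a real pipeline would prefer to cut at silences; the
        parameters are illustrative):

        ```python
        # Split an audio sample buffer into chunks no longer than
        # max_seconds each, based on the sample rate.
        def segment(samples, sample_rate, max_seconds=15.0):
            """Return a list of consecutive chunks of at most max_seconds."""
            chunk = int(max_seconds * sample_rate)
            return [samples[i:i + chunk] for i in range(0, len(samples), chunk)]
        ```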
        
         | blackcat201 wrote:
          | I created edgedict [0] a year ago as part of my side
          | projects. At that time it was the only open source STT with
          | streaming capabilities. If anyone is interested, pretrained
          | weights for English and Chinese are available.
         | 
         | [0] https://github.com/theblackcat102/edgedict
        
         | woodson wrote:
         | NVidia NeMo: https://github.com/NVIDIA/NeMo
        
         | jononor wrote:
          | I have used VOSK a bit recently. The out-of-the-box
          | experience was great compared to earlier projects (looking
          | at you, Kaldi and Sphinx...). Word-level audio segmentation
          | was one use case:
          | https://stackoverflow.com/a/65370463/1967571
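
        For word-level segmentation, Vosk (with SetWords enabled on
        the recognizer) returns result JSON whose "result" field lists
        per-word timings. A small helper turning that into (word,
        start, end) tuples; the JSON sample in use here is
        illustrative, not captured from a real run:

        ```python
        import json

        # Parse a Vosk result JSON string into word-level segments.
        # Each entry in "result" has "word", "start", "end" (seconds)
        # and a confidence value.
        def word_segments(result_json):
            data = json.loads(result_json)
            return [(w["word"], w["start"], w["end"])
                    for w in data.get("result", [])]
        ```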
        
           | woodson wrote:
           | Vosk is built on Kaldi.
        
           | stegrot wrote:
           | Kdenlive supports automatic subtitles created with VOSK now
           | btw. This makes it a lot more accessible for non-tech folks.
        
       | [deleted]
        
       | rasz wrote:
        | What's the point when they killed DeepSpeech in exchange for
        | adopting a closed Nvidia thing?
       | 
       | https://venturebeat.com/2021/04/12/mozilla-winds-down-deepsp...
       | 
       | https://blog.mozilla.org/en/mozilla/mozilla-partners-with-nv...
       | 
        | $1.5M for shutting down an open source initiative; that's
        | almost half of the CEO's salary right there.
        
         | jononor wrote:
         | What closed NVidia thing did they adopt? I don't see any
         | evidence of that here.
        
           | option wrote:
           | https://github.com/NVIDIA/NeMo which is open source, Pytorch
           | based and regularly publishes new models and checkpoints.
        
             | Seirdy wrote:
             | The source code is under a FLOSS license, but it only works
             | on Nvidia GPUs and uses proprietary Nvidia-specific
             | technologies like CUDA.
             | 
                | It's significantly closer to "nonfree" on the
                | free-nonfree spectrum than it should be, and is
                | another example of the difference between the guiding
                | philosophies behind "free software" and "open
                | source".
        
               | yorwba wrote:
               | Can't you run it on CPU? And looking at the code, it
               | seems like they're using Numba to JIT their CUDA kernels,
               | so I guess someone could come along and provide a
               | compatibility shim to make the kernels run on a non-CUDA
               | accelerator?
        
           | rasz wrote:
            | I'm sure they signed on to adopting "something";
            | otherwise it would be receiving a $1.5 million grant for
            | closing an open source initiative. A $3-million-a-year
            | lawyer would never be this blatant.
        
         | stegrot wrote:
          | DeepSpeech is still alive in a way: the team founded the
          | company coqui.ai after the Mozilla layoffs, and they keep
          | everything open source.
        
         | mazoza wrote:
         | I know the old speech team continues as Coqui
         | https://github.com/coqui-ai/
        
           | tmalsburg2 wrote:
            | About their TTS system: "These models provide speech
            | synthesis with ~0.12 real-time factor on a GPU and ~1.02
            | on a CPU." The quality of the samples is really
            | impressive, but isn't this computationally too expensive
            | for many applications?
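
        To make the quoted figures concrete: real-time factor (RTF) is
        compute time divided by audio duration, so the cost of
        synthesizing a clip is simply rtf times its length. A trivial
        worked example:

        ```python
        # At RTF ~1.02 on CPU, a 10 s utterance takes ~10.2 s to
        # synthesize (barely slower than playback); at ~0.12 on GPU,
        # the same clip takes ~1.2 s.
        def synthesis_time(rtf, audio_seconds):
            """Seconds of compute needed to synthesize audio_seconds of speech."""
            return rtf * audio_seconds
        ```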
        
         | jononor wrote:
          | Open-source speech recognition is doing pretty well, with
          | projects such as VOSK, Athena, ESPnet and SpeechBrain.
          | These days models are the easy part of ML, and data is the
          | hard part. So for Mozilla to focus on Common Voice over
          | DeepSpeech seems reasonable.
        
           | tkinom wrote:
            | Could one use YouTube as training data?
            | 
            | Especially the videos with closed captions...
            | 
            | Is it as simple as extracting the audio and CC text?
        
             | soapdog wrote:
                | You can't really do it, for licensing reasons. One
             | cool thing Common Voice brings to the table, besides all
             | the fantastic data, is the licensing.
        
               | anonymfus wrote:
               | YouTube still allows uploaders to mark their videos as CC
               | BY 3.0 licensed, and it's still possible to check that
               | via YouTube's API.
               | 
               | (See https://support.google.com/youtube/answer/2797468
               | and the part about status.license here:
               | https://developers.google.com/youtube/v3/docs/videos)
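                | (A hedged sketch of that check: the field names below
                | follow the YouTube Data API v3 videos resource, where
                | status.license is "youtube" or "creativeCommon"; the
                | sample response fragment is purely illustrative:)

```python
def is_cc_by_licensed(video_resource: dict) -> bool:
    """Check whether a YouTube Data API v3 video resource reports the
    Creative Commons (CC BY 3.0) license instead of the standard one."""
    # videos.list?part=status returns status.license as either
    # "youtube" (standard license) or "creativeCommon" (CC BY 3.0).
    return video_resource.get("status", {}).get("license") == "creativeCommon"

# Illustrative fragment of a videos.list?part=status response item:
sample = {"id": "VIDEO_ID",
          "status": {"license": "creativeCommon", "privacyStatus": "public"}}
print(is_cc_by_licensed(sample))  # True
```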
        
               | NavinF wrote:
               | This is incorrect. Pretty much every state of the art
               | model uses copyrighted data. This is considered fair use
               | and it has never been a problem outside of concern
               | trolling.
        
               | ma2rten wrote:
               | Are you sure it's not fair use? I believe most legal
               | experts agree that language models such as GPT-3 are not
               | violating copyright due to fair use.
        
               | amelius wrote:
               | Source?
        
         | hkt wrote:
         | Having an open corpus means that researchers building the next
         | thing in voice research - which may or may not follow
         | DeepSpeech - have something to work with. This is enormously
         | important and their change of direction lets a thousand flowers
         | bloom. Meanwhile, their partnership with Nvidia provides a
         | fertile ground to prove the value of the open corpus in action.
         | Nvidia get access to Mozilla's (presumably superior) ability to
         | build said corpus, while Mozilla lay the foundations for others
         | to contribute work in the open. It is a great example of
         | comparative advantage, and a win win choice, IMO.
        
           | rasz wrote:
            | So, in other words, we provide data for free to Mozilla, and
            | Mozilla turns around and sells it for millions to Nvidia to
            | fund... not open source, they killed that, so, um, to fund
            | the CEO's salary?
        
             | nmstoker wrote:
             | You seem to imply that Nvidia are paying for data that is
             | freely available.
             | 
             | Anyone can use the Common Voice data within the terms of
             | the license and NVIDIA contributing towards the continued
             | gathering of data (that will continue to be made publicly
             | available) won't change that.
             | 
              | It's a huge shame that Mozilla didn't continue the
              | DeepSpeech project, but Coqui is taking on the mantle
              | there, and there are plenty of others working on open
              | source solutions too, all whilst the existence of CV will
              | make a big difference to research in the academic,
              | commercial and open source spheres.
        
               | robbedpeter wrote:
               | Coqui is phenomenally good and well done, so this new
               | data should lower the barrier to entry for the
               | represented languages.
        
             | danShumway wrote:
             | > and sells it
             | 
             | If that was true that would be a profoundly bad purchase
             | for NVidia since the data is already freely licensed and
             | available for anyone to use at no cost.
             | 
             | This is like saying that Epic "bought" Blender when they
             | gave it a development grant, or that Google contributing
             | patches to upstream Linux means they own it now. Mozilla
             | didn't give NVidia any kind of special license, when NVidia
             | contributes data to Common Voice they're doing so under
              | _Common Voice's_ license, not their own.
             | 
             | We want to encourage more companies to treat software and
             | training data as a public commons that is collectively
             | maintained, this is a good thing.
        
               | rasz wrote:
                | It's the kind of "bad" Nvidia purchase like when they
                | pay game publishers to incorporate
                | physx/cuda/hairworks/gameworks, resulting in
               | 
               | https://techreport.com/news/14707/ubisoft-comments-on-
               | assass...
               | 
               | https://techreport.com/review/21404/crysis-2-tessellation
               | -to...
               | 
               | https://arstechnica.com/gaming/2015/05/amd-says-nvidias-
               | game...
               | 
               | Here it appears they purchased this
               | https://venturebeat.com/2021/04/12/mozilla-winds-down-
               | deepsp...
        
         | moralestapia wrote:
         | Lol, these guys sell themselves for peanuts.
        
       | say_it_as_it_is wrote:
       | "The top five languages by total hours are English (2,630 hours),
       | Kinyarwanda (2,260), German (1,040), Catalan (920), and
       | Esperanto (840)."
       | 
       | How did they get almost as much training data for Kinyarwanda as
       | they have for English?
        
         | stegrot wrote:
         | The German Federal Ministry for Economic Cooperation and
         | Development supported this language:
         | https://www.bmz.de/de/aktuelles/intelligente-sprachtechnolog...
        
           | say_it_as_it_is wrote:
           | Interesting! There's a market for this kind of audio data
           | entry? What was the total cost for that many hours? The
           | English data was entirely volunteer driven, correct? Maybe
           | it's worth funding the English corpus for the additional
           | hours needed to reach the sweet spot?
        
       | russian_nukes wrote:
       | What is this voice database? Do they have russian voices?
        
       | bravura wrote:
       | Is anyone aware of classification (e.g. word prediction) datasets
       | for low-resource and endangered languages?
       | 
       | If so, we would like to use them for the HEAR NeurIPS
       | competition:
       | https://github.com/microsoft/DNS-Challenge/tree/master/datas...
       | 
       | The challenge is restricted to classification tasks; sequence
       | modeling like full ASR is unfortunately beyond the scope of the
       | competition.
        
       | danShumway wrote:
       | I don't really have anything of substance to add here, but I'm
       | very happy to see Mozilla continuing to put effort into this,
       | happy to see effort being put into broadening the support beyond
       | just English and major languages, and I'm grateful for the work
       | that people (inside and outside of Mozilla) have already put into
       | getting the project this far.
        
         | mgarciaisaia wrote:
         | You arguably have something of substance to add - you can help
         | improve the datasets by speaking or validating phrases on the
         | project's website:
         | 
         | https://commonvoice.mozilla.org/
         | 
         | There are many languages available to pick from.
        
         | orra wrote:
         | Indeed, it's great to see open data corpora expand.
        
       | _gtly wrote:
       | A direct link where you can donate your voice:
       | https://commonvoice.mozilla.org/en
        
       | donhaker wrote:
       | Let's take a moment to appreciate Mozilla's effort. In adding new
       | languages, including ones from minority communities, they show
       | they are continuously putting effort into the community.
        
         | Jnr wrote:
         | The great open source community around Mozilla helps a lot.
         | 
         | When I did not see my own language in the list a year ago and
         | had no clue how to get it there, I reached out to my university
         | contacts who I knew had translated Firefox years ago.
         | 
         | With their help we quickly translated the whole Common Voice
         | site (a prerequisite for starting to contribute a language)
         | and provided the first sets of text.
         | 
         | Within about a week we started contributing voice recordings
         | for a new language. The Common Voice project is awesome and
         | very well made.
        
       | satya71 wrote:
       | > The top five languages by total hours are English (2,630
       | hours), Kinyarwanda (2,260), German (1,040), Catalan (920), and
       | Esperanto (840)
       | 
       | Some unusual suspects among the top languages, there!
        
         | ftyers wrote:
         | That's what happens when people have the opportunity and tools
         | to support their own languages and not just rely on handouts
         | from big tech :)
        
           | umeshunni wrote:
           | Ah yes, major world languages with 10s or 100s of millions of
           | speakers (Bengali, Korean, Malayalam) are ignored or are
           | perpetually stuck "in progress" while hobby languages like
           | Esperanto are supported.
        
             | stegrot wrote:
             | Hey, I work on the Esperanto version of CV. You are right,
             | many languages should be bigger than Esperanto, and we
             | never planned to become this big, it just happened. We are
             | around ten active people and a telegram group with a few
             | hundred motivated donors. Plus, we write about the project
             | in Esperanto magazines and talk about it on Esperanto
             | congresses.
             | 
             | The point is: the only reason Bengali, Korean and Malayalam
             | are stuck "in progress" is that no one is working on them.
             | No language but English is actively supported by Mozilla,
             | it all comes from the communities. And the success of
             | Esperanto shows that every language can make it. I hope
             | that people take our work as a motivation. Every language
             | can become big if a few motivated people work on it for a
             | year or two. Even the smallest language can make it. You
             | just need a lot of public domain sentences, a few thousand
             | donors, and some technical knowledge, and then your
             | language will grow as well :)
        
               | umeshunni wrote:
               | Sure, I was responding to the facetious comment above.
               | 
               | When I can use Google or Facebook in any of these
               | languages for 10+ years, it's silly of this project to
               | claim some moral high ground when it can't support some
               | of the most widely spoken languages in the world and
               | sticks to languages that hipsters in San Francisco think
               | are cool.
        
               | yorwba wrote:
               | It _can_ support those languages, they just need some
               | people who actually speak them to come along and make it
               | happen. If you can help, I'm sure it will be appreciated.
        
           | Anon1096 wrote:
           | Esperanto is a hobby language for upper-middle class people
           | in developed countries. It isn't anyone's "own language".
        
             | ndkwj wrote:
             | Is "upper-middle class in developed countries" meant to be
             | an expletive?
        
             | bradrn wrote:
             | Well, it has native speakers:
             | https://en.wikipedia.org/wiki/Native_Esperanto_speakers
        
             | crvdgc wrote:
             | > Esperanto is a hobby language for upper-middle class
             | people in developed countries.
             | 
             | I wonder what gave you such an impression of Esperanto. My
             | personal experience of Esperanto is quite different.
             | 
             | I started to casually self-learn Esperanto about one year
             | ago as my second foreign language apart from English. After
             | about half a year, I was confident enough to join online
              | Esperanto communities, and they gave me a much more
              | diverse experience than any community I had encountered
              | on the Internet.
             | 
             | For example, in an online chat group, active users mainly
              | come from the US, South America, and Russia. As a person
              | from East Asia, I have little chance to get in touch with
              | the latter two groups otherwise. And there are often
             | new users from South America who speak only Spanish and
             | Esperanto.
             | 
              | I myself do not identify as an upper-middle-class person,
             | and I don't know enough to assess other Esperanto speakers'
             | class status.
             | 
             | The impression of Esperanto speakers being upper-middle
              | class may come from the fact that people learn Esperanto
              | as a hobby. But people outside the upper-middle class can
              | have other hobbies too, so why is Esperanto different? It
              | doesn't come
             | with the many benefits that people may expect from learning
             | a "practical" language, but it takes significantly less
             | effort. I'd say it's about as hard as learning a new
              | instrument. So it is not exclusive to upper-middle-class
              | people.
             | 
             | After one year of casual learning, I am now able to
             | contribute to the Common Voice project in Esperanto (175
             | recordings and 123 validations) and I actually use it as a
             | source of learning material.
        
             | krrrh wrote:
             | Technically there are a few hundred L1 speakers of
             | Esperanto, but that doesn't really contradict your point.
             | 
             | https://cogsci.ucsd.edu/~bkbergen/papers/NEJCL.pdf
        
             | stegrot wrote:
              | You are not wrong, but besides the upper-middle-class
              | hobbyists, there is also a 130-year-old culture that
              | exists in parallel. I've met a few native Esperanto
              | speakers, and for them Esperanto is their identity.
              | Traditional Esperanto clubs exist in countries like Iran,
              | Japan, China, Burundi, Nigeria and many more. So Esperanto
              | is both a nerdy hobby and an old culture.
        
             | hkt wrote:
             | Weirdly judgemental.
             | 
             | Esperanto was designed to be easy to learn. It isn't an
             | elite pursuit in the way you suggest, because its community
             | isn't gatekept. I personally have met people of all social
             | classes who have been interested in it.
             | 
              | It was also never meant to be a first language; it is an
              | auxiliary language. It is possible for an English speaker
              | to have a conversation with a Mandarin speaker with no
              | intermediary if both know (comparatively easy to learn)
              | Esperanto. Its original purpose wasn't trivial either: it
              | was created to stop groups without a common language in
              | the same city (Warsaw, I think?) from fighting, on the
              | theory that they'd stop doing so if only they could speak
              | a common language.
             | 
             | Think of it as JVM bytecode for people.
        
               | least wrote:
                | Auxiliary languages are kind of inherently doomed to
                | fail at their intended function, because for one to
                | succeed, governments with sufficient influence need to
                | commit to adopting it multilaterally. If the United
                | States and China bilaterally decided today to force
                | Esperanto into their school curricula, it'd likely be
                | adopted very quickly by everyone else, but that isn't
                | the case, and I doubt it ever would be under almost any
                | circumstance, because learning English is just
                | immediately more practical, even if it's a significantly
                | more difficult language to pick up.
               | 
               | And that's how it's played out. Nearly every developed
               | nation teaches English as a second language or is a
               | native population of English speakers. The universal
               | language is English. The JVM bytecode for people is
               | English.
        
               | voidnullnil wrote:
               | > The JVM bytecode for people is English.
               | 
               | What are you telling me? That I need to drop English?
        
               | jl6 wrote:
               | My takeaway is that nobody should speak English, but
               | instead people should compose their sentences in a
               | different language and then translate them to English at
               | the point of speaking (with small pauses in the
               | conversation for you to collect your thoughts on this
               | garbage).
        
               | hkt wrote:
               | Spoken like an anglophone. Tell that to Latin America and
               | East Asia..
        
               | least wrote:
                | I don't have to; you can look at pretty much any of
                | their language curricula and find a huge presence of
                | English in nearly all their education systems.
               | 
               | Certainly you will find people learning other languages
               | for trade depending on the region, but even in East Asia,
                | as you say, English is taught in China, Japan, Korea.
               | In Singapore English is the language everyone learns (and
               | is taught in). In Vietnam the primary foreign language
               | taught is English. In the Philippines one of its official
               | languages is English. Argentina teaches English in
               | elementary school. In Brazil students from grade 6 have
               | to learn a language, which is usually English. In
               | Venezuela English is taught from age 5.
               | 
               | So what exactly do I have to tell them?
        
               | yongjik wrote:
               | Not sure about Latin America, but bring someone from each
               | of China/Japan/Korea and they'll talk to each other in
               | English.
        
             | samtheDamned wrote:
             | They weren't exclusively talking about Esperanto. I read it
             | as a reference to Kinyarwanda and Catalan more than
             | anything else. In the bigger scheme of things there are a
             | lot of languages here that are definitely a product of
              | being able to share your own language. There are multiple
              | native languages being shared here, like the thread above
              | about Guarani.
        
           | 1-6 wrote:
           | You have a point there. I've been disappointed that Korean
           | has been stuck in the 'In Progress' state. The Korean tech
           | giants already have APIs to do common speech recognition. I
            | hope more Korean grassroots efforts focus on tools that are
            | open and accessible, so they can be built to scale and keep
            | improving.
        
             | yorwba wrote:
             | It looks like Korean still needs a fully localized
             | interface and a sufficiently large collection of sentences
             | to record. You can help by translating the interface
             | https://pontoon.mozilla.org/projects/common-voice/ and
             | collecting public-domain sentences
             | https://commonvoice.mozilla.org/sentence-collector/ and of
             | course by getting Koreans you know excited about the
             | project so they'll help, too.
        
             | fleaaaa wrote:
              | Thank you for pointing it out; I had no idea, but I'd be
              | happy to contribute to this one. There is indeed a decent
              | Korean natural language processing engine, but it's
              | tightly tied to its own ecosystem AFAIK.
             | 
             | https://papago.naver.com/
        
         | yorwba wrote:
         | The project seems to have some serious government backing in
         | Rwanda: https://digitalumuganda.com/
        
       | nyx-aiur wrote:
       | I love the datasets, but they are still way too small, especially
       | for exotic languages.
        
       | [deleted]
        
       | LoriP wrote:
       | Tips & Tricks incoming... I find that if I can't sleep and want
       | something that's kind of useful to do without getting too
       | involved, contributing to common voice is a great way to spend
       | half an hour and relax/forget whatever it is I was churning
       | about. I would recommend it for that, plus it's a great project.
       | Both listening and voicing...
        
       ___________________________________________________________________
       (page generated 2021-08-05 23:00 UTC)