[HN Gopher] Mozilla Common Voice Adds 16 New Languages and 4,600... ___________________________________________________________________ Mozilla Common Voice Adds 16 New Languages and 4,600 New Hours of Speech Author : heyhillary Score : 554 points Date : 2021-08-05 12:54 UTC (10 hours ago) (HTM) web link (foundation.mozilla.org) (TXT) w3m dump (foundation.mozilla.org) | pkz wrote: | Openly licensed speech data for smaller languages is great! I | hope as many as possible contribute in order to get better | representation across ages and pronunciation. In the end, this | may be what is needed for the hyperscale companies to support | speech assistants in more languages? | dabinat wrote: | Common Voice is a great project that I'm glad Mozilla kept alive. | | One problem is that data for speech recognition needs to be | extremely accurate (i.e. the speech matches the transcript | perfectly), but the human review process is not infallible, and | quite a number of bad clips made it past the review | process (to be fair, Mozilla provides no official guidance to | reviewers or recorders). | | Plus in the early days, they were recording the same small | sentence pool over and over again, so the first 700 hours or so | are duplicates. | | I hope there will be efforts in the future to clean up the | existing dataset to improve its quality. | lunixbochs wrote: | I'm an ASR researcher shipping high quality English models | trained on limited resources, and while I've needed to include | other datasets to make the model more robust to different kinds | of text, Common Voice is a substantial part of my training | process. I did not do any manual cleanup. Most of my automated | cleanup was done with very basic (low quality) models. My | latest models trained this way are competitive with e.g. Google | or Apple English speech recognition accuracy. | | I'm going to disagree that there's ultimately a need for | perfect training data in ASR.
I'm sure it helps with some model | types and training processes, but it simply hasn't been a | factor in my use of Common Voice (English). I'll also note my | best model can hit around 10% WER on Common Voice Test without | any language model, which is better than any public numbers | I've seen posted for it so far (I'm not even using a separate | transformer decoder or RNN decoder layers for this number, just | the raw output of CTC greedy decode). | | None of the above even factors in techniques like wav2vec and | IPL (iterative pseudo labeling) with noisy student, which | suggest you can hit extremely competitive accuracy with very | little correctly labeled data. These techniques are the | underpinnings of the current state of the art models. | ma2rten wrote: | Why does data for speech recognition need to be perfect? That's | certainly not the case for other machine learning applications. | Can you train on the less clean data and fine-tune on a clean | subset? | dabinat wrote: | Well that was kind of my point: you need to manually figure | out what's clean and what isn't. | stegrot wrote: | Here are some draft guidelines for validation that have been | widely translated: https://discourse.mozilla.org/t/discussion-of-new-guidelines... | | But you are right, the process has some flaws. Maybe we can | review the dataset automatically for some common errors, once an | STT system is ready for a language? | | The only other option I can think of is a validation process | that includes more people per sentence. Right now, only two | people validate a sentence, and if they disagree a third person | decides. We could at least double-check sentences with one "no" | vote one more time. | dabinat wrote: | The community guidelines are good but they're hidden away on | the forum. I asked them for years to just make those the | official guidelines and link them prominently on the CV site | but they never did.
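[Editor's note: the two mechanics discussed in this subthread, collapsing raw CTC greedy-decode output and computing word error rate (WER), can be sketched in plain Python. This is an illustrative sketch, not any commenter's actual pipeline; the function names are made up.]

```python
def ctc_collapse(tokens, blank="_"):
    """Greedy CTC decode post-processing: merge repeated tokens,
    then drop blanks. `blank` is a placeholder blank symbol."""
    out, prev = [], None
    for t in tokens:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by
    the number of reference words."""
    r, h = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # substitution
                d[i - 1][j] + 1,                           # deletion
                d[i][j - 1] + 1,                           # insertion
            )
    return d[len(r)][len(h)] / len(r)
```

Under this definition, a 10% WER means roughly one word in ten is substituted, inserted, or deleted relative to the reference transcript.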
| | However, Hillary, the new community manager, seems good and | she's making a lot of positive changes so hopefully this will | be addressed soon. | | Long-term the best approach may be some kind of user | onboarding before they can record / validate. | fareesh wrote: | Is voice transcription accessible to mere mortals yet? | | I have tried pretty much every API offered by big tech, and also | various open source models. All of them seem to have incredibly | high word error rates. This is mostly for conversations with | various Indian accents. | nshm wrote: | Did you try the Vosk Indian English model? It is specifically built | for Indian-accented English: | | https://alphacephei.com/vosk/models/vosk-model-en-in-0.4.zip | | If you want more accuracy, you can share an example | file and we can take a look at how to get the best accuracy. | | For Indian ASR it is also worth mentioning the recently introduced | Vakyansh project, which builds models for major Indian languages: | | https://github.com/Open-Speech-EkStep/vakyansh-models | Edman274 wrote: | I'm guessing that of the 4,600 new hours of speech, maybe 4,100 | of those hours are of men's voices and 500 hours are of women's | voices, yeah? | LoriP wrote: | To be fair not sure that's the best guess :) there seem to be | more female voices than male ones to me. Anyhow, I'd wager there's at | least a 50:50 mix. | johnnyApplePRNG wrote: | Just tried rating some of the English voices and I am conflicted. | | Most of them were definitely speaking English, but in an Indian | intonation that I was barely able to understand coming from an | English as a First Language country. | | Some of them were reading words syllable by syllable, which is | definitely English, but I would hate to have to listen to an | ebook or webpage read aloud to me in that manner. | | By clicking yes am I training the system to speak English with an | Indian intonation? | | Should I click no, not English?
| | Should/does English even have a "proper" intonation? | jturpin wrote: | Wow, you're right. This is conflicting, as many of the words are | not pronounced properly at all. Maybe it doesn't matter to the | accuracy of the speech-to-text system, but it feels like | training it with bad data. | ohgodplsno wrote: | Different accents aren't bad data. Your vision of the world of | "english is only spoken with an american accent" is what | leads to horrendous speech recognition APIs, like Google's. | | If your ML model can't handle multiple accents, it is | worthless. | topspin wrote: | "english is only spoken with an american accent" | | Which american accent? | jturpin wrote: | There's a difference between an accent and pronouncing | words wrong. I would expect an English speech recognition | system to handle the various accents there are in the world | (the US has several accents of course), but it shouldn't | handle incorrect pronunciation of syllables if it comes at | the expense of recognizing clean data. If it doesn't come | at its expense then I guess it's fine. | ma2rten wrote: | I think this dataset is mainly for speech recognition and not | text to speech. Speech recognition should be able to recognize | as many different accents as possible. | marc_abonce wrote: | From https://commonvoice.mozilla.org/en/criteria | | > Varying Pronunciations | | > Be cautious before rejecting a clip on the ground that the | reader has mispronounced a word, has put the stress in the | wrong place, or has apparently ignored a question mark. There | are a wide variety of pronunciations in use around the world, | some of which you may not have heard in your local community. | Please provide a margin of appreciation for those who may speak | differently from you. | | > On the other hand, if you think that the reader has probably | never come across the word before, and is simply making an | incorrect guess at the pronunciation, please reject.
If you are | unsure, use the skip button. | magicalhippo wrote: | Common Voice is not for generating speech, it's for detecting | speech. | | So don't worry about weird intonation as long as they correctly | pronounce the sentences, that way even more people can enjoy | the fruit of this labor. | fisxoj wrote: | If anyone is interested in contributing, I've found this app for | Android makes it very easy! | https://www.saveriomorelli.com/commonvoice/ | nmstoker wrote: | Why on Earth would anyone use an app for this when mobile | browsers work perfectly well for adding audio to Common Voice? | | We could possibly give the developer the benefit of the doubt | that they're not doing anything inappropriate with the data but | frankly why pass your data through a third party that's not | part of the project? | | And why install an app requiring access to your shared local | storage? The GitHub repo claims the website and animations are | slow, which sounds like BS to me. It works fine on a five-year-old | phone I use for submitting. | | Just contribute here if you're so inclined, much more sensible: | | https://commonvoice.mozilla.org/en | commoner wrote: | The unofficial CV Project Android app is entirely open source | and available on F-Droid: | | https://github.com/Sav22999/common-voice-android | | https://f-droid.org/packages/org.commonvoice.saverio/ | nmstoker wrote: | Yes, I referenced the GitHub repo comments. | | Sure, you can get the source but as I said it's still a | pointless step to go via a third party. | totetsu wrote: | because Mozilla fired all the CV team, and the app is under | active development? | nmstoker wrote: | You aren't distinguishing the projects correctly. The CV | project isn't the same as the DeepSpeech project (even | though they were related). | | And your point makes little sense, because if the site was | not working, how could the app get voice data into the | project?
I've had some involvement with these projects over | the years so I'm not just firing off armchair comments on | this. They wouldn't have been able to add this new voice | data if the site was underdeveloped as you imply. | stegrot wrote: | The app has a few nice features the website doesn't have, | such as changing the speed during validation. It always | surprises me as well, but many people hate to use web apps on | mobile. I don't really know why, they simply ask for an app | and refuse to use a browser. | alpb wrote: | This may be off-topic but: What's the relationship between Coqui | (an OSS TTS startup) https://coqui.ai/about and Mozilla? I recall | that the project at one point was called mozilla/TTS | (https://github.com/mozilla/TTS/) and now I see that has a fork | in the startup's own repo (https://github.com/coqui-ai/TTS). | Presumably Common Voice is used to train mozilla/TTS and other | OSS TTS solutions? | ftyers wrote: | Common Voice is mostly used for STT, not TTS. TTS requires | single-speaker, clean audio. STT requires multi-speaker, noisy | audio. | arghwhat wrote: | People seem to speak extremely mechanically in these samples, | which I suspect may lead to training bias against natural | speech if used. | | I think it should be explained that one should speak naturally | when reading the lines. | tsjq wrote: | nice! | | News from the past about this: | | Initial Release of Mozilla's Open Source Speech Recognition Model | and Voice Data : https://news.ycombinator.com/item?id=15808124 | | Mozilla releases the largest to-date public domain transcribed | voice dataset https://news.ycombinator.com/item?id=19270646 | junon wrote: | Historically not been the biggest fan of Mozilla but I really, | really love this project. I'm glad they're keeping it alive. | ftyers wrote: | One of the most noticeable additions in my opinion is Guarani, | the first Indigenous language of the Americas to be added.
| Indigenous languages are extremely poorly supported and forgotten | by all of the major platforms and companies, and it's great to | see one getting the attention it deserves. (Disclaimer: I was | involved) | runarberg wrote: | As an Icelander I am always really impressed with how well my | language--a language spoken by a few hundred thousand people | worldwide--is supported on various platforms and technologies. | This is probably in no small part thanks to active | participation by native speakers and even some government | funding. | | However, at the same time, I'm also deeply disappointed by the | lack of support for Iceland's closest neighbour's language-- | Greenlandic--which is an indigenous language, the sole official | language of an autonomous country. | matsemann wrote: | I saw the same when I was younger for Norwegian. Bokmal is | the most commonly written form of Norwegian, but New | Norwegian is used by about 15%. Most software included | Bokmal support, but you could bet some hardcore user of New | Norwegian had made a language pack available as well. | necovek wrote: | Ah, I remember "Nynorsk" (sorry for the bad spelling and | ASCIIation) localisation of GNOME from early 2000s! | | Generally, it takes only a few dedicated people to get | software localised if good enough infrastructure is | provided by the community! | | I hope that's what we see with Mozilla Common Voice too! | Sharlin wrote: | "Nynorsk" is correct, no non-ASCII shenanigans in that | word :) | neartheplain wrote: | Whoah, 6.5 million native speakers! That's several orders of | magnitude more than I was expecting. It's also significantly | larger than the native-speaking populations of languages like | Catalan, Basque, or Romansh, which might be more familiar to | North Americans or Europeans. | victorlf wrote: | Catalan has about 10 million speakers.
| andrepd wrote: | >It is one of the official languages of Paraguay (along with | Spanish), where it is spoken by the majority of the | population, and where half of the rural population is | monolingual. | | Wow, I had no idea | djoldman wrote: | Or, 20x more than Icelandic: | | https://en.wikipedia.org/wiki/Icelandic_language | hkt wrote: | Without wishing to get political, is the difference that | Iceland is a country but Guarani speakers don't have a | nation-state of their own? Or something else? | moron4hire wrote: | Nation-states are political entities, so choosing | languages by such a distinction would absolutely be | political. | air7 wrote: | Like any feature, perhaps it has to do with the volume of | anticipated use vs the effort to support. | arp242 wrote: | Note that Icelandic is currently not well supported | either ("In progress" with 384/5000 sentences and 86% | Localized). Actually, Guarani is better supported at the | moment, and quite a number of other common smaller-ish | languages aren't well supported yet either, such as | Hebrew, Danish, and even Korean (which is not small or | even small-ish at all). Some other smaller languages are, | such as Breton or Irish. Overall, it's a bit | inconsistent. I suppose that this is because in the end, | these things depend on the number of people contributing; | there's a reason Esperanto is near the top, as it has a | very active community of enthusiasts who love to promote | the language. | chudi wrote: | It's an official language of Paraguay | rudyfink wrote: | In case anyone else wanted to know more, there are, | apparently, 2 official languages and the other is | Spanish. | https://www.servat.unibe.ch/icl/pa00000_.html#A140_ | interactivecode wrote: | The difference is completely and inherently political. | caymanjim wrote: | I think this is overly dismissive of other factors.
| Whether or not a language is supported by something on | the Internet has a lot more to do with financial | incentives than politics. If there were a huge consumer | market clamoring to give their money to a site and the | only barrier were language, it'd get exploited pretty | quickly. | runarberg wrote: | No, it has a lot to do with politics as well. A sovereign | nation may find it important to have their languages | supported widely on the internet, so they might put some | of their public funds into translation efforts and | voice recognition/speech synthesizer contributions. | | I know the Icelandic government spends some money on | this and it shows. This tiny language has way more | support than other, far more widely spoken languages. If the | Norwegian government wanted, I bet the Sami languages | could have just as good support as Icelandic. Or if | the Greenlandic government had more funds available, I bet | we would see Kalaallisut in more places online. | ftyers wrote: | The Norwegian government and Sami parliament put a lot of | effort into language technology for the Sami languages. A | big problem is lack of openness in platform support. E.g. | Google and Apple make it very difficult for external | developers to do localisation. | necovek wrote: | What you are saying is that a small, relatively rich | country can invest in supporting their own language: | that, to me, is not political, but as raised previously, | financial. It's also a good incentive for other big | players (Google, Microsoft, Apple) to invest in a | language that has prospective customers willing to spend | more. | | The Serbian government would certainly support Serbian | language voice recognition and synthesis, but probably | not with as much money as Iceland would.
| monocasa wrote: | > Politics (from Greek: Politika, politika, 'affairs of | the cities') is the set of activities that are associated | with making decisions in groups, or other forms of power | relations between individuals, such as the distribution | of resources or status. | | It certainly sounds like this is a political situation to | me, almost to a tautology. The fact that these decisions | were made on the basis of financial gain doesn't make them | any less political. | eropple wrote: | _> that, to me, is not political, but as raised | previously, financial_ | | The idea that there is a difference between these two | things is one of the more pernicious ones of the last | hundred years. | | Money is power. The exercise of power is politics. They | can't be separated. | singlow wrote: | I'm sure having a nation-state is a major factor, but I | bet it also has to do with the average wealth, geographic | location, and historical alliances. However, I'd put my money | on skin color as the biggest factor. | runarberg wrote: | As an example in favor of your conclusion, I propose | Greenlandic. Geographically really close to Iceland, is | the sole official language of an autonomous country, | significant cultural heritage (with even a famous | [possible] dwarf planet named after one of their historic | gods). However--unlike Iceland--Greenland is not a | wealthy country, and its people tend to have darker skin | color than Icelanders. | puchatek wrote: | Autonomous territory, not a country. | kspacewalk2 wrote: | There are a number of Native American languages that have | numerous speakers, but until recently have been marginalized, | repressed and ignored (and some to this day). Guarani is the | most numerous, but also Quechua, Nahuatl, and the various | Mayan languages (spoken by around half of Guatemalans, and | another 2.5 million Mexicans). | olejorgenb wrote: | I find the recording UI a bit annoying. They make it unnecessarily | hard to re-record a clip.
Re-recording the previous clip is | likely to be a common thing to do. Instead of providing a | shortcut for this, they have shortcuts for re-recording each of | the 5 individual clips. | | It's also impossible (?) to undo a clip. E.g.: If I've already | recorded 3 clips and mistakenly begin a clip I simply can't | pronounce correctly, there's no way of removing that clip without | discarding the whole set. (EDIT: it is possible by re-recording | that clip and pressing skip) | Vinnl wrote: | Re-recording a clip is very rare for me. Keep in mind that it's | supposed to emulate real-world conditions, with all its | messiness. | jalopy wrote: | Going along with this: What are the latest and greatest open | source speech-to-text models and/or tools out there? | | Would love to hear from experienced practitioners and a bit of | detail on the experience. | | Thanks HN community! | thom wrote: | Same question for text-to-speech! | orra wrote: | Mozilla announced Deep Speech[1] around the same time as Common | Voice. | | Mozilla Deep Speech is an open source speech recognition | engine, based upon Baidu's Deep Speech research paper[2]. | | Unsurprisingly, Deep Speech requires a corpus such as... Common | Voice. | | [1] https://github.com/mozilla/DeepSpeech | | [2] https://arxiv.org/abs/1412.5567 | rasz wrote: | They killed this after the Nvidia grant. | orra wrote: | Ah, damn. Didn't realise. | | It also looks like Baidu are now developing their Deep | Speech as open source? | https://github.com/PaddlePaddle/DeepSpeech | mazoza wrote: | https://github.com/coqui-ai/STT | zerop wrote: | Vosk is my favourite. I have used Deep Speech too. Vosk works | better. | nshm wrote: | Thank you. I deeply appreciate that you mention our efforts. We | have spent quite some time and knowledge building accurate speech | recognition. It's not that easy to get as many mentions as | Mozilla, so we are thankful for every single one!
| kcorbitt wrote: | I've had good results with | https://github.com/flashlight/flashlight/blob/master/flashli... | Seems to work well with spoken | English in a variety of accents. Biggest limitation is that the | architecture they have pretrained models for doesn't really | work well with clips longer than ~15 seconds, so you have to | segment your input files. | blackcat201 wrote: | I created edgedict [0] a year ago as part of my side projects. At | that time it was the only open-source STT with streaming | capabilities. If anyone is interested, the pretrained weights | for English and Chinese are available. | | [0] https://github.com/theblackcat102/edgedict | woodson wrote: | NVidia NeMo: https://github.com/NVIDIA/NeMo | jononor wrote: | Have used VOSK a bit recently. The out-of-the-box experience | was great compared to earlier projects (looking at you Kaldi | and Sphinx...). Word-level audio segmentation was one use case: | https://stackoverflow.com/a/65370463/1967571 | woodson wrote: | Vosk is built on Kaldi. | stegrot wrote: | Kdenlive supports automatic subtitles created with VOSK now | btw. This makes it a lot more accessible for non-tech folks. | [deleted] | rasz wrote: | What's the point when they killed DeepSpeech in exchange for | adopting a closed Nvidia thing? | | https://venturebeat.com/2021/04/12/mozilla-winds-down-deepsp... | | https://blog.mozilla.org/en/mozilla/mozilla-partners-with-nv... | | $1.5 million for shutting down an open source initiative, almost half of | the CEO's salary right there. | jononor wrote: | What closed NVidia thing did they adopt? I don't see any | evidence of that here. | option wrote: | https://github.com/NVIDIA/NeMo which is open source, PyTorch-based | and regularly publishes new models and checkpoints. | Seirdy wrote: | The source code is under a FLOSS license, but it only works | on Nvidia GPUs and uses proprietary Nvidia-specific | technologies like CUDA.
| | It's significantly closer to "nonfree" on the free-nonfree | spectrum than it should be, and is another example of the | difference between the guiding philosophies behind "free | software" and "open source" | yorwba wrote: | Can't you run it on CPU? And looking at the code, it | seems like they're using Numba to JIT their CUDA kernels, | so I guess someone could come along and provide a | compatibility shim to make the kernels run on a non-CUDA | accelerator? | rasz wrote: | I'm sure they signed on to adopting "something"; otherwise it | would just be receiving a $1.5 million grant for closing an open source | initiative. A $3-million-a-year lawyer would never be this | blatant. | stegrot wrote: | DeepSpeech is still alive in a way: the team founded the | company coqui.ai after the Mozilla layoffs and they keep | everything open source. | mazoza wrote: | I know the old speech team continues as Coqui | https://github.com/coqui-ai/ | tmalsburg2 wrote: | About their TTS system: "These models provide speech | synthesis with ~0.12 real-time factor on a GPU and ~1.02 on a | CPU." The quality of the samples is really impressive, but, | wow, isn't this computationally too expensive for many | applications? | jononor wrote: | Open-source speech recognition is doing pretty well with | projects such as VOSK, Athena, ESPNet and SpeechBrain. These | days models are the easy part of ML, and data is the hard one. | So for Mozilla to focus on Common Voice over DeepSpeech seems | reasonable. | tkinom wrote: | Could one use YouTube as training data? | | Especially for the videos with closed captions... | | As simple as extracting the audio and CC text? | soapdog wrote: | You can't really do it because of licensing reasons. One | cool thing Common Voice brings to the table, besides all | the fantastic data, is the licensing. | anonymfus wrote: | YouTube still allows uploaders to mark their videos as CC | BY 3.0 licensed, and it's still possible to check that | via YouTube's API.
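[Editor's note: the license check mentioned above uses the YouTube Data API v3 `videos` endpoint with `part=status`; the response's `status.license` field is `creativeCommon` for CC BY videos and `youtube` otherwise. A minimal request-builder sketch; the video ID and API key below are placeholders.]

```python
from urllib.parse import urlencode

YOUTUBE_VIDEOS_ENDPOINT = "https://www.googleapis.com/youtube/v3/videos"

def license_check_url(video_id, api_key):
    """Build the videos.list request URL; fetching it returns JSON whose
    items[0].status.license is either 'creativeCommon' or 'youtube'."""
    query = urlencode({"part": "status", "id": video_id, "key": api_key})
    return f"{YOUTUBE_VIDEOS_ENDPOINT}?{query}"
```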
| | (See https://support.google.com/youtube/answer/2797468 | and the part about status.license here: | https://developers.google.com/youtube/v3/docs/videos) | NavinF wrote: | This is incorrect. Pretty much every state of the art | model uses copyrighted data. This is considered fair use | and it has never been a problem outside of concern | trolling. | ma2rten wrote: | Are you sure it's not fair use? I believe most legal | experts agree that language models such as GPT-3 are not | violating copyright due to fair use. | amelius wrote: | Source? | hkt wrote: | Having an open corpus means that researchers building the next | thing in voice research - which may or may not follow | DeepSpeech - have something to work with. This is enormously | important and their change of direction lets a thousand flowers | bloom. Meanwhile, their partnership with Nvidia provides a | fertile ground to prove the value of the open corpus in action. | Nvidia get access to Mozilla's (presumably superior) ability to | build said corpus, while Mozilla lay the foundations for others | to contribute work in the open. It is a great example of | comparative advantage, and a win win choice, IMO. | rasz wrote: | So in other words we provide data for free to Mozilla, and | Mozilla turns around and sells it for millions to Nvidia to | fund ... not open source, they killed that so umm ee, to fund | ceo salary? | nmstoker wrote: | You seem to imply that Nvidia are paying for data that is | freely available. | | Anyone can use the Common Voice data within the terms of | the license and NVIDIA contributing towards the continued | gathering of data (that will continue to be made publicly | available) won't change that. 
| | It's a huge shame that Mozilla didn't continue the | DeepSpeech project but Coqui is taking on the mantle there | and there are plenty of others working on open source | solutions too, all whilst the existence of CV will make a | big difference to research, in the academic, commercial and | open source spheres. | robbedpeter wrote: | Coqui is phenomenally good and well done, so this new | data should lower the barrier to entry for the | represented languages. | danShumway wrote: | > and sells it | | If that was true that would be a profoundly bad purchase | for NVidia since the data is already freely licensed and | available for anyone to use at no cost. | | This is like saying that Epic "bought" Blender when they | gave it a development grant, or that Google contributing | patches to upstream Linux means they own it now. Mozilla | didn't give NVidia any kind of special license, when NVidia | contributes data to Common Voice they're doing so under | _Common Voice 's_ license, not their own. | | We want to encourage more companies to treat software and | training data as a public commons that is collectively | maintained, this is a good thing. | rasz wrote: | Its the kind of "bad" Nvidia purchase like when they pay | game publishers for incorporation of | physx/cuda/hairworks/gameworks resulting in | | https://techreport.com/news/14707/ubisoft-comments-on- | assass... | | https://techreport.com/review/21404/crysis-2-tessellation | -to... | | https://arstechnica.com/gaming/2015/05/amd-says-nvidias- | game... | | Here it appears they purchased this | https://venturebeat.com/2021/04/12/mozilla-winds-down- | deepsp... | moralestapia wrote: | Lol, these guys sell themselves for peanuts. | say_it_as_it_is wrote: | "The top five languages by total hours are English (2,630 hours), | Kinyarwanda (2,260) , German (1,040), Catalan (920), and | Esperanto (840)." | | How did they get almost as much training for Kinyarwanda as they | have English? 
| stegrot wrote: | The German Federal Ministry for Economic Cooperation and | Development supported this language: | https://www.bmz.de/de/aktuelles/intelligente-sprachtechnolog... | say_it_as_it_is wrote: | Interesting! There's a market for this kind of audio data | entry? What was the total cost for that many hours? The | English data was entirely volunteer driven, correct? Maybe | it's worth funding the English corpus for the additional | hours needed to reach the sweet spot? | russian_nukes wrote: | What is this voice database? Do they have russian voices? | bravura wrote: | Is anyone aware of classification (e.g. word prediction) datasets | for low-resource and endangered languages? | | If so, we would like to use it for the HEAR NeurIPS competition: | https://github.com/microsoft/DNS-Challenge/tree/master/datas... | | The challenge is restricted only to classification tasks, and | sequence modeling like full ASR is unfortunately beyond the scope | of the competition. | danShumway wrote: | I don't really have anything of substance to add here, but I'm | very happy to see Mozilla continuing to put effort into this, | happy to see effort being put into broadening the support beyond | just English and major languages, and I'm grateful for the work | that people (inside and outside of Mozilla) have already put into | getting the project this far. | mgarciaisaia wrote: | You arguably have something of substance to add - you can help | improve the datasets by speaking or validating phrases in the | project's website | | https://commonvoice.mozilla.org/ | | There are many languages available to pick from. | orra wrote: | Indeed, it's great to see open data corpuses expand. | _gtly wrote: | A direct link to where you can donate your voice here: | https://commonvoice.mozilla.org/en | donhaker wrote: | Let's take the time to appreciate the effort of Mozilla. 
In adding | new languages, some of them from minority communities, we can't deny | that they are continuously putting effort into the community. | Jnr wrote: | The great open source community around Mozilla helps a lot. | | When I did not see my own language in the list a year ago and | had no clue how to get it there, I reached out to my | university contacts that I know used to translate Firefox years | ago. | | With their help we quickly translated the whole Common Voice | site (it was a prerequisite to start contributing a language) | and provided the first sets of text to start contributing. | | In about a week we started contributing voice for a new | language. The Common Voice project is awesome and very well | made. | satya71 wrote: | > The top five languages by total hours are English (2,630 | hours), Kinyarwanda (2,260), German (1,040), Catalan (920), and | Esperanto (840) | | Some unusual suspects among the top languages, there! | ftyers wrote: | That's what happens when people have the opportunity and tools | to support their own languages and not just rely on handouts | from big tech :) | umeshunni wrote: | Ah yes, major world languages with 10s or 100s of millions of | speakers (Bengali, Korean, Malayalam) are ignored or are | perpetually stuck "in progress" while hobby languages like | Esperanto are supported. | stegrot wrote: | Hey, I work on the Esperanto version of CV. You are right, | many languages should be bigger than Esperanto, and we | never planned to become this big, it just happened. We are | around ten active people and a Telegram group with a few | hundred motivated donors. Plus, we write about the project | in Esperanto magazines and talk about it at Esperanto | congresses. | | The point is: the only reason Bengali, Korean, and Malayalam | are stuck "in progress" is that no one is working on them. | No language but English is actively supported by Mozilla, | it all comes from the communities.
And the success of | Esperanto shows that every language can make it. I hope | that people take our work as motivation. Every language | can become big if a few motivated people work on it for a | year or two. Even the smallest language can make it. You | just need a lot of public domain sentences, a few thousand | donors, and some technical knowledge, and your language will | grow as well :) | umeshunni wrote: | Sure, I was responding to the facetious comment above. | | When I can use Google or Facebook in any of these | languages for 10+ years, it's silly of this project to | claim some high moral ground when you can't support some | of the most widely spoken languages in the world and | stick to languages that hipsters in San Francisco think | are cool. | yorwba wrote: | It _can_ support those languages; they just need some | people who actually speak them to come along and make it | happen. If you can help, I'm sure it will be | appreciated. | Anon1096 wrote: | Esperanto is a hobby language for upper-middle-class people | in developed countries. It isn't anyone's "own language". | ndkwj wrote: | Is "upper-middle class in developed countries" meant to be | an expletive? | bradrn wrote: | Well, it has native speakers: | https://en.wikipedia.org/wiki/Native_Esperanto_speakers | crvdgc wrote: | > Esperanto is a hobby language for upper-middle-class | people in developed countries. | | I wonder what gave you such an impression of Esperanto. My | personal experience of Esperanto is quite different. | | I started casually self-learning Esperanto about one year | ago as my second foreign language apart from English. After | about half a year, I was confident enough to join online | Esperanto communities, and they gave me a surprisingly | diverse experience, much more so than any community I had | encountered on the Internet. | | For example, in an online chat group, active users mainly | come from the US, South America, and Russia.
As a person from | East Asia, there is little chance for me to get in touch | with the latter two groups otherwise. And there are often | new users from South America who speak only Spanish and | Esperanto. | | I myself do not identify as an upper-middle-class person, | and I don't know enough to assess other Esperanto speakers' | class status. | | The impression of Esperanto speakers being upper-middle | class may come from the fact that people learn Esperanto as | a hobby. But people not in the upper-middle class can have | other hobbies too, so why is Esperanto different? It doesn't | come with the many benefits that people may expect from | learning a "practical" language, but it takes significantly | less effort. I'd say it's about as hard as learning a new | instrument. So it is not exclusive to only | upper-middle-class people. | | After one year of casual learning, I am now able to | contribute to the Common Voice project in Esperanto (175 | recordings and 123 validations), and I actually use it as a | source of learning material. | krrrh wrote: | Technically there are a few hundred L1 speakers of | Esperanto, but that doesn't really contradict your point. | | https://cogsci.ucsd.edu/~bkbergen/papers/NEJCL.pdf | stegrot wrote: | You are not wrong, but besides the upper-middle-class hobby | people, there is also a 130-year-old culture that exists | parallel to it. I've met a few native Esperanto speakers, | and for them Esperanto is their identity. Traditional | Esperanto clubs exist in countries like Iran, Japan, | China, Burundi, Nigeria, and many more. So Esperanto is | both a nerdy hobby and an old culture. | hkt wrote: | Weirdly judgemental. | | Esperanto was designed to be easy to learn. It isn't an | elite pursuit in the way you suggest, because its community | isn't gatekept. I personally have met people of all social | classes who have been interested in it. | | It was also never meant to be a first language; it is an | auxiliary language.
It is possible for an English speaker | to have a conversation with a Mandarin speaker with no | intermediary if both know (comparatively easy-to-learn) | Esperanto. Its original purpose wasn't trivial either: it | was created to stop groups without a common language in the | same city (Warsaw, I think?) from fighting, on the premise | that they'd stop doing so if only they could speak a common | language. | | Think of it as JVM bytecode for people. | least wrote: | Auxiliary languages are kind of inherently doomed to fail | to function as intended, because for them | to function as such, a commitment to adopt | them multilaterally needs to be made by governments with | sufficient influence. If today the United States and China | bilaterally decided to force Esperanto into their school | curricula, it'd likely be adopted very quickly by | everyone else, but that isn't the case, and I doubt it | ever would be under almost any circumstances, because | learning English is just immediately more practical, even | if it's a significantly more difficult language to | pick up. | | And that's how it's played out. Nearly every developed | nation teaches English as a second language or has a | natively English-speaking population. The universal | language is English. The JVM bytecode for people is | English. | voidnullnil wrote: | > The JVM bytecode for people is English. | | What are you telling me? That I need to drop English? | jl6 wrote: | My takeaway is that nobody should speak English, but | instead people should compose their sentences in a | different language and then translate them to English at | the point of speaking (with small pauses in the | conversation for you to collect your thoughts on this | garbage). | hkt wrote: | Spoken like an anglophone. Tell that to Latin America and | East Asia.
| least wrote: | I don't have to; you can look at pretty much any of their | language curricula and find a huge presence of English | in nearly all their education systems. | | Certainly you will find people learning other languages | for trade depending on the region, but even in East Asia, | as you say, English is taught in China, Japan, and Korea. | In Singapore English is the language everyone learns (and | is taught in). In Vietnam the primary foreign language | taught is English. In the Philippines one of the official | languages is English. Argentina teaches English in | elementary school. In Brazil students from grade 6 have | to learn a language, which is usually English. In | Venezuela English is taught from age 5. | | So what exactly do I have to tell them? | yongjik wrote: | Not sure about Latin America, but bring someone from each | of China/Japan/Korea and they'll talk to each other in | English. | samtheDamned wrote: | They weren't exclusively talking about Esperanto. I read it | as a reference to Kinyarwanda and Catalan more than | anything else. In the bigger scheme of things there are a | lot of languages here that are definitely a product of | being able to share your own language. There are multiple | native languages being shared here, like the | thread above about Guarani. | 1-6 wrote: | You have a point there. I've been disappointed that Korean | has been stuck in the 'In Progress' state. The Korean tech | giants already have APIs to do common speech recognition. I | hope more Korean grassroots efforts focus on tools that are | open and accessible, so they can be built to scale and | improve. | yorwba wrote: | It looks like Korean still needs a fully localized | interface and a sufficiently large collection of sentences | to record.
You can help by translating the interface | https://pontoon.mozilla.org/projects/common-voice/ and | collecting public-domain sentences | https://commonvoice.mozilla.org/sentence-collector/ and of | course by getting Koreans you know excited about the | project so they'll help, too. | fleaaaa wrote: | Thank you for pointing it out, I had no idea, but I'd be | happy to contribute to this one. There is indeed a decent | Korean natural-language-processing engine, but it's tightly | tied to its own ecosystem AFAIK. | | https://papago.naver.com/ | yorwba wrote: | The project seems to have some serious government backing in | Rwanda: https://digitalumuganda.com/ | nyx-aiur wrote: | I love the datasets, but they are still way too small, | especially for exotic languages. | [deleted] | LoriP wrote: | Tips & Tricks incoming... I find that if I can't sleep and want | something that's kind of useful to do without getting too | involved, contributing to Common Voice is a great way to spend | half an hour and relax/forget whatever it is I was churning | about. I would recommend it for that, plus it's a great project. | Both listening and voicing... ___________________________________________________________________ (page generated 2021-08-05 23:00 UTC)