[HN Gopher] Stable Audio: Fast Timing-Conditioned Latent Audio D...
___________________________________________________________________
Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion
Author : JonathanFly
Score : 313 points
Date : 2023-09-13 10:00 UTC (13 hours ago)
| colesantiago wrote:
| This is yet another amazing release from Stability AI.
|
| Will be adding this to my SaaS side grift and introduce generated
| music you can listen to while you're chatting with your PDFs.
|
| Can't wait for the next one.
| stainablesteel wrote:
| i want something that can take in a song and transform it into a
| different genre
| jncfhnb wrote:
| The bluegrass one is super weird. I can't identify exactly why.
| zzbzq wrote:
| I can identify a bunch of things. The chord structure jumps all
| over randomly in a genre that usually does the opposite. The
| banjo is clearly not an actual banjo being strummed/frailed,
| but a weird agglomeration of bright toned instruments including
| both frailed/Scruggs banjo and dobro, and maybe harmonica and
| fiddle creeping in. The AI doesn't know it's making a
| combination of instruments, so where it's trained on
| instruments blending, it thinks it can produce pre-blended
| sounds. I guess maybe this is more like a return to being a
| child hearing music for the first time with no preconceptions
| or expectations.
| smat wrote:
| You are right, it feels off.
|
| The position of the guitar in stereo is all over the place,
| higher frequency elements appear to come from the left while
| other parts are more centered.
| ewan251 wrote:
| I think the super weird part is that it's not great? I
| understand this is most likely very impressive technologically
| but musically it is disjointed, inconsistent and fake sounding.
| Most of the "music" examples have weird phrasing and confusing
| harmonic rhythm.
|
| Kudos to stability.ai for achieving this as I am sure it took a
| lot of effort and this is a huge leap forward in terms of
| generation of audio by generative AI.
|
| However as a musician (BMus and MMus at 2 different
| conservatoires) I think it's important to say that the job risk
| being experienced by creative writers will not extend to
| musicians... yet.
| jncfhnb wrote:
| I feel like music composition is a fundamentally hard task
| for AI. Music production seems like it should be a lot easier
| but I haven't seen that.
| viraptor wrote:
| From what I've seen in the generated tracks so far (this
| one and others), they're pretty good locally, but just
| ignore the overall composition. For example any generated
| blues tracks will have the vague blues feel, but won't keep
| the 12 bar style. The bluegrass example here doesn't even
| seem to keep to 4/4 (or is extremely fluid about it...).
| Maybe one day someone will add higher level "what's the
| current section, how far are you into it" inputs to that
| model to get something better - literally preparing the
| structure first and then filling it in. That should get
| much better results for context like "you're playing blues
| in A with quick change and generating bars 3-4, match the
| previous bars in style".
|
| I mean, chatgpt knows how to plan this out
| https://chat.openai.com/share/976077c0-138b-4363-8065-3c8eed...
| Painting in that picture should be much easier than generating
| something freeflowing. Generating a good structure isn't
| that hard for most styles, because you can literally use
| the same pattern and do a few random changes that keep the
| key. (See lots of pop songs using the same 3/4 chord
| progression)
| Jeff_Brown wrote:
| Yes, and surprisingly so. I _never_ would have guessed we'd
| have AI stock photographs before AI muzak.
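viraptor's "structure first, then fill in" idea can be sketched concretely. The names below (`plan_structure`, `conditioning_prompt`) are hypothetical illustrations, not part of Stable Audio or any real model API; this only shows the kind of higher-level "where am I in the song" conditioning the comment proposes:

```python
# Sketch of a structure-first pipeline: plan a 12-bar quick-change blues,
# then derive a per-bar conditioning prompt that a (hypothetical) local
# generator could consume alongside the audio context.

QUICK_CHANGE_BLUES = ["I", "IV", "I", "I",
                      "IV", "IV", "I", "I",
                      "V", "IV", "I", "V"]

DEGREE_TO_CHORD = {"I": "A7", "IV": "D7", "V": "E7"}  # blues in A

def plan_structure():
    """Return (bar_number, chord) pairs for one 12-bar chorus."""
    return [(i + 1, DEGREE_TO_CHORD[d]) for i, d in enumerate(QUICK_CHANGE_BLUES)]

def conditioning_prompt(bar, chord, total=12):
    # The extra positional context the comment suggests feeding the model.
    return f"blues in A, bar {bar} of {total}, chord {chord}, match previous bars in style"

for bar, chord in plan_structure()[:4]:
    print(conditioning_prompt(bar, chord))
```

The structure plan is cheap to generate (even by hand, as the linked ChatGPT transcript shows); the hard part the thread debates is whether a diffusion model can be conditioned on it.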
| stef25 wrote:
| Same for the death metal
| benesing wrote:
| Also, the music is not bluegrass as much as it is old-time, a
| confusion that continually irritates old-time players.
| jnwatson wrote:
| The AI seems to understand 4/4 time but doesn't understand
| groupings of 4 measures into phrases. It definitely doesn't
| understand ABABACA or even the basic parts of a song.
|
| It is the musical equivalent of a meandering paragraph.
| Jeff_Brown wrote:
| Absolutely. And all AI for music I have seen suffers that
| problem.
|
| It makes me wonder whether the music generation should be
| stratified -- a coarse model lays out where parts like verse
| and chorus are, what distinguishes them, how to transition,
| etc., and then a finer-grained model fills in the details.
| skilled wrote:
| Relevant,
|
| https://news.ycombinator.com/item?id=37493741
|
| https://www.stableaudio.com/
| coldcode wrote:
| It's interesting tech but none of the musical pieces impressed me
| (I play multiple instruments and have written and arranged
| music), most sounded too repetitive and not very imaginative.
| This is also an issue with diffusion based art AI in general, it's
| good at a limited set of things but gets rather repetitive after
| a while. I could see using this as background music where quality
| is not important, like in games, though I doubt you could run the
| AI generator inside a game, you could generate it as an asset.
| People like Hans Zimmer and Ludwig Goransson have nothing to
| worry about.
|
| Singing would be an interesting experiment, but I don't see that
| here.
| Art9681 wrote:
| The real test is when this stuff is out in the wild and no one
| tells you it's AI and the thought doesn't cross your mind. Of
| course it's not impressive nor surprising when the answer was
| given up front.
| AuryGlenz wrote:
| I disagree with your diffusion based art assessment, and I
| think it's probably colored by what most people seem to want to
| make with it.
| Just as with regular art, you need some sort of
| vision to go beyond what everyone else does. Prompting "pretty
| girl wearing sexy clothing" for the umpteenth time isn't new.
|
| AI art gets rid of the technical skill step but the rest is
| there, although you may luck into something at random. If
| you're using ControlNet on Stable Diffusion or training your
| own models you have a lot of control over the output as well.
| zone411 wrote:
| Yes, while working on my AI Melodies Assistant project, it
| quickly became clear that generating pleasant but boring
| music isn't too difficult. To create a catchy tune, an element
| of surprise is essential. In the end, I was able to use it as
| an assistant to compose 60 melodies that I'm happy with
| (https://www.melodies.ai/)
| dylan604 wrote:
| Yeah, the "rock drums" example was like a student in a
| practice session. I'll be impressed when it can sound like Danny Carey.
|
| From all of the hype, I want to be impressed with results.
| Instead, we get these mediocre at best examples of what it can
| do. They are not good sales pitches to me.
| hoosieree wrote:
| Hate to break it to you but there is a vast market for
| mediocre content, and even Danny Carey has a bad day once in
| a while.
| dylan604 wrote:
| I think you've confused the vast market's forced acceptance
| of mediocrity (because that's what's available) with their
| wanting the mediocrity. Proof of that is seen with the
| emergence of places for affordably licensed royalty free
| options. The quality of production and styles today
| compared to those from the 90s/00s is amazing. There's been
| a few options on these sites that I would play in a set as
| DJ. This ain't yo momma's needle drop selections.
| chpatrick wrote:
| Sir, your dog can talk!
|
| Yes, but not very well.
| dylan604 wrote:
| You joke, but even those videos of people saying their dog
| can talk is just like this.
| It's cute because it's a real
| dog making sounds we really want to believe are words, when it's just
| them mimicking sounds because they get pettin's and treats.
|
| What I want is "AI" to do something impressive. Why are we
| trying to make the system generate the sounds itself? We
| don't make artists do that, we give them instruments. Give
| the models actual instruments, and then have it play them
| like a real artist. I will be much more impressed with an
| AI that understands composition and scoring, use of musical
| voices, key signatures. That would still be generative. I
| guess I just don't understand the point of the direction
| being taken. It's like a solution looking for a problem.
| Jeff_Brown wrote:
| We work with what we have. We don't have a lot of
| recordings of the physical movements of musicians; we
| have audio recordings.
|
| Similarly, we don't have recordings of the actions of
| painters; we have finished paintings -- but if you're not
| impressed with what AI can do in the visual sphere, your
| standards are, to put it mildly, high.
| dylan604 wrote:
| I'm not really sure how to take this. We absolutely have
| recordings of instruments. You can buy them as complete
| sets. You train on complete recordings, and then tell it
| how to use the sampled instruments to compose a song in
| the style of the trained data. Building something to make
| a waveform that looks like another waveform just seems
| like a very odd direction to take.
|
| Yes, my standard is: if it isn't at least as good as
| what's available now, what's the point?
| chpatrick wrote:
| Well, we found with Midjourney et al that these models
| can work very well despite having no pre-conceived or
| symbolic notions of composition, color theory,
| perspective or anything. Yet they can produce really good
| results in the image generation space. It's the same idea
| here, except much earlier days.
|
| In the same way, many successful musicians can't read
| sheet music or know music theory, they just know how to
| produce something that sounds good.
| dylan604 wrote:
| >In the same way, many successful musicians can't read
| sheet music or know music theory, they just know how to
| produce something that sounds good.
|
| Right, because they can operate the instruments that make
| the sound with natural talent, but they don't have to
| draw the waveforms. Audio generation is much different
| than image generation. It's just very odd to me.
| cwillu wrote:
| I thought "Prompt: 116 BPM rock drums loop clean production"
| wasn't bad, but I'll grant that most of the rest would be an
| excellent way of showing (for instance) a death metal fan what
| their favoured music sounds like to those who don't have an ear
| for it :D
| Maschinesky wrote:
| I'm surprised that you even leap to people like Hans Zimmer and
| others.
|
| The people we need to worry about are aallll of the people
| earning a living for everything else like background music for
| indie games, ambient music, etc.
| mpsprd wrote:
| >where quality is not important, like in games
|
| I don't believe that's a good example. Video game music is an
| important part of the gaming experience, but it's often taken
| for granted or overlooked.
| PatronBernard wrote:
| What I haven't seen done well by these generative AIs so far
| is structure (having a chorus, a verse, a bridge, ...) and
| harmonic movement/progressions (except for maybe a V-I or
| I-VI-ii-V). And those two things are exactly what makes a
| song interesting and non-repetitive.
| Jeff_Brown wrote:
| OpenAI's jukebox -- now 3 years old -- is creative and
| non-repetitive.
| Witness, for instance, its jam on Uptown Funk
| here:
|
| https://www.youtube.com/watch?v=KCaya74_NHw
|
| Or the changes shortly after 1:15, 2:15 and 2:40 in these
| extensions of Take On Me:
|
| https://www.youtube.com/watch?v=_3yOrUJ0SzY
| riskable wrote:
| I thought the "epic trailer music intense tribal percussion
| and brass" was pretty good. Rather, good _enough_ for something
| like a video game where the game engine is dynamically
| generating music based on the present situation in the game.
|
| I could easily find that music entertaining if it started
| playing the moment my character triggered a trap and suddenly,
| "the floor is lava" or my character enters a scene with the
| quest of winning over one of the love interests =)
| Cloudef wrote:
| Is the extreme metal music lacking from the training set? Why do
| the extreme metal examples always sound horrible?
| rafaelero wrote:
| Because metal sounds horrible.
| Jeff_Brown wrote:
| Metal is especially hard to mix in a way that keeps the voices
| distinct and clear. Maybe the training catalog includes a lot
| of low-budget metal.
| jasbur wrote:
| It's interesting that the Death Metal was the hardest to
| reproduce. I conclude that it's the most fundamentally human of
| all genres.
| Ninovdmark wrote:
| The sound sample seemed to fit the 'vibe', but lacked any
| discernible definition. Could it be that it's too sonically
| dense to easily reproduce? Perhaps this could be improved with
| a more tailored training set.
| awestroke wrote:
| I think it was just the genre that was the least represented in
| the training data
| chankstein38 wrote:
| It sounds more like breakcore haha
| dontreact wrote:
| Well... they hardly tried all genres :)
|
| It sounds like it can't handle lyrics or semantics that well so
| I suspect any genre where the lyricism is important would also
| be quite mushy and recognizably AI
| Jeff_Brown wrote:
| The Beatles seemed to be the hardest music for JukeBox to
| emulate.
| jacooper wrote:
| Everything looks very convincing apart from the airplane pilot and
| the sound effects. They sound very weird as if one is
| hallucinating
| wiz21c wrote:
| The airplane is super convincing as an encrypted Empire
| communication :-) (see Star Wars episode 5 IIRC)
| sebzim4500 wrote:
| The airplane one just sounds like a foreign language over a bad
| intercom, I think that could still be useful for some stuff.
| xpe wrote:
| Perhaps because generating good white noise requires
| randomness without autocorrelation or detectable patterns.
| stared wrote:
| I would love to use it for background music when I am working. I
| have specific tastes that depend on the task, mood, energy level,
| and ambiance.
| k12sosse wrote:
| If you're not attuned to Cryo Chamber (label), check them out.
| Maybe not fitting all use-cases, but a strong and deep
| catalogue.
| naillo wrote:
| I keep thinking back to when we didn't have stabilityai and it
| was just google and meta teasing us with mouth watering papers
| but never letting us touch them. I'm so thankful stability
| exists.
| [deleted]
| Tenoke wrote:
| Stability is great but Meta's MusicGen is available with code
| and weights while this isn't so that's a really odd place to
| make that comparison and complaint.
| waffletower wrote:
| Unfortunately MusicGen's output quality isn't strong enough.
| I applaud Meta for open sourcing it. The audio samples
| released for Stable Audio show much more promise. I look
| forward to code and model releases. I built out a Cog model
| for MusicGen and took it for a fairly extensive test drive
| and came back disappointed.
| Taek wrote:
| Before stable diffusion, nobody released weights at all. Meta
| et al only started sharing their models with the world when
| they realized how fast a developer ecosystem was building
| around the best models.
|
| Without stability, all of AI would still be closed and
| opaque.
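As an aside on xpe's white-noise point above: ideal white noise is uncorrelated sample-to-sample, so a model has no local pattern to latch onto. A quick stdlib-only check of that property (illustrative only, unrelated to Stable Audio's internals):

```python
import random

def autocorr(x, lag):
    """Normalized autocorrelation of sequence x at a given lag."""
    n = len(x) - lag
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n)) / n
    return cov / var

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(50_000)]  # Gaussian white noise

print(round(autocorr(noise, 0), 3))     # lag 0 is 1.0 by definition
print(abs(autocorr(noise, 1)) < 0.05)   # neighbouring samples: essentially 0
print(abs(autocorr(noise, 100)) < 0.05) # distant samples: essentially 0
```

The flat (near-zero) autocorrelation at every nonzero lag is exactly the "no detectable patterns" structure the comment describes.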
| Tenoke wrote:
| >Before stable diffusion, nobody released weights at all.
|
| That's not true. There's been a lot of models with weights
| from every player before Stability.
|
| >Without stability, all of AI would still be closed and
| opaque.
|
| Most GANs (the practically spiritual predecessor to
| diffusion models) for example were available. Huggingface
| existed and has realistically done more to keep AI open.
| And again, this specific release we are talking about by
| Stability is _not_ Open.
|
| Stability is great but you are rewriting history and
| doing it on the release where it makes least sense to do
| so.
| refulgentis wrote:
| Nah. Dunno where this is coming from but infamously no AI
| models were released by big players for years. Rewind 18
| months and all you got is GPT-3.0 that no one seems to
| care about and Disco Diffusion-y type stuff.
| thomashop wrote:
| You are looking at a very short part of recent history.
| It has not been like that at all.
| refulgentis wrote:
| I'm all ears. I was "in the room" from 2019 on. Can't
| name one art model you could run on your GPU from a FAANG
| or OpenAI before SD, and can't name one LLM with public
| access before ChatGPT, much less weights available till
| LLaMA 1.
|
| But please, do share.
| thewataccount wrote:
| OpenAI - GPT2 2019 -
| https://openai.com/research/gpt-2-1-5b-release
|
| Google - T5 - Feb 2020 -
| https://blog.research.google/2020/02/exploring-transfer-lear...
|
| Both of these were and still are used heavily for
| ongoing research and T5 has been found to be decently
| useful when fine-tuned.
|
| Weights were available for both.
| refulgentis wrote: | See https://news.ycombinator.com/item?id=37501964 | smoldesu wrote: | > Can't name one art model you could run on your GPU from | a FAANG or OpenAI before SD | | Google published dozens to promote Tensorflow: | | https://experiments.withgoogle.com/font-map | https://experiments.withgoogle.com/sketch-rnn-demo | https://experiments.withgoogle.com/curator-table | https://experiments.withgoogle.com/nsynth-super | https://experiments.withgoogle.com/t-sne-map | | The list goes on. Many are source-available with weights | too. | | > can't name one LLM with public access before ChatGPT, | much less weights available till LLaMA 1. | | Do any of these ring a bell? | | - DistilBERT/MobileBERT/DeBERTa/RoBERTa/ALBERT | | - FNet | | - GPT2/GPT-Neo/GPT-J | | - Marian | | - MBart | | - M2m100 | | - NLLB | | - Electra | | - T5/LongT5/T5-flan | | - XLNet | | - Reformer | | - ProphetNet | | - Pegasus | | That's not comprehensive but may be enough to jog your | memory. | refulgentis wrote: | I understand your point. | | The gap in communication is we don't mean _literally_ no | one _ever_ open-sourced models. I agree, that would be | absurd. [1] | | Companies, quite infamously and well-understood, _did_ | hold back their "real" generative models, even from being | available for pay. | | Take a stab at a literal definition: - post-GPT2 LLMs | (ex. PALM, PALM2) - art like DaLL-E, Imagen, Parti | | Loosely, we had Disco Diffusion for art, and GPT-3 for | LLMs, and then Dall-E, then Midjourney. That was over an | _entire year_, and the floodgates on private ones didn't | open till post SD/ChatGPT. | | [1] thank you for the lengths you went to highlight the | best over a considered span of time, I would have just | said something snarky :) | | [2] I did not realize FLAN was open-sourced a month | before ChatGPT, that's fascinating: we're stretching a | bit, beyond that, IMHO: the BERTs aren't recognizable as | LLMs. | smoldesu wrote: | All good. 
| I've also been working on LLMs since 2019-ish,
| so I wanted to toss a hat in the ring for the
| underrepresented transformer models. They were cool (eg.
| dumb), fast and worked better than they had any right to.
| In a lot of ways they are the ancestors of ChatGPT and
| Llama, so it's important to at least bring them into the
| discussion.
| hofstee wrote:
| https://github.com/google/deepdream
| astrange wrote:
| > Can't name one art model you could run on your GPU from
| a FAANG or OpenAI before SD
|
| CLIP could be used as an image generator, slowly.
|
| > and can't name one LLM with public access before
| ChatGPT, much less weights available till LLaMA 1
|
| InstructGPT was available on OpenAI playground for months
| before ChatGPT and was basically as capable as GPT3,
| people were really missing out. Don't know any good
| public models though.
| waffletower wrote:
| In the image generation space, weights were never released
| for Imagen and DALL-E, but yes you can find weights for
| more specialized generative models like StyleGAN (2, 3
| etc). Stable Diffusion was arguably one of the most
| influential open model releases, and I think the
| substantial investment in StabilityAI is evidence of that.
| astrange wrote:
| There were open reproductions of DALLE1 like ruDALLE.
| smoldesu wrote:
| GPT-2, GPT-J, XLNET, BERT, Longformers and T5 were all
| freely available before Stable Diffusion was even a press
| release.
| vhold wrote:
| Stable Diffusion 1 _contains_ a model OpenAI released. The
| CLIP encoder that was trained on text/image pairs at
| OpenAI.
|
| https://huggingface.co/runwayml/stable-diffusion-v1-5
|
| https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/m...
|
| Uploaded to Hugging Face Jan 2021
|
| https://huggingface.co/openai/clip-vit-large-patch14
| seydor wrote:
| Stability def helped push things forward, it even probably
| showed them that open source is inevitable
| refulgentis wrote:
| Nah, it's not, because without the releases of
| Stability/ChatGPT it'd be the same situation. Cool nihilism
| though
| Jeff_Brown wrote:
| Is the source for this available? I found no mention of it on
| the page.
| fnordpiglet wrote:
| It says source is coming
| hoosieree wrote:
| Humans take a long time to get good at art; in the meantime they
| still have to eat.
|
| So they compete with generative AI for a fixed number of jobs.
| The AI is cheaper and faster. Humans stop training to become
| artists.
|
| Without new training data, the generative AI models stagnate.
| Progress in art stops globally, forever.
|
| But for a brief glorious moment, we were able to say "huh, that's
| not bad".
| phone8675309 wrote:
| This is by design - the capital that backs modern art isn't
| doing it for love of the art but for money.
|
| For fine art, it's a way for them to launder money and keep it
| out of bank accounts where it can be seized trivially.
|
| For mass art, it's about selling to enough rubes to make a
| profit.
|
| Neither are impacted by a stagnation in art. If anything,
| they're aided by it - suddenly the art you bought to launder
| money retains its value because it's no longer the flavor of
| the week with the arts crowd.
| Jeff_Brown wrote:
| I still consider OpenAI's JukeBox (now at least 2 years old!) far
| and away the most creative music AI. But the combination of
| coherence, sound quality and creativity of this model is (to my
| knowledge) easily best in class.
| 4RealFreedom wrote:
| The sound quality of Jukebox is muddled. There are many
| inconsistencies. The loudness of vocals and the quality of
| instruments really stand out and not in a good way.
Hard to | talk about creativity because it's so subjective but I've found | it lacking in all AI music including JukeBox. Don't get me | wrong - this tech is amazing. | Jeff_Brown wrote: | It's mushy and inconsistent, absolutely. But it also comes up | with wild yet coherent changes that I've seen from nothing | else. | | At this comment I listed a few instances: | | https://news.ycombinator.com/item?id=37499067 | [deleted] | naillo wrote: | This is gonna be great to finetune on. There's only so many | boards of canada/aphex twin songs out there but I wish there were | more and this will let us generate more. | 52-6F-62 wrote: | This is not the way. | naillo wrote: | Why not? Mostly for private use in my case. SDXL has created | some beautiful works of art in my experiments and I would | love to have a similar experience in the music world. | wokwokwok wrote: | Come on, be creative and make something new instead of | copying someone else. | | It's just kind of lame imo. | | "Mostly" private use? Mmm. :thumbs down emoji: | naillo wrote: | I meant private use and maybe share with a few friends. I | actually agree with you that we probably shouldn't | finetune on great artists and try to sell the output | without modification or added creativity. Private or | close friends sharing is fun and life enriching and | inspiring though in my eyes. | 52-6F-62 wrote: | Boards of Canada came to their sound because in their | youth, the brothers had to move to Canada for a time. | Even though it was only a couple of years their | experience made an indelible mark on them--particularly | school days watching old National Film Board of Canada | tapes on worn VCR heads. | | When they moved back to Scotland and started their music | they started incorporating both the machinery and the | sounds from the tapes in their compositions. And they | could play their compositions live. It was quite the rig. | | It's not just entertainment. 
It's communicating a very | specific feeling and perspective. Keep learning and | create, don't be satisfied with just copying. | | The biggest difference here is in the doing. You have to | grow into one mode over time and energy spent, the other | is immediate gratification with minimal personal energy. | | Everything valuable comes during the course of that | process of growing and committing energy. And it's so | good. Don't deny yourself. | naillo wrote: | I get that. I think it's really cool what they did and | when musicians put in time and energy into making amazing | tracks. I get enough satisfaction from my normal coding | job though, I don't have time to dedicate my life to | music like they have. So from that perspective I'm just | happy that it's possible to get more music like that | type. Just a cool thing that exists in the world now, but | I still think working hard to realize an artistic vision | is also cool, separately. | 93po wrote: | there is very little creativity in most music already | | (Axis of Awesome - 4 Four Chord Song) | | https://youtu.be/5pidokakU4I?t=52 | 52-6F-62 wrote: | Do you know where the sounds came from that you like so | much? I think such a perspective is only reachable if you | do not. | | I recommend learning about that before deciding it's | satisfactory to reduce it to an algorithm suited for | copying. | | It'll enrich your life. Endless copies will not. They take | that music, that emergence of order out of chaos, and | return it back to chaos. | | It's void. | TheAceOfHearts wrote: | Does this model support / "understand" concepts of spatial audio? | For example, something like "an alarm moving around you in a | circle". | | When AudioGen was announced this was my first question, but from | what I've been able to test the model just ignores spatial audio | prompts. | | Unfortunately I haven't been able to find any discussion or | interest in online discussion about the importance / significance | of spatial audio. 
| Why not?
| cheald wrote:
| My guess is that it's not a very interesting problem because
| it's not particularly difficult to add spatial dimensions to
| arbitrary audio - after all, it is already commonly done in
| video games. All you have to do is manipulate the multichannel
| outputs with an understanding of the spatial positioning of
| each channel's speaker location relative to the listener and
| some basic trig.
| Jerrrry wrote:
| Dolby wouldn't appreciate it.
| gyumjibashyan wrote:
| This is crazy tech!
| 2Gkashmiri wrote:
| So.... Wait for llama for audio and train your own voice to
| avoid having to call your friends - you text, and the software
| says the words instead of you actually saying them? This is
| going to be nice for authentication, proving to a third party
| that you are yourself
| PcChip wrote:
| it's funny how they're all very impressive except the death metal
| skybrian wrote:
| As an amateur musician, I'd be more interested in these tools if,
| along with the text description, they took as input a melody or
| chord progression or performance data. Maybe ABC notation or a
| MIDI track? Anyone doing that?
|
| Other cool things would be a way to generate a sampled instrument
| from a text description, or to generate a new track given a text
| description and all the previous tracks for other instruments.
| There could be a new generation of audio tools that let you
| generate placeholders or better for everything.
| l33tman wrote:
| The analogue from stable diffusion would be ControlNet, where
| you can train a superimposed model on auxiliary data, this
| should be possible to do with chords for example, just like you
| can do with human poses, 3D depth maps etc in stable diffusion
| using controlnet
| emadm wrote:
| It's coming
| _sys49152 wrote:
| gamechanging stuff for sample based rap producers.
| haven't been
| able to log in yet but i think a good benchmark to start off with
| is to see if it can replicate the 'al green' sound from the early
| 70s - very distinct sounding production - drumless and
| instrumental.
|
| you don't need 45 or 90 straight seconds of a coherent song
| rendered. just need to dip in the 45 sec clip and cut out 4
| seconds here, another 4 there. reroll those cuts through stable
| audio, keep rolling, keep rolling. cut up and get a pile of clips
| together. arrange, layer, voila - you saved money on paying
| royalties for sampling.
|
| the lofi melodic sample on the stability page was passable.
| thought the bluegrass one sounded great actually. imagine being
| able to program bluegrass like rap.
|
| edit: oof. fully trained on a licensed commercial dataset from
| AudioSparx. muzak in, muzak out.
| randcraw wrote:
| I wonder if it makes sense to generate a combo of instruments
| rather than individual voices and then combine those with an
| arranger DNN. I would think it'd be much easier to capture each
| instrument's transients and dynamics that way, much less allow
| more subtlety in how they combine, like allowing the lead voice
| to shift among instruments, or even let the listener choose how
| each voice expresses stylistically and how they should combine.
|
| Trying to do all of that in a single DNN, much less parameterize
| it usably seems overly ambitious (or will be of more limited
| value ultimately).
| kherud wrote:
| Thank you for sharing! On a tangent: I'm wondering if there are
| any good open source models/libraries to reconstruct audio
| quality. I'm thinking about an end-to-end open source alternative
| to something like Adobe Podcast [1] to make noisy recordings
| sound professional. Anecdotally it's supposed to be very good. In
| a recent search, I haven't found anything convincing.
| In my naive
| view this task seems much simpler than audio generation and the
| demand far bigger, since not everyone has a professional audio
| setup ready at all times.
|
| [1] https://podcast.adobe.com/
| cosmok wrote:
| I have had a lot of success with this:
| https://ultimatevocalremover.com/ for de-noising
| joshspankit wrote:
| There seems to have been a fork in the road:
|
| On one side the tech for literal denoising has stagnated a bit.
| It's a very hard problem to remove all noise while keeping
| things like transients.
|
| On the other side, AI is being rapidly developed for its
| ability to denoise by recreating the recording, just without
| the noise.
| earthnail wrote:
| In our denoiser (see other comment), we worked on combining
| these two forks. That's how we can mathematically guarantee
| great audio quality.
|
| This combination was non-trivial as training old school DSP
| denoisers is not easily possible. We'll describe the math
| needed in our paper. We hope our publication will help the
| wider community work not just on denoising but also tasks
| like automatic mixing.
| earthnail wrote:
| We've been researching an audio denoiser for music that we will
| present at the AES conference in October. Description page:
| https://tape.it/denoising
|
| We'll also publish a webapp where you can use the denoiser for
| free. Mail me if you want beta access to it (email in profile).
|
| It won't be open-source though, although the paper will of
| course be public. It will also only reduce noise, and not
| reconstruct other aspects of audio quality. However, it can do
| so on any audio (in particular music), not just speech like
| Adobe Podcast, and it fully preserves the audio quality. It's
| designed exactly for the use case you want: to make noisy
| recordings sound professional.
| white_beach wrote:
| denoising seems to fail in the guitar and vocals example
| earthnail wrote:
| Can you clarify where it fails?
It's designed to remove | stationary noise only, and removes it very well in the | guitar and vocals example. | | Generally speaking, if you have other sounds that you don't | want in the audio, we don't remove them - it's hard to | decide from a musical point of view whether you want a | certain sound or not. To give an extreme example: a barking | dog probably doesn't belong in a Zoom conference, but it | may very well belong in your audio recording. Removing | such elements would be a creative decision. | | The guitar and vocals example has certain clicks in the | background that we don't remove - but the stationary noise | is gone. Existing professional (and complex) audio | restoration tools like iZotope RX don't remove those | clicks, either. It's a conservative approach, sure, but in | return you can throw any audio at it and it always improves | it. | haywirez wrote: | Are you sure the demo sound files are correct on the website? | Couldn't appreciate any glaringly obvious differences between | the original and denoised with studio grade headphones here. | Or, the originals aren't noisy enough. | whywhywhywhy wrote: | It's not open but Nvidia has RTX Voice for free if you have an | Nvidia card. | | Only weird thing is it's designed to be used real time, but I've | had some luck on cleaning up voice recordings replayed back | through it via audio routing. | hubraumhugo wrote: | Now imagine Spotify using this to generate individual earworms | for everybody based on their personal tastes (likes, playlists). | | Yes, AI is partly hype, but had someone told me this even two | years ago, I wouldn't have believed it. | joshspankit wrote: | This is why it's _vital_ that AI is openly available. Imagine a | world where Spotify is the only company that can do that, and | they use it to make sure they never pay royalties again. | bee_rider wrote: | How is Spotify for finding new music based on your tastes?
| I've only used Amazon and Pandora; Amazon is quite poor, | Pandora is pretty good. I suspect (although without proof) | that if a service can't suggest new music, it will have | trouble generating new music as well. | | Anyway, I very much would rather run this sort of thing | locally. You could just manually set your taste profile. | Plus, music can be quite personal; imagine you start | listening to too much music inspired by The Cure and suddenly | Amazon starts advertising black makeup and antidepressants or | something like that, it would be too disconcerting. | magicalhippo wrote: | > How is Spotify for finding new music based on your | tastes? | | I haven't tried any alternatives really, but so far for me | I'd say decent. I put on an album, and once it's over it'll play | similarish stuff. If I don't like a song I'll skip it and | it seems to incorporate that feedback. | | Only thing is that it doesn't seem to be too adventurous | and it adheres rather strictly to the local context. | Meaning, if I played a stoner rock track, it'll continue | suggesting stoner rock and not much else, even though I | have quite varied music favorited in my library. | | Overall though I've found a lot of new bands I enjoy that | way so, positive experience for me. | | edit: as an example, here are the two most recent ones it | suggested where I ended up buying the albums on Bandcamp. | Both have quite few monthly listeners (1-2k), so not what | I'd call mainstream. | | SUIR https://open.spotify.com/artist/6zOeQ2hyNfqi9UMHtyTSlF | | Mount Hush | https://open.spotify.com/artist/13clfeXxTPsDsqzSlLIBZJ | seanw444 wrote: | I'll +1. I'm generally a FOSS guy, so if Spotify hadn't | helped so much in discovering new music that I really | enjoy, I'd be potentially acquiring music via | questionable means and just playing them as audio files | directly.
| | There are two companies that have done well by Gabe | Newell's "piracy is a service problem - not a price | problem" position: Valve/Steam (who also contribute to | FOSS through Proton and SteamOS, which I heavily | appreciate), and Spotify. Spotify makes discovering and | aggregating music so easy that the alternatives don't | seem appealing. | chankstein38 wrote: | Similarly to others, I haven't really used other things in | a LONG time but Spotify's Discover Weekly playlist is | usually a list of bops that I enjoy a lot. I frequently end | up adding a huge portion of them to my liked songs and to | regularly used playlists! | riskable wrote: | I've tried Amazon and Pandora. Spotify is so vastly | superior at finding music I like it's in an entirely | different league. | | Having said that, starting about a year ago (maybe ~1.5 | years?) Spotify started inserting _obviously_ paid | promotion tracks into my auto-generated "Daily Mix _n_ " | playlists and it seriously bothered me. When my playlists | are made up of very specific genres of EDM and suddenly a | pop song plays from a famous person I get seriously angry. | | It hasn't happened in many months though so maybe they | learned their lesson. I was so mad I seriously considered | ending my Premium subscription right then and there when | that track played. | K5EiS wrote: | I find the discover weekly playlists that are made for me | are pretty hit or miss; overall I have found many new | songs I like with their help. | chankstein38 wrote: | Same! New songs, new artists, new genres, it's pretty | cool! I agree it can be hit or miss but it feels like the | longer I use it and the older, and more mature the | platform gets, the more hit discover weekly ends up | being! | viraptor wrote: | Similar - I found many good new things. I like the | discovery playlists, even with the tracks I don't like. | If they never missed, how could they ever suggest | something actually different and exciting?
It's | "discover" not "average of what you already like". | hospitalJail wrote: | I really want this. I have a band that I like, and I want more! | | Or I'd like to take a song I like, and make it educational, | like make it include the period table of elements. | [deleted] | liotier wrote: | Machine-generated music might be functionally equivalent to | human-generated music, but that ignores the cultural role of | art as a shared human experience - witness the liturgy of live | music. That can't happen with music tailored to each listener, | it can't happen without tracks that are fixed in time and can | be referred to. I can imagine it well-accepted for dynamic | music such as gaming soundtracks, but I suppose that machine | generation will be mostly a production technique resulting in | branded pieces. | jimmygrapes wrote: | I know quite a few technically talented musicians who have | next to no creativity in actually writing the music (aside | from jazz style improv sessions). Most of them never really | play live unless it's part of a similarly uncreative band of | college friends. I wonder if having a catchy/complex AI | generated song created for them to play live might be | interesting to them. Gonna check in and see what they think. | broast wrote: | > such as gaming soundtracks | | These days I also feel like my workout playlists might as | well be randomly generated dance music. | Jeff_Brown wrote: | Saying that's impossible makes me immediately wonder whether | it's not. There are already headphone dance parties. What if | a musical act's output was being interpreted through genre | lenses specific to each listener? | ragazzina wrote: | Spotify does not need to generate a tailored earworm for me. It | could already suggest songs that I like based on my personal | taste out of their 100-million-songs catalog - and it's | absolutely unable to do it. | zachthewf wrote: | Building a tailored earworm might actually be easier. 
| cesaref wrote: | The death metal example reminded me of the continuous streaming | death metal here: | | https://www.youtube.com/watch?v=MwtVkPKx3RA | shon wrote: | Ed Newton-Rex, VP of Audio at Stability, is speaking about how | this was built at The AI Conference in 2 weeks. | https://aiconference.com | iandanforth wrote: | The solo piano was interesting because of how clean it is. I can | imagine going from that sample to a score without too much | difficulty. Once it's in a symbolic format it becomes much more | flexible and re-usable. | | While this does _not_ seem to be the trend, I hope more gen AI in | the audio and visual realms starts to produce more structured / | symbolic output. For example, if I were Adobe I would be training | models, not to output full images, but either layers or brush | strokes and tool palette usage. Same for organizations that have | all the component tracks of music to work with. | Jeff_Brown wrote: | That raises an interesting difference between cleaning | AI-generated sound and cleaning ordinary recordings. In an | ordinary recording, there is an objective reality to discover | -- a certain collection of voices was summed to create a | signal. With (most? the best?) existing AI audio generation, | the waveform is created from whole cloth, and extracting voices | from it is an act of creation, not just discovery. | | I've come across AI-generated music that outputs something like | MIDI and controls synthesizers. Its audio quality was | crystal-clear, but the music was boring. That's not to say the | approach is a dead-end, of course -- and indeed, as a musician, | the idea of that kind of output is exciting. But getting good | data to train something that outputs separate MIDI-ish voices | seems much harder than getting raw audio signals. | fnordpiglet wrote: | Generative models can certainly create midi, but no one has | done it yet.
Given that the technique is already making video, audio, | images, and language, all you need to do is train and build a | model with an appropriate architecture. | | It's easy to forget this is all pretty new stuff and it still | costs a lot to make the base models. But the techniques are | (more or less) well documented and implementable with open | source tools. | jskherman wrote: | I believe Spotify's Basic Pitch[0] is already some work | towards building something like this. | | [0]: https://basicpitch.spotify.com/about | MrCheeze wrote: | It has been done - first by OpenAI (MuseNet, which is no | longer available) and later by Stanford (Anticipatory Music | Transformer): | https://nitter.net/jwthickstun/status/1669726326956371971 | TheActualWalko wrote: | We've done it! wavtool.com | fnordpiglet wrote: | That's really neat. How long have you been working on | this? | radarsat1 wrote: | > Generative models can certainly create midi, but no one | has done it yet. | | Note sequence generation from statistical models has a long | history, at least as long if not longer than text | generation. | | Have a look at section 2.1 of this survey paper [0] that | cites a paper from 1957 as the first work that applies | Markov models to music generation. | | And, of course, plenty of follow-up work 6 decades later on | GANs, LSTMs, and transformers. | | [0]: https://www.researchgate.net/publication/345915209_A_Compreh... | fnordpiglet wrote: | Yes, in fact I think at some point everyone has written | their own Markov generators or at least run Dissociated | Press. But we've really only seen meaningfully high | quality output over the last few years. | fassssst wrote: | Do you know if anyone has tried training a text-to-music | or text-to-midi model where the training data includes | things like emotion labels for each note interval or | chord progression? | Jeff_Brown wrote: | That sounds expensive and inefficient.
People's | interpretations of music (and abstract art more | generally) can be shockingly different; I suspect the | model would not get a clear signal from the result. | | But that makes me wonder to what extent labeling can be | programmed -- extracting chord changes, dynamics changes, | tempo, gross timbral characteristics, etc. | fassssst wrote: | And maybe even labels like popularity/play count/etc so | it has a better sense of what "sounds good" to certain | groups | fnordpiglet wrote: | There are a lot of LoRA models that are being made to generate | textures, maps, diagrams, backgrounds, etc. You don't need to | wait for Adobe; open source models like Stable Diffusion let | you do whatever you think is useful. I'd look to the open | source world for creative innovation. Adobe is just doing | what's on the product management roadmap. | miohtama wrote: | Having music editable for human post production is necessary | for most professional adoption. Generating MIDIs would make | much more sense than generating raw audio. | | This is what we do with AI images: you can fix them in | Photoshop, etc. You cannot do this for raw audio due to how | music is produced. | waffletower wrote: | Build or seek out a MIDI generating model. I hope Stable | Audio is _never_ the place for that. MIDI is deeply lossy and | it would be a tragedy if it was the only music representation. | Imagine if instead of phonographs, compact discs and | streaming audio we only had piano rolls. What a loss indeed. | iainctduncan wrote: | Midi is not lossy, midi is symbolic. There's a huge | difference. | Applejinx wrote: | No, it's lossy. It's an event model at a fixed data rate. | You can only do so many things sequentially, even if you | could represent any possible musical concept as a MIDI | event. So even if you're not sticking to note-on, | note-off, it's still extremely lossy. | miohtama wrote: | MIDI 2.0 improves a lot of things, how much dynamicity | and variation you can have.
MIDI 1.0 is a standard from the | early 80s. It indeed has shortcomings, but also the | upside is editability. | | It's then for the remaster / musician / actual | interpretation / post production to make the score into | something less of an event model. | iainctduncan wrote: | well ok, you could say it's a lossy format for capturing | physical movement, which it certainly is. My point is | that it is not a format for capturing _music_ any more | than a score on a page is a format for capturing music. | Both are instructions for a performer of music (one | machine, one human), which is a very different beast. | gamblor956 wrote: | MIDI is able to accommodate nearly everything that can be | represented through a musical score and instrumental | performance. What are you hoping to accomplish with AI- | generated waveforms that can't be done with MIDI? | mdp2021 wrote: | > _What_ | | "Intention" (as a tentative term) | | The question becomes: what has impeded the creation of a | MIDI file that can be confused with an actual concert | from Arturo Benedetti Michelangeli. | gamblor956 wrote: | Literally nothing is preventing this other than that | nobody has bothered to take the time to do it. | | The current version of MIDI is capable of replicating any | of his performances, even down to the randomness. | | Note that if you want to replicate the audio quality of | his performances, you will need a high-quality MIDI | instrument; the ones that ship with Windows will not | suffice. These MIDI instruments can range from a few | dollars to thousands of dollars. (See, e.g., Native | Instruments) | mdp2021 wrote: | > _nobody has bothered to take the time to do it_ | | In that case, we have a theoretical suggestion that | <<nothing is preventing this>>, but not an actual proof | based on a "Turing test"-like scenario which would have | specialists fooled, to corroborate that the new MIDI 2 | would suffice.
| gamblor956 wrote: | I gave you the knowledge to do this research yourself, | but since you are unwilling to do so, here is an example | of the performance possible using MIDI instruments: | https://www.youtube.com/watch?v=CvaChiq6gf0 | | As it is clear that you intend to keep shifting the | goalposts to make a point that can't be made, I will | withdraw from further participation in this thread. | waffletower wrote: | This is absurd. Sure, someone below posits that MIDI | could perhaps represent a piano performance by Arturo | Benedetti Michelangeli. I think it has been able to do a | passing job at that, _when you provide a decent piano_. | Regardless, piano rolls have been able to come close | since the early 20th century. But how well does MIDI | represent music performed by John Coltrane? Jimi Hendrix? | It falls on its face. The long fetishized Western music | notation abstraction, which MIDI poorly simulates, | completely fails for many important examples of music. I | would even venture to say that MIDI fails for _most_ of | them. But yes, MIDI is well-optimized for piano music | where an acceptable piano or simulation is available. | waffletower wrote: | The problems have long been known and articulated: | http://www.music.mcgill.ca/~gary/courses/papers/Moore-Dysfun... | gamblor956 wrote: | Your example of the failures of MIDI is based on a | 35-year-old paper (from 1988!!!) about an earlier version | of MIDI? | | When that paper was written it took several _weeks_ and | many millions of dollars of equipment to render | primitive, mono-color 3d graphics. Desktop computers had | 512 _kilobytes_ of RAM and the highest-end desktops 32 MB | of hard drive storage space. Computer screens had two | colors: black and green. Audio cards capable of making | beeps and clicks were the cutting-edge. WIFI was still a | decade away.
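The event-level representation being argued over here is easy to make concrete. Below is a minimal sketch (plain Python, stdlib only; the pitches, jitter amounts, and velocity ranges are invented for illustration and no real MIDI I/O is involved) of how an event stream can carry the micro-timing and dynamics that separate a performance from a quantized score:

```python
import random

# A MIDI-style performance is just a list of timed events. This toy
# "humanizer" takes a quantized score of (beat, pitch) pairs and adds
# the micro-timing offsets and velocity variation a real performance
# carries. All numeric choices here are invented for illustration.
SCORE = [(0.0, 60), (1.0, 62), (2.0, 64), (3.0, 65)]  # C D E F, one per beat

def humanize(score, tempo_bpm=120, timing_jitter=0.01, seed=None):
    """Return performance events with per-note timing and dynamics."""
    rng = random.Random(seed)
    sec_per_beat = 60.0 / tempo_bpm
    events = []
    for beat, pitch in score:
        # Shift each onset by up to +/- timing_jitter seconds.
        onset = beat * sec_per_beat + rng.uniform(-timing_jitter, timing_jitter)
        # MIDI velocity is 0-127; vary it to simulate dynamics.
        velocity = rng.randint(55, 90)
        events.append({"onset_s": round(onset, 4),
                       "pitch": pitch,
                       "velocity": velocity})
    return events

for event in humanize(SCORE, seed=0):
    print(event)
```

Whether this level of detail suffices to fool a listener is exactly the dispute above; the sketch only shows that the representation has room for it.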
| waffletower wrote: | You clearly didn't read the article, and clearly don't | understand how prevalent the MIDI 1.0 specification is | today. MIDI 2.0 is a very recent development (this year | LOL!) and has yet to be commercially adopted. The 1984 | design is what is largely in use today. At the time of | initial development, commercial synthesizers, not sound | cards, were the intended generators of sound utilizing | MIDI: https://www.vintagesynth.com/roland/juno106.php. | waffletower wrote: | MIDI is an extraordinarily lossy music representation. | Even Claude Shannon would facepalm at the assertion that | it could, in theory, represent audio faithfully. It is | not its purpose, it is decidedly not its practice, and it | is a ludicrously irrelevant example of pedantry to say | otherwise. The false equivalency asserted by the commons | can be aggravating :D | iainctduncan wrote: | MIDI is not a lossy format for audio because it's not a | representation of audio, period. It's a format for | conveying the motion of a piano keyboard, meant from the | beginning to be usable for various forms of audio. | Jeff_Brown wrote: | MIDI is not _inherently_ lossy. You could encode anything | in it, just as you can encode any novel as an integer. | | In practice, though, transformations from audio to MIDI | discard an enormous amount of important information, with | the possible exception of transcriptions of performances | on piano (where volume, frequency, duration and a good | physical model of a piano are enough to reconstruct | everything important about the signal) and similar | instruments. | schazers wrote: | I strongly agree about generating "editables" rather than | finalized media. In fact, that's why text generators are more | useful than current media generators: text is editable by | default. Here's a tweetstorm about it: | https://x.com/jsonriggs/status/1694490308220964999?s=20 | waffletower wrote: | Audio is definitely editable. 
While generative audio is new I | am hopeful that a host of interesting applications will | emerge (audio2audio etc.) within its ecosystem. Promising | signal separation (audio to stems) and pitch detection tools | already exist for raw audio signals. If you want to force | Stability to focus on symbolic representations (such as | severely lossy MIDI) I hope you can instead first try | adapting to tools that work fundamentally with rich audio | signals. Perhaps there will be room for symbolic music AI and | perhaps Stability will even develop additional models that | generate schematic music, but please please don't sacrifice | audio generality for piano-roll thinking alone. LoRAs will | undoubtedly be usable to generate more schematic audio via | the Stable Audio model -- I imagine they could be easily | repurposed to develop sample libraries compatible with DAW | (digital audio workstation), sequencer and tracker production | workflows. | visarga wrote: | Train the model with MIDI notes as text in the prompt and | the audio as target. It will learn to interpret notes. | waffletower wrote: | Not all music is well represented with notes, nor are | audio datasets with high-quality note representations | readily available. But I guess if you work hard enough | you can get close: | https://www.youtube.com/watch?v=o5aeuhad3OM My example | still sounds like the chiptune simulation that it is, | however. | tech_ken wrote: | I was wondering the same thing; it definitely seems like | generating the raw waveform runs into all kinds of weird issues | (like they touched on in this post). I would imagine that | training data would be a serious chokepoint here. Given how | much discourse is currently kicking off around the intellectual | property rights of just the final product (the mastered track), | I can't imagine many musicians would be eager to share what is | effectively the "proof of ownership" (track stems or MIDIs).
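The symbolic route these comments debate - generating chord symbols or notes rather than waveforms - is the classic territory of the Markov models mentioned upthread. A minimal sketch in plain Python (the chord transition table and its weights are a toy example invented here, not learned from any corpus or from Stable Audio):

```python
import random

# Toy first-order Markov chain over chords in C major. The transition
# probabilities are invented for illustration only.
TRANSITIONS = {
    "C":  [("F", 0.4), ("G", 0.4), ("Am", 0.2)],
    "F":  [("G", 0.5), ("C", 0.3), ("Dm", 0.2)],
    "G":  [("C", 0.7), ("Am", 0.3)],
    "Am": [("F", 0.5), ("Dm", 0.3), ("G", 0.2)],
    "Dm": [("G", 0.8), ("C", 0.2)],
}

def generate_progression(start="C", length=8, seed=None):
    """Walk the chain, emitting one chord symbol per bar."""
    rng = random.Random(seed)
    chords = [start]
    while len(chords) < length:
        # Pick the next chord weighted by the current chord's row.
        options, weights = zip(*TRANSITIONS[chords[-1]])
        chords.append(rng.choices(options, weights=weights)[0])
    return chords

print(generate_progression(seed=1))
```

The output is exactly the kind of "editable" intermediate discussed above: a symbolic plan that a synthesizer, a performer, or an audio model could then render.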
| dylan604 wrote: | >For example, if I were Adobe I would be training models, not | to output full images, but either layers or brush strokes and | tool palette usage. Same for organizations that have all the | component tracks of music to work with. | | I really like this idea. Creating new tools for artists to use | to create rather than whatever we're accepting as use now. The | use of current full image creation is boring to me in the same | way the choice of invisibility as a super power is. The | invisibility is ultimately going to slide into pervy | tendencies, just like deep fakes will slide in the same way or | some other inappropriate use. | waffletower wrote: | Hopefully, the entire industry will _NOT_ move in such a | schematic and lossy direction. Use separate tools to analyze | audio streams please. Don't throw the timbre baby out with the | bathwater. MusicGen utilizes a tokenized transformer model for | music, which is attractive for symbolic translation use cases. | However, the overall audio quality is far more lossy than the | examples you hear from Stable Audio. I believe that symbolic | representation should not be a foundational approach to | adequately represent and generate rich audio signals. | gabereiser wrote: | Yessssss! I thought about MusicGAN and Markov chains last night | thinking "Why can't we just codify all chords and use a GAN to | generate Markov chains on chords of a key and have AI generate | instruments and waveform from those chains?" IANA researcher | but in my head, that sounded logical. | TylerE wrote: | That's existed for decades. It's called Band in a Box. It's | also cheezy as hell. | gabereiser wrote: | lol, no. Not autogenerate MIDI (although their latest | versions of BiaB are pretty darn good now) but generate | waveforms together. It would be similar to having AI | generate whole scores of music but ensuring it's all in | sync and in key.
Not taking a sample database of 88 sound | files and triggering them when the MIDI note strikes. | TylerE wrote: | That's not how BiaB works at all. It has all kinds of | patterns built into it. So it knows how to generate, | say, a bluegrass bassline in a given key. There are | plenty of ways to play back MIDI with high sound quality, | including feeding it into an AI-driven VST like | NotePerformer. | gabereiser wrote: | Then explain why you categorize it as cheesy? Sounds like | it's pretty cool. | 93po wrote: | he's saying the output is cheesy. it sounds like stuff | you'd hear on a demo track for a kid's toy piano | TylerE wrote: | It's not THAT bad, it's just that for a program | targeting jazz the playback is rather...square. | gabereiser wrote: | Yeah I definitely imagine certain genres sounding off due | to the rhythmic nature of computer timing. Jazz doesn't | follow rules like that so good jazz has these timing | idiosyncrasies that make it sound the way it sounds. That | and a ridiculous obsession with adding half step and | quarter tone intervals. | dylan604 wrote: | Just like all examples of generative "AI" I've seen, there's | always some bit of uncanny valley vibe present. In the audio | examples, there's always this weird distortion, as if really | poorly compressed sources were used as training data. The sounds | are muddled together, and rarely do I hear clean musical voices. | It's just a smear of sounds coming together that our brains try | really hard to resolve into an "oh, that's a _____". While the | samples in the TFA are probably the closest I've heard to date, | the issue is still present. | | I guess the thing that strikes me as so odd about the generative | thing is all of the press releases presenting things like a | final product, yet in output quality it's clearly pre-release | beta at best, more likely alpha. | If a non-AI product released something that was so | clearly not finished, it would be panned to no end for not | working. | [deleted] ___________________________________________________________________ (page generated 2023-09-13 23:00 UTC)