[HN Gopher] Stable Audio: Fast Timing-Conditioned Latent Audio D...
       ___________________________________________________________________
        
       Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion
        
       Author : JonathanFly
       Score  : 313 points
       Date   : 2023-09-13 10:00 UTC (13 hours ago)
        
 (HTM) web link (stability.ai)
 (TXT) w3m dump (stability.ai)
        
       | colesantiago wrote:
       | This is yet another amazing release from Stability AI.
       | 
       | Will be adding this to my SaaS side grift and introduce generated
       | music you can listen to while you're chatting with your PDFs.
       | 
       | Can't wait for the next one.
        
       | stainablesteel wrote:
        | I want something that can take in a song and transform it
        | into a different genre.
        
       | jncfhnb wrote:
       | The bluegrass one is super weird. I can't identify exactly why.
        
         | zzbzq wrote:
          | I can identify a bunch of things. The chord structure jumps
          | all over randomly, in a genre that usually does the opposite.
          | The banjo is clearly not an actual banjo being
          | strummed/frailed, but a weird agglomeration of bright-toned
          | instruments including both frailed/Scruggs banjo and dobro,
          | and maybe harmonica and fiddle creeping in. The AI doesn't
          | know it's making a combination of instruments; because it was
          | trained on instruments blending together, it thinks it can
          | produce pre-blended sounds. I guess maybe this is more like a
          | return to being a child hearing music for the first time,
          | with no preconceptions or expectations.
        
         | smat wrote:
          | You are right that it feels off.
          | 
          | The position of the guitar in stereo is all over the place:
          | higher-frequency elements appear to come from the left while
          | other parts are more centered.
        
         | ewan251 wrote:
          | I think the super weird part is that it's not great? I
          | understand this is most likely very impressive
          | technologically, but musically it is disjointed, inconsistent,
          | and fake-sounding. Most of the "music" examples have weird
          | phrasing and confusing harmonic rhythm.
         | 
         | Kudos to stability.ai for achieving this as I am sure it took a
         | lot of effort and this is a huge leap forward in terms of
         | generation of audio by generative AI.
         | 
         | However as a musician (BMus and MMus at 2 different
         | conservatoires) I think it's important to say that the job risk
         | being experienced by creative writers will not be extending to
         | musicians... yet.
        
           | jncfhnb wrote:
            | I feel like music composition is a fundamentally hard task
            | for AI. Music production seems like it should be a lot
            | easier, but I haven't seen that yet.
        
             | viraptor wrote:
             | From what I've seen in the generated tracks so far (this
             | one and others), they're pretty good locally, but just
             | ignore the overall composition. For example any generated
             | blues tracks will have the vague blues feel, but won't keep
             | the 12 bar style. The bluegrass example here doesn't even
             | seem to keep to 4/4 (or is extremely fluid about it...).
              | Maybe one day someone will add higher-level "what's the
              | current section, how far are you into it" inputs to the
              | model to get something better - literally preparing the
              | structure first and then filling it in. That should get
              | much better results for context like "you're playing blues
              | in A with quick change and generating bars 3-4, match the
              | previous bars in style".
             | 
              | I mean, chatgpt knows how to plan this out
              | https://chat.openai.com/share/976077c0-138b-4363-8065-3c8eed...
              | Painting in that picture should be much easier than
              | generating something freeflowing. Generating a good
              | structure isn't that hard for most styles, because you can
              | literally use the same pattern and do a few random changes
              | that keep the key. (See lots of pop songs using the same
              | 3- or 4-chord progression)
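              | 
              | To make this concrete, here is a minimal sketch (Python;
              | the feature names are hypothetical and have nothing to do
              | with Stable Audio's actual conditioning) of what such
              | structural inputs could look like - a section label plus
              | progress through the section, alongside the usual text
              | and timing conditioning:
              | 
              |   import numpy as np
              | 
              |   # Hypothetical section vocabulary; a real system would
              |   # need richer labels per genre.
              |   SECTIONS = ["intro", "verse", "chorus", "bridge", "outro"]
              | 
              |   def structure_features(section, bar_in_section, bars_total):
              |       """One-hot section id plus normalized progress,
              |       to concatenate with the other conditioning."""
              |       one_hot = np.zeros(len(SECTIONS))
              |       one_hot[SECTIONS.index(section)] = 1.0
              |       progress = bar_in_section / bars_total
              |       return np.concatenate([one_hot, [progress]])
              | 
              |   # e.g. generating bar 3 of a 12-bar blues chorus:
              |   cond = structure_features("chorus", 3, 12)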
        
             | Jeff_Brown wrote:
              | Yes, and surprisingly so. I _never_ would have guessed
              | we'd have AI stock photographs before AI muzak.
        
         | stef25 wrote:
         | Same for the death metal
        
         | benesing wrote:
         | Also, the music is not bluegrass as much as it is old-time, a
         | confusion that continually irritates old-time players.
        
         | jnwatson wrote:
         | The AI seems to understand 4/4 time but doesn't understand
         | groupings of 4 measures into phrases. It definitely doesn't
         | understand ABABACA or even the basic parts of a song.
         | 
         | It is the musical equivalent of a meandering paragraph.
        
           | Jeff_Brown wrote:
           | Absolutely. And all AI for music I have seen suffers that
           | problem.
           | 
           | It makes me wonder whether the music generation should be
           | stratified -- a coarse model lays out where parts like verse
           | and chorus are, what distinguishes them, how to transition,
           | etc., and then a finer-grained model fills in the details.
        
       | skilled wrote:
       | Relevant,
       | 
       | https://news.ycombinator.com/item?id=37493741
       | 
       | https://www.stableaudio.com/
        
       | coldcode wrote:
        | It's interesting tech, but none of the musical pieces impressed
        | me (I play multiple instruments and have written and arranged
        | music); most sounded too repetitive and not very imaginative.
        | This is also an issue with diffusion-based art AI in general:
        | it's good at a limited set of things but gets rather repetitive
        | after a while. I could see using this as background music where
        | quality is not important, like in games, though I doubt you
        | could run the AI generator inside a game; you could generate it
        | as an asset. People like Hans Zimmer and Ludwig Goransson have
        | nothing to worry about.
        | 
        | Singing would be an interesting experiment, but I don't see
        | that here.
        
         | Art9681 wrote:
         | The real test is when this stuff is out in the wild and no one
         | tells you it's AI and the thought doesn't cross your mind. Of
         | course it's not impressive nor surprising when the answer was
         | given up front.
        
         | AuryGlenz wrote:
          | I disagree with your diffusion-based art assessment, and I
          | think it's probably colored by what most people seem to want
          | to make with it. Just as with regular art, you need some sort
          | of vision to go beyond what everyone else does. Prompting
          | "pretty girl wearing sexy clothing" for the umpteenth time
          | isn't new.
          | 
          | AI art gets rid of the technical skill step but the rest is
          | still there, although you may luck into something at random.
          | If you're using ControlNet on Stable Diffusion or training
          | your own models, you have a lot of control over the output as
          | well.
        
         | zone411 wrote:
          | Yes, while working on my AI Melodies Assistant project, it
          | quickly became clear that generating pleasant but boring
          | music isn't too difficult. To create a catchy tune, an
          | element of surprise is essential. In the end, I was able to
          | use it as an assistant to compose 60 melodies that I'm happy
          | with (https://www.melodies.ai/).
        
         | dylan604 wrote:
          | Yeah, the "rock drums" example was like a student in a
          | practice session. I'll be impressed when it can sound like
          | Danny Carey.
          | 
          | Given all of the hype, I want to be impressed with the
          | results. Instead, we get these mediocre-at-best examples of
          | what it can do. They are not good sales pitches to me.
        
           | hoosieree wrote:
           | Hate to break it to you but there is a vast market for
           | mediocre content, and even Danny Carey has a bad day once in
           | a while.
        
             | dylan604 wrote:
              | I think you've confused the vast market's forced
              | acceptance of mediocrity (because that's what's available)
              | with an actual desire for mediocrity. Proof of that is
              | the emergence of places offering affordably licensed,
              | royalty-free options. The quality of production and styles
              | today compared to those from the 90s/00s is amazing.
              | There have been a few options on these sites that I would
              | play in a set as a DJ. This ain't yo momma's needle drop
              | selections.
        
           | chpatrick wrote:
           | Sir, your dog can talk!
           | 
           | Yes, but not very well.
        
             | dylan604 wrote:
              | You joke, but even those videos of people saying their dog
              | can talk are just like this. It's cute because it's a real
              | dog making sounds we really want to believe are speech,
              | when it's just them mimicking sounds because they get
              | pettin's and treats.
             | 
             | What I want is "AI" to do something impressive. Why are we
             | trying to make the system generate the sounds itself? We
             | don't make artists do that, we give them instruments. Give
             | the models actual instruments, and then have it play them
             | like a real artist. I will be much more impressed with an
             | AI that understands composition and scoring, use of musical
             | voices, key signatures. That would still be generative. I
             | guess I just don't understand the point of the direction
             | being taken. It's like a solution looking for a problem.
        
               | Jeff_Brown wrote:
               | We work with what we have. We don't have a lot of
               | recordings of the physical movements of musicians; we
               | have recordings.
               | 
               | Similarly, we don't have recordings of the actions of
               | painters; we have finished paintings -- but if you're not
               | impressed with what AI can do in the visual sphere, your
               | standards are, to put it mildly, high.
        
               | dylan604 wrote:
               | I'm not really sure how to take this. We absolutely have
               | recordings of instruments. You can buy them as complete
               | sets. You train on complete recordings, and then tell it
               | how to use the sampled instruments to compose a song in
               | the style of the trained data. Building something to make
               | a waveform that looks like another waveform just seems
               | like a very odd direction to take.
               | 
                | Yes, my standard is: if it isn't at least as good as
                | what's available now, what's the point?
        
               | chpatrick wrote:
                | Well, we found with Midjourney et al. that these models
                | can work very well despite having no preconceived or
                | symbolic notions of composition, color theory,
                | perspective, or anything else. Yet they can produce
                | really good results in the image generation space. It's
                | the same idea here, except much earlier days.
                | 
                | In the same way, many successful musicians can't read
                | sheet music and don't know music theory; they just know
                | how to produce something that sounds good.
        
               | dylan604 wrote:
                | >In the same way, many successful musicians can't read
                | sheet music and don't know music theory; they just know
                | how to produce something that sounds good.
                | 
                | Right, because they can operate the instruments that
                | make the sound with natural talent, but they don't have
                | to draw the waveforms. Audio generation is much
                | different from image generation. It's just very odd to
                | me.
        
         | cwillu wrote:
         | I thought "Prompt: 116 BPM rock drums loop clean production"
         | wasn't bad, but I'll grant that most of the rest would be an
         | excellent way of showing (for instance) a death metal fan what
         | their favoured music sounds like to those who don't have an ear
         | for it :D
        
         | Maschinesky wrote:
          | I'm surprised that you even leap to people like Hans Zimmer
          | and others.
          | 
          | The people we need to worry about are aallll of the people
          | earning a living from everything else, like background music
          | for indie games, ambient music, etc.
        
         | mpsprd wrote:
         | >where quality is not important, like in games
         | 
          | I don't believe that's a good example. Video game music is an
          | important part of the gaming experience, but it's often taken
          | for granted or overlooked.
        
         | PatronBernard wrote:
          | What I haven't seen done well by these generative AIs so far
          | is structure (having a chorus, a verse, a bridge, ...) and
          | harmonic movement/progressions (except for maybe a V-I or
          | I - VI - ii - V). And those two things are exactly what makes
          | a song interesting and non-repetitive.
        
           | Jeff_Brown wrote:
           | OpenAI's jukebox -- now 3 years old -- is creative and non-
           | repetitive. Witness, for instance, its jam on Uptown Funk
           | here:
           | 
           | https://www.youtube.com/watch?v=KCaya74_NHw
           | 
           | Or the changes shortly after 1:15, 2:15 and 2:40 in these
           | extensions of Take On Me:
           | 
           | https://www.youtube.com/watch?v=_3yOrUJ0SzY
        
         | riskable wrote:
         | I thought the, "epic trailer music intense tribal percussion
         | and brass" was pretty good. Rather, good _enough_ for something
         | like a video game where the game engine is dynamically
         | generating music based on the present situation in the game.
         | 
         | I could easily find that music entertaining if it started
         | playing the moment my character triggered a trap and suddenly,
         | "the floor is lava" or my character enters a scene with the
         | quest of winning over one of the love interests =)
        
       | Cloudef wrote:
        | Is extreme metal music missing from the training set? Why do
        | the extreme metal examples always sound horrible?
        
         | rafaelero wrote:
         | Because metal sounds horrible.
        
         | Jeff_Brown wrote:
         | Metal is especially hard to mix in a way that keeps the voices
         | distinct and clear. Maybe the training catalog includes a lot
         | of low-budget metal.
        
       | jasbur wrote:
       | It's interesting that the Death Metal was the hardest to
       | reproduce. I conclude that it's the most fundamentally human of
       | all genres.
        
         | Ninovdmark wrote:
         | The sound sample seemed to fit the 'vibe', but lacked any
         | discernible definition. Could it be that it's too sonically
         | dense to easily reproduce? Perhaps this could be improved with
         | a more tailored training set.
        
         | awestroke wrote:
         | I think it was just the genre that was the least represented in
         | the training data
        
         | chankstein38 wrote:
          | It sounds more like breakcore haha
        
         | dontreact wrote:
         | Well... they hardly tried all genres :)
         | 
          | It sounds like it can't handle lyrics or semantics that well,
          | so I suspect any genre where the lyricism is important would
          | also be quite mushy and recognizably AI.
        
           | Jeff_Brown wrote:
           | The Beatles seemed to be the hardest music for JukeBox to
           | emulate.
        
       | jacooper wrote:
        | Everything sounds very convincing apart from the airplane pilot
        | and the sound effects. They sound very weird, as if one is
        | hallucinating.
        
         | wiz21c wrote:
         | The airplane is super convincing as an encrypted Empire
         | communication :-) (see Star Wars episode 5 IIRC)
        
         | sebzim4500 wrote:
         | The airplane one just sounds like a foreign language over a bad
         | intercom, I think that could still be useful for some stuff.
        
           | xpe wrote:
           | Perhaps because generating good white noise requires
           | randomness without autocorrelation or detectable patterns.
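            | 
            | A quick numpy sketch of that property (nothing to do with
            | the model itself): white noise barely correlates with a
            | shifted copy of itself, while a tone correlates strongly at
            | its period.
            | 
            |   import numpy as np
            | 
            |   def autocorr(x, lag):
            |       """Normalized autocorrelation of x at a given lag."""
            |       x = x - x.mean()
            |       return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)
            | 
            |   sr = 48000
            |   rng = np.random.default_rng(0)
            |   noise = rng.standard_normal(sr)  # 1 s of white noise
            |   tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
            | 
            |   print(autocorr(noise, 109))  # ~0: no structure at any lag
            |   print(autocorr(tone, 109))   # ~1: 109 samples ~ one period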
        
       | stared wrote:
       | I would love to use it for background music when I am working. I
       | have specific tastes that depend on the task, mood, energy level,
       | and ambiance.
        
         | k12sosse wrote:
         | If you're not attuned to Cryo Chamber (label), check them out.
         | Maybe not fitting all use-cases, but a strong and deep
         | catalogue.
        
       | naillo wrote:
        | I keep thinking back to when we didn't have Stability AI and it
        | was just Google and Meta teasing us with mouth-watering papers
        | but never letting us touch them. I'm so thankful Stability
        | exists.
        
         | [deleted]
        
         | Tenoke wrote:
         | Stability is great but Meta's MusicGen is available with code
         | and weights while this isn't so that's a really odd place to
         | make that comparison and complaint.
        
           | waffletower wrote:
           | Unfortunately MusicGen's output quality isn't strong enough.
           | I applaud Meta for open sourcing it. The audio samples
           | released for Stable Audio show much more promise. I look
           | forward to code and model releases. I built out a Cog model
           | for MusicGen and took it for a fairly extensive test drive
           | and came back disappointed.
        
           | Taek wrote:
           | Before stable diffusion, nobody released weights at all. Meta
           | et al only started sharing their models with the world when
           | they realized how fast a developer ecosystem was building
           | around the best models.
           | 
           | Without stability, all of AI would still be closed and
           | opaque.
        
             | Tenoke wrote:
             | >Before stable diffusion, nobody released weights at all.
             | 
             | That's not true. There's been a lot of models with weights
             | from every player before Stability.
             | 
             | >Without stability, all of AI would still be closed and
             | opaque.
             | 
              | Most GANs (practically the spiritual predecessor to
              | diffusion models), for example, were available. Hugging
              | Face existed and has realistically done more to keep AI
              | open. And again, this specific release we are talking
              | about by Stability is _not_ open.
              | 
              | Stability is great, but you are rewriting history, and
              | doing it on the release where it makes the least sense to
              | do so.
        
               | refulgentis wrote:
                | Nah. Dunno where this is coming from, but infamously no
                | AI models were released by big players for years. Rewind
                | 18 months and all you had was GPT-3, which no one seemed
                | to care about, and Disco Diffusion-y type stuff.
        
               | thomashop wrote:
               | You are looking at a very short part of recent history.
               | It has not been like that at all.
        
               | refulgentis wrote:
               | I'm all ears. I was "in the room" from 2019 on. Can't
               | name one art model you could run on your GPU from a FAANG
               | or OpenAI before SD, and can't name one LLM with public
               | access before ChatGPT, much less weights available till
               | LLaMA 1.
               | 
               | But please, do share.
        
               | thewataccount wrote:
               | Openai - GPT2 2019 -
               | https://openai.com/research/gpt-2-1-5b-release
               | 
               | Google - T5 - Feb 2020 -
               | https://blog.research.google/2020/02/exploring-transfer-
               | lear...
               | 
                | Both of these were and still are used heavily for
                | ongoing research, and T5 has been found to be decently
                | useful when fine-tuned.
               | 
               | Weights were available for both.
        
               | refulgentis wrote:
               | See https://news.ycombinator.com/item?id=37501964
        
               | smoldesu wrote:
               | > Can't name one art model you could run on your GPU from
               | a FAANG or OpenAI before SD
               | 
               | Google published dozens to promote Tensorflow:
               | 
               | https://experiments.withgoogle.com/font-map
               | https://experiments.withgoogle.com/sketch-rnn-demo
               | https://experiments.withgoogle.com/curator-table
               | https://experiments.withgoogle.com/nsynth-super
               | https://experiments.withgoogle.com/t-sne-map
               | 
               | The list goes on. Many are source-available with weights
               | too.
               | 
               | > can't name one LLM with public access before ChatGPT,
               | much less weights available till LLaMA 1.
               | 
               | Do any of these ring a bell?
               | 
               | - DistilBERT/MobileBERT/DeBERTa/RoBERTa/ALBERT
               | 
               | - FNet
               | 
               | - GPT2/GPT-Neo/GPT-J
               | 
               | - Marian
               | 
               | - MBart
               | 
               | - M2m100
               | 
               | - NLLB
               | 
               | - Electra
               | 
               | - T5/LongT5/T5-flan
               | 
               | - XLNet
               | 
               | - Reformer
               | 
               | - ProphetNet
               | 
               | - Pegasus
               | 
               | That's not comprehensive but may be enough to jog your
               | memory.
        
               | refulgentis wrote:
               | I understand your point.
               | 
               | The gap in communication is we don't mean _literally_ no
               | one _ever_ open-sourced models. I agree, that would be
               | absurd. [1]
               | 
               | Companies, quite infamously and well-understood, _did_
               | hold back their "real" generative models, even from being
               | available for pay.
               | 
                | Take a stab at a literal definition:
                | 
                | - post-GPT-2 LLMs (ex. PaLM, PaLM 2)
                | 
                | - art like DALL-E, Imagen, Parti
                | 
                | Loosely, we had Disco Diffusion for art and GPT-3 for
                | LLMs, then DALL-E, then Midjourney. That was over an
                | _entire year_, and the floodgates on private ones didn't
                | open till post SD/ChatGPT.
               | 
                | [1] Thank you for going to such lengths to highlight
                | the best over a considered span of time; I would have
                | just said something snarky :)
                | 
                | [2] I did not realize FLAN was open-sourced a month
                | before ChatGPT; that's fascinating. Beyond that we're
                | stretching a bit, IMHO: the BERTs aren't recognizable
                | as LLMs.
        
               | smoldesu wrote:
               | All good. I've also been working on LLMs since 2019-ish,
               | so I wanted to toss a hat in the ring for the
               | underrepresented transformer models. They were cool (eg.
               | dumb), fast and worked better than they had any right to.
               | In a lot of ways they are the ancestors of ChatGPT and
               | Llama, so it's important to at least bring them into the
               | discussion.
        
               | hofstee wrote:
               | https://github.com/google/deepdream
        
               | astrange wrote:
               | > Can't name one art model you could run on your GPU from
               | a FAANG or OpenAI before SD
               | 
               | CLIP could be used as an image generator, slowly.
               | 
               | > and can't name one LLM with public access before
               | ChatGPT, much less weights available till LLaMA 1
               | 
               | InstructGPT was available on OpenAI playground for months
               | before ChatGPT and was basically as capable as GPT3,
               | people were really missing out. Don't know any good
               | public models though.
        
             | waffletower wrote:
              | In the image generation space, weights were never released
              | for Imagen and DALL-E, but yes, you can find weights for
              | more specialized generative models like StyleGAN (2, 3,
              | etc.). Stable Diffusion was arguably one of the most
              | influential open model releases, and I think the
              | substantial investment in Stability AI is evidence of
              | that.
        
               | astrange wrote:
               | There were open reproductions of DALLE1 like ruDALLE.
        
             | smoldesu wrote:
              | GPT-2, GPT-J, XLNet, BERT, Longformer and T5 were all
              | freely available before Stable Diffusion was even a press
              | release.
        
             | vhold wrote:
                | Stable Diffusion 1 _contains_ a model OpenAI released:
                | the CLIP encoder that was trained on text/image pairs
                | at OpenAI.
             | 
             | https://huggingface.co/runwayml/stable-diffusion-v1-5
             | 
             | https://huggingface.co/runwayml/stable-
             | diffusion-v1-5/blob/m...
             | 
             | Uploaded to Hugging Face Jan 2021
             | 
             | https://huggingface.co/openai/clip-vit-large-patch14
        
           | seydor wrote:
            | Stability definitely helped push things forward; it
            | probably even showed them that open source is inevitable.
        
           | refulgentis wrote:
            | Nah, it's not: without the releases of Stability/ChatGPT
            | it'd be the same situation. Cool nihilism though.
        
         | Jeff_Brown wrote:
         | Is the source for this available? I found no mention of it on
         | the page.
        
           | fnordpiglet wrote:
           | It says source is coming
        
       | hoosieree wrote:
       | Humans take a long time to get good at art; in the meantime they
       | still have to eat.
       | 
       | So they compete with generative AI for a fixed number of jobs.
       | The AI is cheaper and faster. Humans stop training to become
       | artists.
       | 
       | Without new training data, the generative AI models stagnate.
       | Progress in art stops globally, forever.
       | 
       | But for a brief glorious moment, we were able to say "huh, that's
       | not bad".
        
         | phone8675309 wrote:
         | This is by design - the capital that backs modern art isn't
         | doing it for love of the art but for money.
         | 
         | For fine art, it's a way for them to launder money and keep it
         | out of bank accounts where it can be seized trivially.
         | 
         | For mass art, it's about selling to enough rubes to make a
         | profit.
         | 
         | Neither are impacted by a stagnation in art. If anything,
         | they're aided by it - suddenly the art you bought to launder
         | money retains its value because it's no longer the flavor of
         | the week with the arts crowd.
        
       | Jeff_Brown wrote:
       | I still consider OpenAI's JukeBox (now at least 2 years old!) far
       | and away the most creative music AI. But the combination of
       | coherence, sound quality and creativity of this model is (to my
       | knowledge) easily best in class.
        
         | 4RealFreedom wrote:
         | The sound quality of Jukebox is muddled. There are many
         | inconsistencies. The loudness of vocals and the quality of
         | instruments really stand out and not in a good way. Hard to
         | talk about creativity because it's so subjective but I've found
         | it lacking in all AI music including JukeBox. Don't get me
         | wrong - this tech is amazing.
        
           | Jeff_Brown wrote:
           | It's mushy and inconsistent, absolutely. But it also comes up
           | with wild yet coherent changes that I've seen from nothing
           | else.
           | 
           | At this comment I listed a few instances:
           | 
           | https://news.ycombinator.com/item?id=37499067
        
       | [deleted]
        
       | naillo wrote:
        | This is gonna be great to finetune on. There are only so many
        | Boards of Canada/Aphex Twin songs out there, but I wish there
        | were more, and this will let us generate more.
        
         | 52-6F-62 wrote:
         | This is not the way.
        
           | naillo wrote:
           | Why not? Mostly for private use in my case. SDXL has created
           | some beautiful works of art in my experiments and I would
           | love to have a similar experience in the music world.
        
             | wokwokwok wrote:
             | Come on, be creative and make something new instead of
             | copying someone else.
             | 
             | It's just kind of lame imo.
             | 
             | "Mostly" private use? Mmm. :thumbs down emoji:
        
               | naillo wrote:
               | I meant private use and maybe share with a few friends. I
               | actually agree with you that we probably shouldn't
               | finetune on great artists and try to sell the output
               | without modification or added creativity. Private or
               | close friends sharing is fun and life enriching and
               | inspiring though in my eyes.
        
               | 52-6F-62 wrote:
               | Boards of Canada came to their sound because in their
               | youth, the brothers had to move to Canada for a time.
               | Even though it was only a couple of years their
               | experience made an indelible mark on them--particularly
               | school days watching old National Film Board of Canada
               | tapes on worn VCR heads.
               | 
               | When they moved back to Scotland and started their music
               | they started incorporating both the machinery and the
               | sounds from the tapes in their compositions. And they
               | could play their compositions live. It was quite the rig.
               | 
               | It's not just entertainment. It's communicating a very
               | specific feeling and perspective. Keep learning and
               | create, don't be satisfied with just copying.
               | 
                | The biggest difference here is in the doing. One mode
                | you have to grow into over time and energy spent; the
                | other is immediate gratification with minimal personal
                | energy.
               | 
               | Everything valuable comes during the course of that
               | process of growing and committing energy. And it's so
               | good. Don't deny yourself.
        
               | naillo wrote:
                | I get that. I think it's really cool what they did, and
                | when musicians put time and energy into making amazing
                | tracks. I get enough satisfaction from my normal coding
                | job, though; I don't have time to dedicate my life to
                | music like they have. So from that perspective I'm just
                | happy that it's possible to get more music of that
                | type. Just a cool thing that exists in the world now,
                | but I still think working hard to realize an artistic
                | vision is also cool, separately.
        
               | 93po wrote:
                | There is very little creativity in most music already.
                | 
                | (Axis of Awesome - Four Chord Song)
                | 
                | https://youtu.be/5pidokakU4I?t=52
        
             | 52-6F-62 wrote:
             | Do you know where the sounds came from that you like so
             | much? I think such a perspective is only reachable if you
             | do not.
             | 
             | I recommend learning about that before deciding it's
             | satisfactory to reduce it to an algorithm suited for
             | copying.
             | 
             | It'll enrich your life. Endless copies will not. They take
             | that music, that emergence of order out of chaos, and
             | return it back to chaos.
             | 
             | It's void.
        
       | TheAceOfHearts wrote:
       | Does this model support / "understand" concepts of spatial audio?
       | For example, something like "an alarm moving around you in a
       | circle".
       | 
       | When AudioGen was announced this was my first question, but from
       | what I've been able to test the model just ignores spatial audio
       | prompts.
       | 
        | Unfortunately I haven't been able to find any online discussion
        | of, or interest in, the importance / significance of spatial
        | audio. Why not?
        
         | cheald wrote:
         | My guess is that it's not a very interesting problem because
         | it's not particularly difficult to add spatial dimensions to
         | arbitrary audio - after all, it is already commonly done in
         | video games. All you have to do is manipulate the multichannel
         | outputs with an understanding of the spatial positioning of
         | each channel's speaker location relative to the listener and
         | some basic trig.
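          | 
          | For example, a constant-power pan that sweeps a mono source
          | around the listener is only a few lines (a numpy sketch of
          | the trig in question; true front/back cues would need HRTFs,
          | this only sweeps left to right and back):
          | 
          |   import numpy as np
          | 
          |   def circular_pan(mono, sr, seconds_per_orbit=4.0):
          |       """Constant-power pan of a mono signal to stereo."""
          |       n = np.arange(len(mono))
          |       az = 2 * np.pi * n / (sr * seconds_per_orbit)
          |       pan = (1 + np.sin(az)) / 2     # fold azimuth into [0, 1]
          |       theta = pan * np.pi / 2        # constant-power pan law
          |       left = mono * np.cos(theta)
          |       right = mono * np.sin(theta)
          |       return np.stack([left, right], axis=1)
          | 
          |   sr = 44100
          |   t = np.arange(4 * sr) / sr
          |   alarm = 0.5 * np.sign(np.sin(2 * np.pi * 880 * t))  # crude beep
          |   stereo = circular_pan(alarm, sr)  # "an alarm circling you"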
        
         | Jerrrry wrote:
         | Dolby wouldn't appreciate it.
        
       | gyumjibashyan wrote:
       | This is crazy tech!
        
       | 2Gkashmiri wrote:
        | So... wait for a LLaMA of audio, train it on your own voice,
        | and "call" your friends by typing text and letting the software
        | speak, instead of actually saying the words? This is going to
        | be nice for authentication, proving to a third party that you
        | are yourself.
        
       | PcChip wrote:
       | it's funny how they're all very impressive except the death metal
        
       | skybrian wrote:
       | As an amateur musician, I'd be more interested in these tools if,
       | along with the text description, they took as input a melody or
       | chord progression or performance data. Maybe ABC notation or a
       | MIDI track? Anyone doing that?
       | 
       | Other cool things would be a way to generate a sampled instrument
       | from a text description, or to generate a new track given a text
       | description and all the previous tracks for other instruments.
       | There could be a new generation of audio tools that let you
       | generate placeholders or better for everything.
        
         | l33tman wrote:
          | The analogue from Stable Diffusion would be ControlNet, where
          | you train a superimposed model on auxiliary data. This should
          | be possible to do with chords, for example, just like you can
          | with human poses, 3D depth maps, etc. in Stable Diffusion
          | using ControlNet.
        
           | emadm wrote:
           | It's coming
        
       | _sys49152 wrote:
        | gamechanging stuff for sample-based rap producers. haven't been
        | able to log in yet but i think a good benchmark to start off
        | with is to see if it can replicate the 'al green' sound from
        | the early 70s - very distinct sounding production - drumless
        | and instrumental.
        | 
        | you don't need 45 or 90 straight seconds of a coherent song
        | rendered. just need to dip in the 45 sec clip and cut out 4
        | seconds here, another 4 there. reroll those cuts through stable
        | audio, keep rolling, keep rolling. cut up and get a pile of
        | clips together. arrange, layer, voila - you saved money on
        | paying royalties for sampling.
        | 
        | the lofi melodic sample on the stability page was passable.
        | thought the bluegrass one sounded great actually. imagine being
        | able to program bluegrass like rap.
        | 
        | edit: oof. fully trained on a licensed commercial dataset from
        | AudioSparx. muzak in, muzak out.
        
       | randcraw wrote:
        | I wonder if it makes sense to generate individual voices rather
        | than a combo of instruments, and then combine those with an
        | arranger DNN. I would think it'd be much easier to capture each
        | instrument's transients and dynamics that way, to say nothing
        | of allowing more subtlety in how they combine, like allowing
        | the lead voice to shift among instruments, or even letting the
        | listener choose how each voice expresses itself stylistically
        | and how the voices should combine.
        | 
        | Trying to do all of that in a single DNN, much less
        | parameterize it usably, seems overly ambitious (or will be of
        | more limited value ultimately).
        
       | kherud wrote:
        | Thank you for sharing! On a tangent: I'm wondering if there are
        | any good open source models/libraries to reconstruct audio
        | quality. I'm thinking about an end-to-end open source
        | alternative to something like Adobe Podcast [1] to make noisy
        | recordings sound professional. Anecdotally it's supposed to be
        | very good. In a recent search, I haven't found anything
        | convincing. In my naive view this task seems much simpler than
        | audio generation and the demand far bigger, since not everyone
        | has a professional audio setup ready at all times.
        | 
        | [1] https://podcast.adobe.com/
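        | 
        | For classic (non-generative) cleanup, one open-source starting
        | point is the noisereduce Python package, which does spectral
        | gating - far simpler than whatever Adobe Podcast does, but
        | often decent on stationary noise. A minimal sketch (file names
        | are placeholders; mono WAV assumed):
        | 
        |   import noisereduce as nr
        |   import soundfile as sf
        | 
        |   # Estimate a noise profile from the signal itself and gate
        |   # it out per frequency band.
        |   audio, rate = sf.read("noisy_recording.wav")
        |   cleaned = nr.reduce_noise(y=audio, sr=rate)
        |   sf.write("cleaned_recording.wav", cleaned, rate)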
        
         | cosmok wrote:
         | I have had a lot of success with this:
         | https://ultimatevocalremover.com/ for de-noising
        
         | joshspankit wrote:
         | There seems to have been a fork in the road:
         | 
         | On one side the tech for literal denoising has stagnated a bit.
         | It's a very hard problem to remove all noise while keeping
         | things like transients.
         | 
          | On the other side, AI is being rapidly developed for its
          | ability to denoise by recreating the recording, just without
          | the noise.
        
           | earthnail wrote:
           | In our denoiser (see other comment), we worked on combining
           | these two forks. That's how we can mathematically guarantee
           | great audio quality.
           | 
           | This combination was non-trivial as training old school DSP
           | denoisers is not easily possible. We'll describe the math
           | needed in our paper. We hope our publication will help the
           | wider community work not just on denoising but also tasks
           | like automatic mixing.
        
         | earthnail wrote:
         | We've been researching an audio denoiser for music that we will
         | present at the AES conference in October. Description page:
         | https://tape.it/denoising
         | 
         | We'll also publish a webapp where you can use the denoiser for
         | free. Mail me if you want beta access to it (email in profile).
         | 
         | It won't be open-source though, although the paper will of
         | course be public. It will also only reduce noise, and not
         | reconstruct other aspects of audio quality. However, it can do
         | so on any audio (in particular music), not just speech like
         | Adobe Podcast, and it fully preserves the audio quality. It's
         | designed exactly for the use case you want: to make noisy
         | recordings sound professional.
        
           | white_beach wrote:
           | denoising seems to fail in the guitar and vocals example
        
             | earthnail wrote:
             | Can you clarify where it fails? It's designed to remove
             | stationary noise only, and removes it very well in the
             | guitar and vocals example.
             | 
              | Generally speaking, if you have other sounds that you
              | don't want in the audio, we don't remove them - it's hard
              | to decide from a musical point of view whether you want a
              | certain sound or not. To give an extreme example: a
              | barking dog probably doesn't belong in a Zoom conference,
              | but it may very well belong in your audio recording.
              | Removing such elements would be a creative decision.
              | 
              | The guitar and vocals example has certain clicks in the
              | background that we don't remove - but the stationary
              | noise is gone. Existing professional (and complex) audio
              | restoration tools like iZotope RX don't remove those
              | clicks, either. It's a conservative approach, sure, but
              | in return you can throw any audio at it and it always
              | improves it.
        
           | haywirez wrote:
            | Are you sure the demo sound files are correct on the
            | website? I couldn't hear any glaringly obvious differences
            | between the original and denoised versions with
            | studio-grade headphones here. Or maybe the originals aren't
            | noisy enough.
        
         | whywhywhywhy wrote:
          | It's not open, but Nvidia has RTX Voice for free if you have
          | an Nvidia card.
          | 
          | The only weird thing is that it's designed to be used in real
          | time, but I've had some luck cleaning up voice recordings by
          | replaying them through it via audio routing.
        
       | hubraumhugo wrote:
       | Now imagine Spotify using this to generate individual earworms
       | for everybody based on their personal tastes (likes, playlists).
       | 
       | Yes, AI is partly hype, but had someone told me this even two
       | years ago, I wouldn't have believed it.
        
         | joshspankit wrote:
         | This is why it's _vital_ that AI is openly available. Imagine a
         | world where Spotify is the only company that can do that, and
         | they use it to make sure they never pay royalties again.
        
           | bee_rider wrote:
           | How is Spotify for finding new music based on your tastes?
           | I've only used Amazon and Pandora; Amazon is quite poor,
           | Pandora is pretty good. I suspect (although, without proof)
           | that if a service can't suggest new music, it will have
           | trouble generating new music as well.
           | 
            | Anyway, I very much would rather run this sort of thing
            | locally. You could just manually set your taste profile.
            | Plus, music can be quite personal: imagine you start
            | listening to too much music inspired by The Cure and
            | suddenly Amazon starts advertising black makeup and
            | antidepressants or something like that. It would be too
            | disconcerting.
        
             | magicalhippo wrote:
             | > How is Spotify for finding new music based on your
             | tastes?
             | 
              | I haven't tried any alternatives really, but so far I'd
              | say it's decent. I put on an album, and once it's over,
              | it'll play similarish stuff. If I don't like a song I'll
              | skip it, and it seems to incorporate that feedback.
              | 
              | The only thing is that it doesn't seem to be too
              | adventurous, and it adheres rather strictly to the local
              | context. Meaning, if I played a stoner rock track, it'll
              | continue suggesting stoner rock and not much else, even
              | though I have quite varied music favorited in my library.
              | 
              | Overall, though, I've found a lot of new bands I enjoy
              | that way, so it's been a positive experience for me.
             | 
             | edit: as an example, here are the two most recent ones it
             | suggested where I ended up buying the albums on Bandcamp.
             | Both have quite few monthly listeners (1-2k), so not what
             | I'd call mainstream.
             | 
             | SUIR https://open.spotify.com/artist/6zOeQ2hyNfqi9UMHtyTSlF
             | 
             | Mount Hush
             | https://open.spotify.com/artist/13clfeXxTPsDsqzSlLIBZJ
        
               | seanw444 wrote:
                | I'll +1. I'm generally a FOSS guy, so if Spotify hadn't
                | helped so much in discovering new music that I really
                | enjoy, I'd potentially be acquiring music via
                | questionable means and just playing it as audio files
                | directly.
               | 
               | There are two companies that have done well by Gabe
               | Newell's "piracy is a service problem - not a price
               | problem" position: Valve/Steam (who also contribute to
               | FOSS through Proton and SteamOS which I heavily
               | appreciate), and Spotify. Spotify makes discovering and
               | aggregating music so easy that the alternatives don't
               | seem appealing.
        
             | chankstein38 wrote:
             | Similarly to others, I haven't really used other things in
             | a LONG time but Spotify's Discover Weekly playlist is
             | usually a list of bops that I enjoy a lot. I frequently end
             | up adding a huge portion of them to my liked songs and to
             | regularly used playlists!
        
             | riskable wrote:
             | I've tried Amazon and Pandora. Spotify is so vastly
             | superior at finding music I like it's in an entirely
             | different league.
             | 
              | Having said that, starting about a year ago (maybe ~1.5
              | years?), Spotify started inserting _obviously_ paid
              | promotion tracks into my auto-generated "Daily Mix _n_"
              | playlists, and it seriously bothered me. When my playlists
              | are made up of very specific genres of EDM and suddenly a
              | pop song plays from a famous person, I get seriously
              | angry.
             | 
             | It hasn't happened in many months though so maybe they
             | learned their lesson. I was so mad I seriously considered
             | ending my Premium subscription right then and there when
             | that track played.
        
             | K5EiS wrote:
              | I find the Discover Weekly playlists that are made for me
              | are pretty hit or miss, but overall I have found many new
              | songs I like with their help.
        
               | chankstein38 wrote:
                | Same! New songs, new artists, new genres, it's pretty
                | cool! I agree it can be hit or miss, but it feels like
                | the longer I use it, and the older and more mature the
                | platform gets, the more of a hit Discover Weekly ends
                | up being!
        
               | viraptor wrote:
               | Similar - I found many good new things. I like the
               | discovery playlists, even with the tracks I don't like.
               | If they never missed, how could they ever suggest
               | something actually different and exciting? It's
               | "discover" not "average of what you already like".
        
         | hospitalJail wrote:
          | I really want this. I have a band that I like, and I want
          | more!
          | 
          | Or I'd like to take a song I like and make it educational,
          | like making it include the periodic table of elements.
        
         | [deleted]
        
         | liotier wrote:
         | Machine-generated music might be functionally equivalent to
         | human-generated music, but that ignores the cultural role of
         | art as a shared human experience - witness the liturgy of live
         | music. That can't happen with music tailored to each listener,
         | it can't happen without tracks that are fixed in time and can
         | be referred to. I can imagine it well-accepted for dynamic
         | music such as gaming soundtracks, but I suppose that machine
         | generation will be mostly a production technique resulting in
         | branded pieces.
        
           | jimmygrapes wrote:
           | I know quite a few technically talented musicians who have
           | next to no creativity in actually writing the music (aside
           | from jazz style improv sessions). Most of them never really
           | play live unless it's part of a similarly uncreative band of
           | college friends. I wonder if having a catchy/complex AI
           | generated song created for them to play live might be
           | interesting to them. Gonna check in and see what they think.
        
           | broast wrote:
           | > such as gaming soundtracks
           | 
           | These days I also feel like my workout playlists might as
           | well be randomly generated dance music.
        
           | Jeff_Brown wrote:
           | Saying that's impossible makes me immediately wonder whether
           | it's not. There are already headphone dance parties. What if
           | a musical act's output was being interpreted through genre
           | lenses specific to each listener?
        
         | ragazzina wrote:
         | Spotify does not need to generate a tailored earworm for me. It
         | could already suggest songs that I like based on my personal
         | taste out of their 100-million-songs catalog - and it's
         | absolutely unable to do it.
        
           | zachthewf wrote:
           | Building a tailored earworm might actually be easier.
        
       | cesaref wrote:
       | The death metal example reminded me of the continuous streaming
       | death metal here:
       | 
       | https://www.youtube.com/watch?v=MwtVkPKx3RA
        
       | shon wrote:
       | Ed Newton-Rex, VP of Audio at Stability, is speaking about how
       | this was built at The AI Conference in 2 weeks.
       | https://aiconference.com
        
       | iandanforth wrote:
        | The solo piano was interesting because of how clean it is. I
        | can imagine going from that sample to a score without too much
        | difficulty. Once it's in a symbolic format it becomes much more
        | flexible and re-usable.
        | 
        | While this does _not_ seem to be the trend, I hope more gen AI
        | in the audio and visual realms starts to produce more
        | structured / symbolic output. For example, if I were Adobe I
        | would be training models to output not full images, but either
        | layers or brush strokes and tool palette usage. Same for
        | organizations that have all the component tracks of music to
        | work with.
        
         | Jeff_Brown wrote:
         | That raises an interesting difference between cleaning AI-
         | generated sound and cleaning ordinary recordings. In an
         | ordinary recording, there is an objective reality to discover
         | -- a certain collection of voices was summed to create a
         | signal. With (most? the best?) existing AI audio generation,
         | the waveform is created from whole cloth, and extracting voices
         | from it is an act of creation, not just discovery.
         | 
         | I've come across AI-generated music that outputs something like
         | MIDI and controls synthesizers. Its audio quality was crystal-
         | clear, but the music was boring. That's not to say the approach
         | is a dead-end, of course -- and indeed, as a musician, the idea
         | of that kind of output is exciting. But getting good data to
         | train something that outputs separate MIDI-ish voices seems
         | much harder than getting raw audio signals.
        
           | fnordpiglet wrote:
            | Generative models can certainly create MIDI, but no one has
            | done it yet. Given that the technique is making video,
            | audio, images, and language, all you need to do is train
            | and build a model with an appropriate architecture.
           | 
           | It's easy to forget this is all pretty new stuff and it still
           | costs a lot to make the base models. But the techniques are
           | (more or less) well documented and implementable with open
           | source tools.
        
             | jskherman wrote:
             | I believe Spotify's Basic Pitch[0] is already some work
             | towards building something like this.
             | 
             | [0]: https://basicpitch.spotify.com/about
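              | 
              | For the curious, the advertised usage is roughly this (a
              | sketch based on Basic Pitch's README - check the current
              | docs; the file name is a placeholder):
              | 
              |   from basic_pitch.inference import predict
              | 
              |   # Transcribe audio to notes; midi_data is a PrettyMIDI
              |   # object you can edit or save.
              |   model_output, midi_data, note_events = predict("piano.wav")
              |   midi_data.write("piano.mid")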
        
             | MrCheeze wrote:
             | It has been done - first by OpenAI (MuseNet, which is no
             | longer available) and later by Stanford (Anticipatory Music
             | Transformer):
             | https://nitter.net/jwthickstun/status/1669726326956371971
        
             | TheActualWalko wrote:
             | We've done it! wavtool.com
        
               | fnordpiglet wrote:
               | That's really neat. How long have you been working on
               | this?
        
             | radarsat1 wrote:
             | > Generative models can certainly create midi, but no one
             | has done it yet.
             | 
             | Note sequence generation from statistical models has a long
             | history, at least as long if not longer than text
             | generation.
             | 
             | Have a look at section 2.1 of this survey paper [0] that
             | cites a paper from 1957 as the first work that applies
             | Markov models to music generation.
             | 
             | And, of course, plenty of follow-up work 6 decades later on
             | GANs, LSTMs, and transformers.
             | 
             | [0]: https://www.researchgate.net/publication/345915209_A_C
             | ompreh...
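              | 
              | The 1957-era idea fits in a few lines: a toy first-order
              | Markov chain over MIDI note numbers (the "corpus" here is
              | made up):
              | 
              |   import random
              | 
              |   melody = [60, 62, 64, 65, 64, 62, 60, 67, 65, 64, 62, 60]
              | 
              |   # Count transitions between consecutive notes.
              |   transitions = {}
              |   for a, b in zip(melody, melody[1:]):
              |       transitions.setdefault(a, []).append(b)
              | 
              |   # Random-walk the chain to "compose" a new melody.
              |   random.seed(1)
              |   note = melody[0]
              |   generated = [note]
              |   for _ in range(15):
              |       note = random.choice(transitions.get(note, melody))
              |       generated.append(note)
              |   print(generated)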
        
               | fnordpiglet wrote:
                | Yes, in fact I think at some point everyone has written
                | their own Markov generators, or at least run Dissociated
                | Press. But we've really only seen meaningfully high
                | quality output over the last few years.
        
               | fassssst wrote:
               | Do you know if anyone has tried training a text-to-music
               | or text-to-midi model where the training data includes
               | things like emotion labels for each note interval or
               | chord progression?
        
               | Jeff_Brown wrote:
                | That sounds expensive and inefficient. People's
                | interpretations of music (and abstract art more
                | generally) can be shockingly different; I suspect the
                | model would not get a clear signal from the result.
               | 
               | But that makes me wonder to what extent labeling can be
               | programmed -- extracting chord changes, dynamics changes,
               | tempo, gross timbral characteristics, etc.
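                | 
                | Some of that is already scriptable with off-the-shelf
                | tools; a rough sketch with librosa (the feature
                | choices are just my guess at useful labels):
                | 
                |   import librosa
                |   
                |   # Programmatic labels for one track
                |   # (the path is a placeholder).
                |   y, sr = librosa.load("track.wav")
                |   tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
                |   chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
                |   rms = librosa.feature.rms(y=y)  # dynamics proxy
                |   cent = librosa.feature.spectral_centroid(
                |       y=y, sr=sr)  # gross timbre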
        
               | fassssst wrote:
                | And maybe even labels like popularity, play count,
                | etc., so it has a better sense of what "sounds good"
                | to certain groups.
        
         | fnordpiglet wrote:
          | There are a lot of LoRA models being made to generate
          | textures, maps, diagrams, backgrounds, etc. You don't need
          | to wait for Adobe; open-source models like Stable Diffusion
          | let you do whatever you think is useful. I'd look to the
          | open-source world for creative innovation. Adobe is just
          | doing what's on the product-management roadmap.
        
         | miohtama wrote:
          | Having music editable for human post-production is necessary
         | for most professional adoption. Generating MIDIs would make
         | much more sense than generating raw audio.
         | 
         | This is what we do with AI images: you can fix them in
         | Photoshop, etc. You cannot do this for raw audio due to how
         | music is produced.
        
           | waffletower wrote:
           | Build or seek out a MIDI generating model. I hope Stable
            | Audio is _never_ the place for that. MIDI is deeply
            | lossy, and it would be a tragedy if it were the only
            | music representation. Imagine if, instead of phonographs,
            | compact discs, and streaming audio, we only had piano
            | rolls. What a loss indeed.
        
             | iainctduncan wrote:
              | MIDI is not lossy; MIDI is symbolic. There's a huge
              | difference.
        
               | Applejinx wrote:
               | No, it's lossy. It's an event model at a fixed data rate.
               | You can only do so many things sequentially, even if you
               | could represent any possible musical concept as a MIDI
               | event. So even if you're not sticking to note-on, note-
               | off, it's still extremely lossy.
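                | 
                | (Concretely: the MIDI 1.0 serial link runs at 31250
                | baud, 8N1, so about 3125 bytes/s; a note-on is 3
                | bytes, so one cable tops out around a thousand note
                | events per second, shared across all 16 channels.)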
        
               | miohtama wrote:
                | MIDI 2.0 improves a lot of things: how much dynamics
                | and variation you can express. MIDI 1.0 is a standard
                | from the early 80s. It does have shortcomings, but
                | the upside is editability.
                | 
                | It's then up to the remaster / musician / actual
                | interpretation / post-production to turn the score
                | into something less of an event model.
        
               | iainctduncan wrote:
                | Well, ok, you could say it's a lossy format for capturing
               | physical movement, which it certainly is. My point is
               | that it is not a format for capturing _music_ any more
               | than a score on a page is a format for capturing music.
               | Both are instructions for a performer of music (one
               | machine, one human), which is a very different beast.
        
               | gamblor956 wrote:
               | MIDI is able to accommodate nearly everything that can be
               | represented through a musical score and instrumental
               | performance. What are you hoping to accomplish with AI-
               | generated waveforms that can't be done with MIDI?
        
               | mdp2021 wrote:
               | > _What_
               | 
               | "Intention" (as a tentative term)
               | 
                | The question becomes: what has impeded the creation
                | of a MIDI file that could be confused with an actual
                | concert by Arturo Benedetti Michelangeli?
        
               | gamblor956 wrote:
               | Literally nothing is preventing this other than that
               | nobody has bothered to take the time to do it.
               | 
               | The current version of MIDI is capable of replicating any
               | of his performances, even down to the randomness.
               | 
               | Note that if you want to replicate the audio quality of
               | his performances, you will need a high-quality MIDI
               | instrument; the ones that ship with Windows will not
               | suffice. These MIDI instruments can range from a few
                | dollars to thousands of dollars. (See, e.g., Native
                | Instruments.)
        
               | mdp2021 wrote:
               | > _nobody has bothered to take the time to do it_
               | 
                | In that case, we have a theoretical suggestion that
                | <<nothing is preventing this>>, but not actual proof
                | based on a "Turing test"-like scenario that would
                | have specialists fooled, to corroborate that the new
                | MIDI 2 would suffice.
        
               | gamblor956 wrote:
               | I gave you the knowledge to do this research yourself,
               | but since you are unwilling to do so, here is an example
               | of the performance possible using MIDI instruments:
               | https://www.youtube.com/watch?v=CvaChiq6gf0
               | 
               | As it is clear that you intend to keep shifting the
               | goalposts to make a point that can't be made, I will
               | withdraw from further participation in this thread.
        
               | waffletower wrote:
                | This is absurd. Sure, someone above posits that MIDI
                | could perhaps represent a piano performance by Arturo
                | Benedetti Michelangeli. I think it has been able to
                | do a passing job at that, _when you provide a decent
                | piano_. Regardless, piano rolls have been able to
                | come close since the early 20th century. But how well
                | does MIDI represent music performed by John Coltrane?
                | Jimi Hendrix? It falls on its face. The long-
                | fetishized Western music notation abstraction, which
                | MIDI poorly simulates, completely fails for many
                | important examples of music. I would even venture to
                | say that MIDI fails for _most_ of them. But yes, MIDI
                | is well optimized for piano music where an acceptable
                | piano or simulation is available.
        
               | waffletower wrote:
               | The problems have long been known and articulated:
                | http://www.music.mcgill.ca/~gary/courses/papers/Moore-Dysfun...
        
               | gamblor956 wrote:
                | Your example of the failures of MIDI is based on a
                | 35-year-old paper (from 1988!!!) about an earlier
                | version of MIDI?
               | 
                | When that paper was written, it took several _weeks_
                | and many millions of dollars of equipment to render
                | primitive, mono-color 3D graphics. Desktop computers
                | had 512 _kilobytes_ of RAM, and the highest-end
                | desktops had 32 MB of hard drive storage. Computer
                | screens had two colors: black and green. Audio cards
                | capable of making beeps and clicks were the cutting
                | edge. WiFi was still a decade away.
        
               | waffletower wrote:
                | You clearly didn't read the article, and you clearly
                | don't understand how prevalent the MIDI 1.0
                | specification still is today. MIDI 2.0 is a very
                | recent development (ratified only in 2020!) and has
                | yet to see commercial adoption. The 1983 design is
                | what is largely in use today. At the time of its
                | initial development, commercial synthesizers, not
                | sound cards, were the intended generators of sound
                | using MIDI:
                | https://www.vintagesynth.com/roland/juno106.php
        
               | waffletower wrote:
                | MIDI is an extraordinarily lossy music representation.
                | Even Claude Shannon would facepalm at the assertion
                | that it could, even in theory, represent audio
                | faithfully. That is not its purpose, it is decidedly
                | not its practice, and it is ludicrously irrelevant
                | pedantry to say otherwise. The false equivalence
                | asserted by the commons can be aggravating :D
        
               | iainctduncan wrote:
               | MIDI is not a lossy format for audio because it's not a
               | representation of audio, period. It's a format for
               | conveying the motion of a piano keyboard, meant from the
               | beginning to be usable for various forms of audio.
        
               | Jeff_Brown wrote:
               | MIDI is not _inherently_ lossy. You could encode anything
               | in it, just as you can encode any novel as an integer.
               | 
               | In practice, though, transformations from audio to MIDI
               | discard an enormous amount of important information, with
               | the possible exception of transcriptions of performances
               | on piano (where volume, frequency, duration and a good
               | physical model of a piano are enough to reconstruct
               | everything important about the signal) and similar
               | instruments.
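                | 
                | In the "encode anything" sense that's literally true:
                | you can smuggle arbitrary bytes through a MIDI file
                | as SysEx data. A toy sketch with mido (SysEx data
                | bytes must stay in 0-127, hence the nibble split):
                | 
                |   import mido
                |   
                |   payload = b"any novel, as bytes"
                |   # Split each octet into two 7-bit-safe nibbles.
                |   nibbles = []
                |   for b in payload:
                |       nibbles += [b >> 4, b & 0x0F]
                |   
                |   mid = mido.MidiFile()
                |   track = mido.MidiTrack()
                |   mid.tracks.append(track)
                |   track.append(mido.Message("sysex", data=nibbles))
                |   mid.save("novel.mid")  # lossless, but not music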
        
         | schazers wrote:
         | I strongly agree about generating "editables" rather than
         | finalized media. In fact, that's why text generators are more
         | useful than current media generators: text is editable by
         | default. Here's a tweetstorm about it:
         | https://x.com/jsonriggs/status/1694490308220964999?s=20
        
           | waffletower wrote:
            | Audio is definitely editable. While generative audio is
            | new, I am hopeful that a host of interesting applications
            | (audio2audio, etc.) will emerge within its ecosystem.
            | Promising signal separation (audio to stems) and pitch
            | detection tools already exist for raw audio signals.
            | Rather than pushing Stability to focus on symbolic
            | representations (such as severely lossy MIDI), I hope you
            | first try adapting to tools that work fundamentally with
            | rich audio signals. Perhaps there will be room for
            | symbolic music AI, and perhaps Stability will even
            | develop additional models that generate schematic music,
            | but please, please don't sacrifice audio generality for
            | piano-roll thinking alone. LoRAs will undoubtedly be
            | usable to generate more schematic audio via the Stable
            | Audio model -- I imagine they could easily be repurposed
            | to develop sample libraries compatible with DAW (digital
            | audio workstation), sequencer, and tracker production
            | workflows.
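            | 
            | Stem separation, for instance, is a few lines today; if
            | I recall the demucs README right, its CLI entry point
            | can be driven from Python (the path is a placeholder):
            | 
            |   import demucs.separate
            |   
            |   # Splits a mix into drums/bass/vocals/other stems.
            |   demucs.separate.main(["--mp3", "track.mp3"])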
        
             | visarga wrote:
              | Train the model with MIDI notes as text in the prompt
              | and the audio as the target. It will learn to interpret
              | notes.
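              | 
              | A sketch of what one training pair might look like
              | under that scheme (the note-to-text encoding is
              | invented; any aligned MIDI/audio corpus would do):
              | 
              |   import librosa
              |   import pretty_midi
              |   
              |   # Paths are placeholders.
              |   midi = pretty_midi.PrettyMIDI("take_01.mid")
              |   notes = [
              |       pretty_midi.note_number_to_name(n.pitch)
              |       + f"@{n.start:.2f}"
              |       for inst in midi.instruments
              |       for n in inst.notes
              |   ]
              |   prompt = "solo piano, notes: " + " ".join(notes)
              |   target, sr = librosa.load("take_01.wav", sr=44100)
              |   # (prompt, target) then joins the text-to-audio
              |   # training set like any caption/audio pair.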
        
               | waffletower wrote:
                | Not all music is well represented with notes, nor are
                | audio datasets with high-quality note representations
                | readily available. But I guess if you work hard
                | enough you can get close:
                | https://www.youtube.com/watch?v=o5aeuhad3OM
                | My example still sounds like the chiptune simulation
                | that it is, however.
        
         | tech_ken wrote:
          | I was wondering the same thing; it definitely seems like
          | generating the raw waveform runs into all kinds of weird
          | issues (like the ones touched on in this post). I would
          | imagine that training data would be a serious chokepoint
          | here. Given how much discourse is currently kicking off
          | around the intellectual property rights of just the final
          | product (the mastered track), I can't imagine many
          | musicians would be eager to share what is effectively the
          | "proof of ownership" (track stems or MIDIs).
        
         | dylan604 wrote:
          | >For example, if I were Adobe I would be training models,
          | not to output full images, but either layers or brush
          | strokes and tool palette usage. Same for organizations that
          | have all the component tracks of music to work with.
          | 
          | I really like this idea: creating new tools for artists to
          | use to create, rather than whatever we're accepting as use
          | now. The current use of full-image generation is boring to
          | me in the same way the choice of invisibility as a
          | superpower is. Invisibility is ultimately going to slide
          | into pervy tendencies, just as deepfakes will slide the
          | same way, or into some other inappropriate use.
        
         | waffletower wrote:
          | Hopefully, the entire industry will _NOT_ move in such a
          | schematic and lossy direction. Use separate tools to
          | analyze audio streams, please. Don't throw the timbre baby
          | out with the bathwater. MusicGen uses a tokenized
          | transformer model for music, which is attractive for
          | symbolic translation use cases. However, its overall audio
          | quality is far lossier than the examples you hear from
          | Stable Audio. I believe that symbolic representation should
          | not be the foundational approach for representing and
          | generating rich audio signals.
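          | 
          | For anyone who wants to hear that trade-off themselves, a
          | minimal sketch with Meta's audiocraft package (model name
          | per its README, as best I recall):
          | 
          |   from audiocraft.models import MusicGen
          |   from audiocraft.data.audio import audio_write
          |   
          |   model = MusicGen.get_pretrained(
          |       "facebook/musicgen-small")
          |   model.set_generation_params(duration=8)
          |   wav = model.generate(["upbeat bluegrass, banjo"])
          |   # 32 kHz codec output -- audibly lossier than
          |   # raw-audio diffusion.
          |   audio_write("out", wav[0].cpu(), model.sample_rate)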
        
         | gabereiser wrote:
          | Yessssss! I thought about MusicGAN and Markov chains last
          | night, thinking "why can't we just codify all chords, use a
          | GAN to generate Markov chains over the chords of a key, and
          | have AI generate the instruments and waveform from those
          | chains?" IANA researcher, but in my head that sounded
          | logical.
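          | 
          | The Markov half of that is genuinely a few lines. A toy
          | sketch (the chord transition table is invented, in C
          | major):
          | 
          |   import random
          |   
          |   chains = {
          |       "C":  [("F", 0.4), ("G", 0.4), ("Am", 0.2)],
          |       "F":  [("C", 0.5), ("G", 0.5)],
          |       "G":  [("C", 0.7), ("Am", 0.3)],
          |       "Am": [("F", 0.6), ("G", 0.4)],
          |   }
          |   
          |   def walk(start="C", length=8):
          |       out, chord = [start], start
          |       for _ in range(length - 1):
          |           nxt, wts = zip(*chains[chord])
          |           chord = random.choices(nxt, weights=wts)[0]
          |           out.append(chord)
          |       return out
          |   
          |   print(walk())  # e.g. ['C', 'G', 'C', 'Am', ...]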
        
           | TylerE wrote:
           | That's existed for decades. It's called Band in a Box. It's
            | also cheesy as hell.
        
             | gabereiser wrote:
              | lol, no. Not autogenerating MIDI (although the latest
              | versions of BiaB are pretty darn good now), but
              | generating the waveforms together. It would be similar
              | to having AI generate whole scores of music while
              | ensuring it's all in sync and in key -- not taking a
              | sample database of 88 sound files and triggering them
              | when the MIDI note strikes.
        
               | TylerE wrote:
                | That's not how BiaB works at all. It has all kinds of
                | patterns built into it, so it knows how to generate,
                | say, a bluegrass bassline in a given key. There are
                | plenty of ways to play back MIDI with high sound
                | quality, including feeding it into an AI-driven VST
                | like NotePerformer.
        
               | gabereiser wrote:
                | Then why do you categorize it as cheesy? It sounds
                | pretty cool.
        
               | 93po wrote:
               | he's saying the output is cheesy. it sounds like stuff
               | you'd hear on a demo track for a kid's toy piano
        
               | TylerE wrote:
                | It's not THAT bad; it's just that for a program
                | targeting jazz, the playback is rather... square.
        
               | gabereiser wrote:
                | Yeah, I can definitely imagine certain genres
                | sounding off due to the rigid nature of computer
                | timing. Jazz doesn't follow rules like that, so good
                | jazz has timing idiosyncrasies that make it sound the
                | way it does. That, and a ridiculous obsession with
                | adding half-step and quarter-tone intervals.
        
       | dylan604 wrote:
        | Just like all examples of generative "AI" I've seen, there's
        | always some uncanny-valley vibe present. In the audio
        | examples, there's always this weird distortion, as if really
        | poorly compressed sources were used as training data. The
        | sounds are muddled together, and rarely do I hear clean
        | musical voices. It's just a smear of sounds coming together
        | that our brains try really hard to resolve into an "oh,
        | that's a _____" situation. While the samples in TFA are
        | probably the closest I've heard to date, the issue is still
        | present.
        | 
        | I guess the thing that strikes me as odd about the generative
        | wave is all the press releases presenting things as if they
        | were a final product, when the output quality is clearly pre-
        | release beta at best, more likely alpha. If a non-AI product
        | released something so clearly unfinished, it would be panned
        | to no end for not working.
        
         | [deleted]
        
       ___________________________________________________________________
       (page generated 2023-09-13 23:00 UTC)