[HN Gopher] Riffusion - Stable Diffusion fine-tuned to generate ...
       ___________________________________________________________________
        
       Riffusion - Stable Diffusion fine-tuned to generate Music
        
       Author : MitPitt
       Score  : 1481 points
       Date   : 2022-12-15 13:26 UTC (9 hours ago)
        
 (HTM) web link (www.riffusion.com)
 (TXT) w3m dump (www.riffusion.com)
        
       | TuringNYC wrote:
       | I read the article:
       | 
       | "If you have a GPU powerful enough to generate stable diffusion
       | results in under five seconds, you can run the experience locally
       | using our test flask server."
       | 
        | Curious what sort of GPU the author was using, or what the
        | minimum requirements might be?
        
         | genewitch wrote:
          | An RTX 3070 can generate SD results in under 5 seconds,
          | depending on settings. With Euler A, 20 steps, 512x512, it
          | can almost do 4 images in 5 seconds.
          | 
          | It's possible a 3060 might work too. In my experience the
          | 3060 is about 50% slower than the 3070, but that may be a
          | bad 3060 in our test rig. Still, a 3060 gets pretty close to
          | 5 seconds for an image, so try it if you have one.
          | 
          | I just tested the prompt "a test pattern for television" on
          | both cards: the 3070 took 1.87s and the 3060 took 2.93s.
          | Similar results for the prompt "an intricate cityscape, like
          | new york".
          | 
          | Edit: I should note we're using SD 1.4, not 1.5, although I
          | think that only affects the model checkpoint, not the
          | algorithm; I could be wrong.
          | 
          | Also, the model is over 14GB, so perhaps the 3070 can't do
          | it after all. I'll test it as soon as the local admin wakes
          | up and downloads it onto our machine.
        
         | seth_ wrote:
          | Author here: FWIW we are running the app on A10G GPUs, which
          | can generally turn around a 512x512 image in 3.5s with 50
          | inference steps. That time includes converting the image into
          | audio, which should also be done on the GPU for real-time
          | purposes. We did some optimizations such as a traced UNet,
          | fp16, and removing autocast. There are lots of ways it could
          | be sped up further, I'm sure!
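
          A minimal sketch of that setup with the Hugging Face diffusers
          library (illustrative only, not the authors' serving code;
          loading the weights in fp16 removes the need for an autocast
          context, and the traced-UNet step is only indicated):

              import torch
              from diffusers import StableDiffusionPipeline

              # Loading in half precision avoids torch.autocast at
              # inference time.
              pipe = StableDiffusionPipeline.from_pretrained(
                  "runwayml/stable-diffusion-v1-5",
                  torch_dtype=torch.float16,
              ).to("cuda")

              # Further speedup: torch.jit.trace over pipe.unet with
              # example latents/timestep/text embeddings, then swap the
              # traced module in for pipe.unet.
              image = pipe("solo acoustic guitar",
                           num_inference_steps=50).images[0]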
        
       | nathias wrote:
        | Very cool! I was wondering why there weren't any music diffusion
        | apps out there. It seems all the more useful because music has
        | stricter copyright and all content creators need some background
        | music ...
        
       | corysama wrote:
       | Can anyone confirm/deny my theory that AI audio generation has
       | been lagging behind progress in image generation because it's way
       | easier to get a billion labeled images than a billion labeled
       | audio clips?
        
         | madelyn wrote:
          | Sound is a lot higher fidelity; it's harder to make the
          | information available to a computer without serious
          | downsampling or simplification.
          | 
          | Consider sounds over 12 kHz. On a spectrogram, during a chorus
          | or drop that area is lit up, with so many things changing from
          | millisecond to millisecond. A lot of AI samples really struggle
          | at high frequencies, or even forgo them entirely.
          | 
          | MIDI-based approaches have been really great though, and an
          | approach like the OP's is fascinating (and impressive).
        
       | motoxpro wrote:
        | This is just insane. Sooooo incredible. You don't really realize
        | how far things have come until it hits a domain you're extremely
        | familiar with. I spent 8-9 years in music production and the
        | transition stuff blew me away.
        
       | haykmartiros wrote:
        | Other author here! This got posted a little earlier than we
        | intended, so we didn't have our GPUs scaled up yet. Please hang
        | on and try throughout the day!
       | 
       | Meanwhile, please read our about page http://riffusion.com/about
       | 
       | It's all open source and the code lives at
       | https://github.com/hmartiro/riffusion-app --> if you have a GPU
       | you can run it yourself
       | 
        | This has been our hobby project for the past few months. Seeing
        | the incredible results of stable diffusion, we were curious if we
        | could fine tune the model to output spectrograms and then convert
        | to audio clips. The answer to that was a resounding yes, and we
        | became addicted to generating music from text prompts. There are
        | existing works for generating audio or MIDI from text, but none
        | as simple or general as fine tuning the image-based model.
        | 
        | Taking it a step further, we made an interactive experience for
        | generating looping audio from text prompts in real time. To do
        | this we built a web app where you type in prompts like a jukebox,
        | and audio clips are generated on the fly. To make the audio loop
        | and transition smoothly, we implemented a pipeline that does
        | img2img conditioning combined with latent space interpolation.
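
        As a rough illustration of the latent space interpolation half of
        that pipeline (a generic slerp over initial noise tensors, not the
        authors' exact code):

            import torch

            def slerp(t, v0, v1):
                # Spherical interpolation keeps intermediate latents on the
                # same "shell" as Gaussian noise, which the denoiser expects.
                dot = torch.sum(v0 * v1) / (v0.norm() * v1.norm())
                theta = torch.acos(dot.clamp(-1.0, 1.0))
                return (torch.sin((1 - t) * theta) * v0
                        + torch.sin(t * theta) * v1) / torch.sin(theta)

            a = torch.randn(1, 4, 64, 64,
                            generator=torch.Generator().manual_seed(0))
            b = torch.randn(1, 4, 64, 64,
                            generator=torch.Generator().manual_seed(1))
            # Denoising each interpolated latent (with the prompt embeddings
            # blended the same way) yields spectrograms that morph smoothly,
            # hence audio that transitions cleanly.
            frames = [slerp(t.item(), a, b) for t in torch.linspace(0, 1, 8)]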
        
         | jtode wrote:
         | As one of the meatsacks whose job you're about to kill... eh, I
         | got nothin, it's damn impressive. It's gonna hit electronic
         | music like a nuclear bomb, I'd wager.
        
           | godelski wrote:
           | Why do you think this will kill your job? To me this looks
           | like an extension of the hip-hop genre.
        
           | spoiler wrote:
           | As a listener, I think you're probably still safe. Can you
           | use this to help you though? Maybe.
           | 
           | It's impressive what it produces, but I think it probably
           | lacks substance in the same way the visual AI art stuff does.
           | For the most part, it passes what I call the at-a-glanceness
           | test. It's little better than apophenia (the same thing that
           | makes you see shapes in clouds, faces in rocks, or think
           | you've recognised a familiar word in a foreign language; the
           | last one can happen more often though).
           | 
            | So, I think these tools will be used to do background work
            | (i.e. for visuals, maybe helping with background tasks in CGI
            | or faraway textures in games). I know less about audio, but I
            | assume it could maybe help a DJ create a transition between
            | two segments they want to combine, as opposed to making the
            | whole composition for them, but idk if that example makes
            | sense.
           | 
           | Now, onto a more human point: I think that people often
           | listen to music because it means something to them. Similar
           | for people who appreciate visual art.
           | 
           | I also love interactive and light art, and I love talking to
           | other artists at light festivals who make them because of the
           | stories and journeys behind their art too. Humans and art are
           | a package deal, IMO.
           | 
           | Edit: typos and to add: Also, I think prompt authorship is an
           | art unto itself. I'm amazed what people can craft with it,
           | but I'm more impressed by the craft itself than the outputs.
           | Don't get me wrong, the outputs are darn cool, but not if you
           | look closer. And it's impossible to look beneath the surface
           | altogether, as there is nothing in the output but the pixels.
        
             | ArmandTanzarian wrote:
              | As a musician and listener I'm inclined to agree. There
              | were a couple cool examples I bumped into, but some prompts
              | generate results that don't represent any single word or
              | combination of words that were presented to the AI.
             | 
             | What this means for the future is maybe a little more
             | unsettling however.
        
             | hahajk wrote:
             | I think this type of generative stuff opens up entirely new
             | possibilities. For the longest time I've wanted to host a
             | rowing or treadmill competition, where contestants submit a
             | music track. The tracks are mashed up with weighting based
             | on who is in the lead and by how much.
             | 
             | I don't know of existing tech that can generate actual good
             | mashups in realtime given arbitrary mp3s, but this has
             | promise!
        
             | api wrote:
              | In general all this stuff is chopping the bottom off the
              | market. AI art, code, writing, music, etc. can all generate
              | passable "filler" content, which will decimate all human
              | employment generating the same.
              | 
              | I don't think this stuff is a threat to genuinely
              | innovative, thoughtful, meaningful work, but that's the top
              | of the market.
              | 
              | That being said, the bottom of the market is how a lot of
              | artists make their living, so this is going to deeply
              | impact all forms of art as a profession. It might soon
              | impact programming too, because while GPT-type systems
              | can't do advanced high-level reasoning, they will chop the
              | bottom off the market and create a glut of employees that
              | will drive wages down across the board.
             | 
             | Basic income or revolution. That's going to be our choice.
        
               | gummydogg wrote:
               | The top of the market started at the bottom. Entry level
               | is requiring higher and higher skills and capabilities.
        
               | astrange wrote:
               | The only thing that affects whether you have a job is the
               | Federal Reserve, not how good productivity tools are. You
               | always have comparative advantage vs an AI, so you always
               | have the qualifications for an entry level job.
               | 
               | There will never be a revolution and there's no such
               | thing as late capitalism. Well, not if the Fed does their
               | job.
        
               | halkony wrote:
               | I see a lot of AI naysayers neglecting the comparative
               | advantage part.
               | 
                | If AI _completely_ eliminates low skill art labour from
                | the job pool, it's not like those affected by it are
                | gonna disintegrate, riot, and restructure society. They
                | have the choice of filling an art niche an AI can't, or
                | they can spend that time learning other, more in-demand
                | skills. This also ignores the fact that some companies
                | would rather reallocate you to more profitable projects
                | even if your art skills don't change.
               | 
               | Selling a product with relative value like a painting or
               | a sculpture will always be an uphill battle. Now that
               | there's more competition from AI, it just gives
               | artists/businesses incentive to find what people want
               | that an AI can't deliver. Worst case scenario, employment
               | rates in this sector are rough while the market
               | recalibrates. Interested to see how these technologies
               | develop.
        
               | Archipelagia wrote:
               | That seems a bit like wishful thinking.
               | 
                | People don't have unlimited ability to learn new skills.
                | Training takes time, and someone who spent several years
                | honing their craft won't be able to pick up a new skill
                | overnight.
                | 
                | On top of that, people have preferences regarding their
                | work - even if someone has the ability to do different
                | work, they might find it less meaningful and less
                | satisfying.
               | 
               | Finally, don't ignore the speed at which AI capabilities
               | improve. Compare GPT-1 with the current model, and how
               | quickly we got here. Eventually we'll get to a point
               | where humans just won't be able to catch up quickly
               | enough.
        
               | SoftTalker wrote:
               | I think specifically in the area of creative "products"
               | such as art and music you have to think about the
               | customer as well. I have zero interest in AI-created art
               | or music. None. The value of art is its humanity; its
               | expression of the artist's message, vision, and passion.
               | AI doesn't have that, so it's not of any interest to me.
               | 
                | I don't know how many customers feel the same way, but I
                | won't be purchasing any AI art or music or knowingly
                | giving it any of my attention.
        
               | astrange wrote:
               | The AI is a tool the human used to make it. Sometimes
               | clumsily, but sometimes they write poems as text prompts
               | and it's an illustration, or things like that.
               | 
                | If an AI is making and selling art by itself, it's
                | probably become sentient, and not patronizing it would be
                | speciesism.
        
               | tracerbulletx wrote:
               | If the only people who can have meaningful good paying
               | jobs are thoughtful geniuses we're in a lot of trouble as
               | a society still.
        
           | gravelc wrote:
           | I love simple generative approaches to get ideas, and go from
           | there. This seems like an extension of that (well, it's what
           | I'm going to try - sample the output, make stems, pull MIDI
           | etc). Will make the creative process more interesting for me,
           | not less.
           | 
            | Having said that, it's not my job, and I can see where the
            | issues lie there.
        
         | TOMDM wrote:
          | The audio sounds a bit lossy. Would it be possible to create
          | high quality spectrograms from music, downsample them, and use
          | that as training data for a spectrogram upscaler?
          | 
          | It might be the last step this AI needs to bring some extra
          | clarity to the output.
        
         | nico wrote:
         | Amazing work. Can this be applied to voice?
         | 
         | Example prompt: "deep radio host voice saying 'hello there'"
         | 
         | Kind of like a more expressive TTS?
        
           | seth_ wrote:
           | Author here: It can certainly be applied to voice, but the
           | model would need deeper training to speak intelligibly. If
           | you want to hear more singing, you can try a prompt like
           | "female voice", and increase the denoising parameter in the
           | settings of the app.
           | 
           | That said, our GPUs are still getting slammed today so you
           | might face a delay in getting responses. Working on it!
        
         | sergiotapia wrote:
         | Reach out to the Beatstars CEO. He was looking for an AI play
         | for his music producers marketplace. Probably solid B2B lead
         | there.
        
         | CapsAdmin wrote:
          | When you say fine tuned, do you mean fine tuned on an existing
          | stable diffusion checkpoint? If so, which?
         | 
         | It would be very interesting to see what the stable diffusion
         | community that is using automatic1111 version would do with
         | this if it were made into an extension.
        
           | haykmartiros wrote:
            | Yes, from
            | https://huggingface.co/runwayml/stable-diffusion-v1-5. Our
            | checkpoint works with automatic1111, and if you'd like to
            | make an extension to decode to audio, it should be pretty
            | straightforward:
            | https://github.com/hmartiro/riffusion-inference/blob/main/ri...
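
            The decoding direction could be sketched roughly like this with
            torchaudio (a guess at the pipeline's shape; the exact mel
            scaling and amplitude mapping would have to match the
            riffusion-inference code):

                import numpy as np
                import torch
                import torchaudio
                from PIL import Image

                # Read the generated image as a magnitude spectrogram with
                # low frequencies at the bottom row. Amplitude scaling is
                # hand-waved here.
                img = np.asarray(Image.open("riff.png").convert("L"),
                                 dtype=np.float32)
                mag = torch.from_numpy(img[::-1].copy()) / 255.0

                n_fft, sr = 1024, 44100
                to_linear = torchaudio.transforms.InverseMelScale(
                    n_stft=n_fft // 2 + 1, n_mels=mag.shape[0],
                    sample_rate=sr)
                griffin_lim = torchaudio.transforms.GriffinLim(
                    n_fft=n_fft, n_iter=32)
                audio = griffin_lim(to_linear(mag))
                torchaudio.save("riff.wav", audio.unsqueeze(0), sr)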
        
             | Metus wrote:
             | Can you run this on any hardware already capable of running
             | SD 1.5? I am downloading the model right now, might play
             | with this later.
             | 
              | Given the speed at which AI is developing these days, I'm
              | guessing someone is going to have the extension up in two
              | hours at most.
        
               | ronsor wrote:
               | I bet the AUTOMATIC1111 web UI music plugin drops within
               | 48 hours.
        
               | haykmartiros wrote:
               | Yes! Although to have real time playback with our
               | defaults you need to be able to generate 40 steps at
               | 512x512 in under 5 seconds.
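
                A quick way to check whether a given card clears that bar
                (sketch; the model id assumes the riffusion checkpoint is
                published on Hugging Face under that name):

                    import time
                    import torch
                    from diffusers import StableDiffusionPipeline

                    pipe = StableDiffusionPipeline.from_pretrained(
                        "riffusion/riffusion-model-v1",  # assumed id
                        torch_dtype=torch.float16,
                    ).to("cuda")

                    start = time.time()
                    # 512x512 is the SD v1 default output size.
                    pipe("uk garage beat", num_inference_steps=40)
                    print(f"{time.time() - start:.1f}s"
                          " (needs < 5s for real time)")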
        
               | Metus wrote:
                | Good to know. I was just so close, with just under 7s
                | using 40 steps and Euler a as the sampler.
        
         | abledon wrote:
          | This is groundbreaking! All other attempts at AI generated
          | music have, IMO, fallen flat... These results are actually
          | listenable, and enjoyable! It's almost frightening how
          | powerful this can be.
        
         | ozten wrote:
         | Amazing work! Did you use CLIP or something like that to train
         | genre + mel-spectrogram? What datasets did you use?
        
         | tartakovsky wrote:
          | Hi Hayk, I see that the inference code and the final model are
          | open source. I am not expecting it, but are the training code,
          | the dataset you used for fine-tuning, and the process to
          | generate the dataset open source as well?
        
         | asdf333 wrote:
          | Is classical music harder? I noticed you didn't have any
          | classical music tracks. I wonder if it is because it is more
          | structured?
        
         | lisper wrote:
         | Wow, I am blown away. Some of these clips are really good! I
         | love the Arabic Gospel one. John and George would have loved
         | this so much. And the fact that you can make things that
         | _sound_ good by going through _visual_ space feels to me like
         | the discovery of a Deep Truth, one that goes beyond even the
         | Fourier transform because it somehow connects the _aesthetics_
         | of the two domains.
        
           | tbalsam wrote:
           | I can simultaneously burst a bubble and provide fuel for more
           | -- the alignment of the intrinsic manifolds of different
           | domains has been an interesting research topic for zero shot
            | research for a few years. I remember seeing at CVPR 2018 the
            | first zero shot classifier (I think?) which, if I recall
            | correctly, was trained on two domains that ended up aligned
            | with each other closely enough to provide very good zero
            | shot accuracy.
           | 
           | Calling it a Deep Truth might be a bit of an emotional
           | marketing spin but the concept is very exciting nonetheless I
           | believe.
        
             | lisper wrote:
             | My characterization of it as a Deep Truth might just be a
             | reflection of my ignorance of the current state of the art
             | in AI. But it's still pretty frickin' cool nonetheless.
        
               | logicallee wrote:
               | Alright so this is a pretty amazing new development. I
               | want to tell you something about what the state of the
               | art is in AI. When you wrote that it is a deep truth it
               | was before I actually listened to the pieces. I had just
               | read the descriptions. At the time, I thought that you
               | were probably right because I was thinking that music is
               | only pleasing because of the structure of our brains it's
               | not like vision where originally we are interpreting the
               | world and that's where art comes from. Music is purely
               | sort of abstract or artistic. However, after I listened
               | to the pieces, I realised that they really sound exactly
               | like the instruments that are making the physical noises.
               | For example it really sounds exactly like a physical
               | piano. So I don't know about a deep truth, but it does
               | seem that there is a physical sense that the music
               | represents which it can successfully mimic using this
               | essentially image generating capability. One thing about
               | all of these amazing AI development, is that I still make
               | some long comments by dictating to Google. When it first
               | got to the point that it was able to catch almost
               | everything that I was saying I was absolutely blown away.
               | However, it's really not that good at taking dictation,
               | and I have to go back and replace each and every
               | individual comma and period with the corresponding
               | punctuation mark. Seeing such an amazing developments
               | happening month after month year after year it makes me
               | feel like we are really approaching what some people have
               | called the singularity. When I read about a net positive
               | fusion being announced my first instinct was to think oh
               | of course it's now that that ChatGPT is available of
               | course announcing a major fusion breakthrough would
               | happen within days to weeks it just makes perfect sense
               | that AI's can solve problems that have have confounded
               | scientists for decades. To see just how far we still have
               | to go take a look at how this comment read before I
               | manually corrected it to what I had actually said.
               | 
               | -- [I copied and pasted the below to the above and then
               | corrected it. Below is the original version. This is how
               | I dictate to Google sometimes, on Android. Normally I
               | would have further edited the above but in this case I
               | wanted to show how far basic things like dictation still
               | have to go. By the way I dictated in a completely quiet
               | room. I can't wait for more advanced AI like ChatGPT to
               | take my dictation.]
               | 
               | Alright so this is a pretty amazing our new development
               | period I want to tell you something about out why the
               | state of the heart is is in a i period when you wrote
               | that it is a deep truth it was before I actually listen
               | to The Pieces, I have just read the descriptions period
               | at the time, I thought that you were probably right
               | because I was thinking that music is only pleasing
               | because of the structure of our brains it's not like
               | vision where originally we are interpreting the world and
               | that Where Art comes from music is purely so dove
               | abstract or artistic period however, after I listen to
               | the pieces, I realise that they really sound exactly like
               | the instruments that are making the physical noises
               | period for example it really sounds exactly like a
               | physical piano period so I don't know about out a deep
               | truth karma but it does seem that there is a physical
               | sense that the music are represents which it can
               | successfully mimic using this essentially image
               | generating capability period one thing about all of these
               | amazing AI development, is that I still make some long
               | comments by dictating to Google. When it first got to the
               | point that it was able to catch almost everything then
               | was saying I was absolutely blown away period however,
               | it's really not that good at taking dictation karma and I
               | have to go back and replace each and every individual,
               | and period with with the corresponding punctuation mark
               | period seeing such an amazing developments happening
               | month after month year after year ear makes me feel like
               | we are really approaching what some people have called
               | the singularity period when I read about out net positive
               | fusion being announced my first Instinct was to think oh
               | of course it's now that that chat GPT is available of
               | course announcing a major fusion breakthrough would
               | happen within in days to weeks it just makes perfect
               | sense DJ eyes can solve problems that have have
               | confounded scientists for decades period to see just how
               | far we still have to go take a look at how this comment
               | red before I manually corrected it to what I had actually
               | set
        
             | nerpderp82 wrote:
              | It is a Deep Truth in that the universe is predictable and
              | can be represented (at least the parts we interact with)
              | mathematically. Matrix algebra is a helluva drug. I could
              | imagine someone developing the ability to listen to
              | spectrograms by looking at them.
        
               | justincormack wrote:
                | There is a whole piece in Godel, Escher, Bach where they
                | look at vinyl records, as all the sound data is in there.
        
         | haykmartiros wrote:
         | /u/threevox on reddit made a colab for playing with the
         | checkpoint:
         | 
         | https://colab.research.google.com/drive/1FhH3HlN8Ps_Pr9OR6Qc...
        
         | [deleted]
        
         | joelrunyon wrote:
          | The site isn't working for me. Is there anything I have to fix
          | on my side to make it work?
        
           | plank wrote:
            | Crashes repeatedly on iOS in Firefox (my usual browser), but
            | is OK in Safari, so probably not a WebKit thing.
        
         | alsodumb wrote:
         | Hayk! How smart are you! I loved your work on SymForce and
         | Skydio - totally wasn't expecting you to be co-author on this!
         | 
         | On a serious note, I'd really love some advice from you on time
         | management and how you get so much done? I love Skydio and the
         | problems you are solving, especially on the autonomy front, are
         | HARD. You are the VP of Autonomy there and yet also managed to
         | get this done! You are clearly doing something right. Teach us,
         | senpai!
        
         | superkuh wrote:
         | I've compiled/run a dozen different image to sound programs and
         | none of them produce an acceptable sound. This bit of your code
         | alone would be a great application by itself.
         | 
          | It'd be really cool if you could implement MS Paint-style
          | spectrogram painting or image upload in the web app for more
          | "manual" sound generation.
        
         | poslathian wrote:
         | Super! Makes sense since Skydio is also amazing.
         | 
          | How much data is used for fine tuning? Since spectrograms are
          | (surely?) very out of distribution for the pre-training
          | dataset, how much value does the pre-training really bring?
        
           | haykmartiros wrote:
           | To be honest, we're not sure how much value image pre
           | training brings. We have not tried to train from scratch, but
           | it would be interesting.
           | 
           | One thing that's very important though is the language pre-
           | training. The model is able to do some amazing stuff with
           | terms that do not appear in our data set at all. It does this
           | by associating with related words that do appear in the
           | dataset.
        
             | theGnuMe wrote:
             | You can embed images in spectrograms.. might sound weird
             | though
        
         | jablongo wrote:
          | Hello - this is awesome work. Like other commenters, I think
          | the idea that if you are able to transfer a concept into a
          | visual domain (in this case via FFT) it becomes viable to model
          | with diffusion is super exciting, but maybe an
          | oversimplification. With that in mind, do you think this type
          | of approach might work with panels of time series data?
        
         | newobj wrote:
         | All the AI music I've heard so far has a really unpleasant
         | resonant quality to it. Why is that? Can it be removed?
        
           | recursive wrote:
            | The link is down now, so I don't know about this one. But
            | most generated music is generated in the note domain, rather
            | than the audio domain. Any unpleasant resonance would be
            | introduced in the audio synthesis step. And audio synthesis
            | from note data is a very solved problem for any kind of
            | timbre you can conceive of, and some you can't.
        
           | hyperbovine wrote:
           | Presumably for similar reasons that the vast majority of AI
           | generated art and text is off-puttingly hideous or bland. For
           | every stunning example that gets passed around the internet,
           | thousands of others sucked. Generating art that is
           | aesthetically pleasing to humans seems like the Mt. Everest
           | of AI challenges to me.
        
             | blueboo wrote:
             | > For every stunning example that gets passed around the
             | internet, thousands of others sucked
             | 
             | ...implying there may be an art to AI art. Hmm.
             | 
              | Meanwhile, the degree to which it is off-puttingly hideous
              | in general can be seen in the popularity of Midjourney --
              | which is to observe that millions of folks (of perhaps
              | dubious aesthetic taste) find the results quite pleasing.
        
             | jameshart wrote:
              | The vast majority of human generated art is hideous or
              | bland. Artists throw away bad ideas or sketches that didn't
              | work all the time. Plus you should see most of the stuff
              | that gets pasted up on the walls at an average middle
              | school.
        
               | ROTMetro wrote:
                | Hard disagree. The average middle school picture will
                | have certain aspects exaggerated, giving you insights
                | into the mind's eye of the creator, how they see the
                | world, what details they focus on. There is no such
                | mind's eye behind AI art, so it's incredibly boring and
                | mundane, no matter how good a filter you apply on top of
                | its fundamental lack of soul or of anything interesting
                | to observe in the picture beyond surface level. It's
                | great for making art assets for businesses to use - it's
                | almost a perfect match, as they are looking to have no
                | controversial soul in the assets they use, but lots of
                | pretty bubblegum polish.
        
               | antipotoad wrote:
               | Perhaps most of the AI art out there (that honestly
               | represents itself as such) is boring and mundane, but
               | after many hours exploring latent space, I assure you
               | that diffusion models can be wielded with creativity and
               | vision.
               | 
               | Prompting is an art and a science in its own right, not
               | to speak of all the ways these tools can be strung
               | together.
               | 
               | In any case, everything is a remix.
        
               | dwringer wrote:
                | I have to agree; the act of coming up with a prompt is
                | one and the same with providing "insights into the mind's
                | eye of the creator, how they see the world, what details
                | they focus on" - two people will describe the same scene
                | with _completely_ different prompts.
        
               | jameshart wrote:
               | And the vast majority of professionally produced artwork
               | is for business use. It's packaging design or
               | illustration or corporate graphics or logos or whatever.
               | 
               | I don't get the objection.
        
             | adamsmith143 wrote:
             | Not sure about this. Models like Midjourney seem to put out
             | very consistently good images.
        
             | andybak wrote:
              | I think your comment is off-topic for the post you are
              | replying to. That wasn't asking about the general aesthetic
              | quality - more about a specific audio artifact.
             | 
             | > For every stunning example that gets passed around the
             | internet, thousands of others sucked.
             | 
             | From personal experience this is simply untrue. I don't
             | want to debate it because you seem to have strong feelings
             | about the topic.
        
               | hyperbovine wrote:
                | Even if you remove the artifact, the exact same comment
                | applies. It generates a somewhat less interesting version
                | of elevator music. This is not to crap on what they did.
                | As I said, the underlying problem is extremely difficult
                | and nobody has managed to solve it.
               | 
               | I don't feel strongly about this topic at all.
        
               | indigochill wrote:
               | > It generates a somewhat less interesting version of
               | elevator music.
               | 
                | This iteration does, but that's an artifact of how it's
                | being generated: small spectrograms that mutate without
                | emotional direction (by which I mean we expect things
                | like chord changes and intervals in melodies that we
                | associate with emotional expressions - elevator music
                | also stays in the neutral zone by design).
               | 
               | I expect with some further work, someone could add a
               | layer on top of this that could translate emotional
               | expressions into harmonic and melodic direction for the
               | spectrogram generator. But maybe that would also require
               | more training to get the spectrogram generator to
               | reliably produce results that followed those directions?
        
           | antognini wrote:
           | I've done some work on AI audio synthesis and the artifacts
           | you're hearing in these clips are coming from the algorithm
           | that is used to go from the synthesized spectrogram to the
           | audio (the Griffin-Lim algorithm).
           | 
           | Audio spectrograms have two components: the magnitude and the
           | phase. Most of the information and structure is in the
           | magnitude spectrogram so neural nets generally only
           | synthesize that. If you were to look at a phase spectrogram
           | it looks completely random and neural nets have a very, very
           | difficult time learning how to generate good phases.
           | 
           | When you go from a spectrogram to audio you need both the
           | magnitudes and phases, but if the neural net only generates
           | the magnitudes you have a problem. This is where the Griffin-
           | Lim algorithm comes in. It tries to find a set of phases that
           | works with the magnitudes so that you can generate the audio.
           | It generally works pretty well, but tends to produce that
           | sort of resonant artifact that you're noticing, especially
           | when the magnitude spectrogram is synthesized (and therefore
           | doesn't necessarily have a consistent set of phases).
           | 
           | There are other ways of using neural nets to synthesize the
           | audio directly (Wavenet being the earliest big success), but
           | they tend to be much more expensive than Griffin-Lim. Raw
           | audio data is hard for neural nets to work with because the
           | context size is so large.
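
            The effect is easy to reproduce: take a real recording, discard
            the phases, and let Griffin-Lim estimate them. A sketch using
            librosa's built-in implementation and bundled example clip:

                import numpy as np
                import librosa
                import soundfile as sf

                y, sr = librosa.load(librosa.example("trumpet"))
                mag = np.abs(librosa.stft(y, n_fft=1024))  # phases dropped

                # Griffin-Lim alternates STFT/ISTFT, keeping the target
                # magnitudes and adopting whatever phases a consistent
                # signal would have.
                y_rec = librosa.griffinlim(mag, n_iter=32, n_fft=1024)
                sf.write("reconstructed.wav", y_rec, sr)  # hear the warble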
        
             | yayr wrote:
              | Would it be an approach to use separate color channels for
              | the frequency amplitude and phase in the same picture?
              | Maybe the network would then have a better way of learning
              | the relationships, and there would be no need for the
              | postprocessing to generate a phase.
        
             | echelon wrote:
             | Griffin-Lim is slow and is almost certainly not being used.
             | 
             | A neural vocoder such as Hifi-Gan [1] can convert spectra
             | to audio - not just for voices. Spectral inversion works
             | well for any audio domain signal. It's faster and produces
             | much higher quality results.
             | 
             | [1] https://github.com/jik876/hifi-gan
        
               | antognini wrote:
               | If you check their about page they do say they're using
               | Griffin-Lim.
               | 
               | It's definitely a useful approach as an early stage in a
               | project since Griffin-Lim is so easy to implement. But I
               | agree that these days there are other techniques that are
               | as fast or faster and produce higher quality audio.
               | They're just a lot more complicated to run than Griffin-
               | Lim.
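
                For a sense of how little code "easy to implement" means
                here, the core loop is just this (untuned sketch):

                    import numpy as np
                    import librosa

                    def griffin_lim(mag, n_iter=32, n_fft=1024, hop=256):
                        # Start from random phases, then alternate: invert
                        # to audio, re-analyze, keep the new phases,
                        # re-impose the target magnitudes.
                        phase = np.exp(
                            2j * np.pi * np.random.rand(*mag.shape))
                        for _ in range(n_iter):
                            audio = librosa.istft(mag * phase,
                                                  hop_length=hop)
                            rebuilt = librosa.stft(audio, n_fft=n_fft,
                                                   hop_length=hop)
                            phase = np.exp(1j * np.angle(rebuilt))
                        return librosa.istft(mag * phase, hop_length=hop)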
        
               | seth_ wrote:
               | Author here: Indeed we are using Griffin-Lim. Would be
               | exciting to swap it out with something faster and better
               | though. In the real-time app we are running the
               | conversion from spectrogram to audio on the GPU as well
               | because it is a nontrivial part of the time it takes to
               | generate a new audio clip. Any speed up there is helpful.
        
             | bckr wrote:
             | RAVE attacks the phase issue by using a second step of
             | training. I don't completely understand it, but it uses a
             | GAN architecture to make the outputs of a VAE sound better.
        
             | kazinator wrote:
              | Phase is critical for pitch. Here is why. The spectral
              | transformation breaks up the signal into frequency bins.
              | The frequency bins are not accurate enough to convey pitch
              | properly. When a periodic signal is put through an FFT, it
              | will land in a particular frequency bin. Say that the
              | frequency of the signal is right in the middle of that bin.
              | If you vary its pitch a little bit, it will still land in
              | the same bin. Knowing the amplitude of the bin doesn't give
              | you the exact pitch. The phase information will not give it
              | to you either. However, between successive FFT frames, the
              | phase will rotate. The more off-center the frequency is,
              | the more the phase rotates. If the signal is dead center,
              | then each successive FFT frame will show the same phase.
              | When it is off center, the waveform shifts relative to the
              | window, and so the phase changes for every frame. From the
              | rotating phase, you can determine the pitch of that signal
              | with great accuracy.
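
              That phase-vocoder trick is compact enough to show in full.
              A numpy sketch, with a 443.7 Hz sine deliberately placed
              between bin centers:

                  import numpy as np

                  sr, n_fft, hop = 22050, 1024, 256
                  t = np.arange(sr) / sr
                  x = np.sin(2 * np.pi * 443.7 * t)  # off bin center

                  win = np.hanning(n_fft)
                  f0 = np.fft.rfft(x[:n_fft] * win)
                  f1 = np.fft.rfft(x[hop:hop + n_fft] * win)
                  k = int(np.argmax(np.abs(f0)))  # coarse: loudest bin

                  # Phase advance beyond what an exactly on-bin tone would
                  # show, wrapped to [-pi, pi), mapped back to frequency.
                  d = (np.angle(f1[k]) - np.angle(f0[k])
                       - 2 * np.pi * k * hop / n_fft)
                  d = (d + np.pi) % (2 * np.pi) - np.pi
                  f_est = (k + d * n_fft / (2 * np.pi * hop)) * sr / n_fft
                  print(f"bin center {k * sr / n_fft:.1f} Hz"
                        f" -> refined {f_est:.1f} Hz")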
        
               | antognini wrote:
               | Yes, this is exactly right and is why Griffin-Lim
               | generated audio often has a sort of warbly quality. If
               | you use a large FFT you can mitigate the issues with
               | pitch because the frequency resolution in your
               | spectrogram is higher, so the phase isn't so critical to
               | getting the right pitch. But the trade-off of a bigger
               | FFT is that the pitches now have to be stationary for
               | longer.
               | 
               | The other place where phase is critical is in impulse
               | sounds like drum beats. A short impulse is essentially
               | just energy over a broad range of frequencies, but the
               | phases have been chosen such that all the frequencies
               | cancel each other out everywhere except for one short
               | duration where they all add constructively. Without the
               | right phases, these kinds of sounds get smeared out in
               | time and sound sort of flat and muffled. The typing
               | example on their demo page is actually a good example of
               | this.
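
                The drum-smearing effect is easy to see numerically: keep
                an impulse's magnitudes but randomize its phases (sketch):

                    import numpy as np

                    n = 2048
                    click = np.zeros(n)
                    click[n // 2] = 1.0  # a perfect transient

                    # Flat magnitudes, carefully aligned phases.
                    spec = np.fft.rfft(click)
                    rand = np.exp(1j * np.random.uniform(
                        -np.pi, np.pi, spec.shape))
                    smeared = np.fft.irfft(np.abs(spec) * rand, n)

                    # Same magnitude spectrum, wrong phases: the sharp
                    # click collapses into a quiet wash across the frame.
                    print(click.max(), np.abs(smeared).max())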
        
             | amelius wrote:
             | I'm curious why, instead of using magnitude and phase, you
             | wouldn't use real and imaginary parts?
        
               | antognini wrote:
               | There have been some attempts at doing this, some of
               | which have been moderately successful. But fundamentally
               | you still have the problem that from the NN's
               | perspective, it's relatively easy for it to learn the
               | magnitude but very hard for it to learn the phase. So
               | it'll guess rough sizes for the real and imaginary parts,
               | but it'll have a hard time learning the correct ratio
               | between the two.
               | 
               | Models which operate directly on the time domain have
               | generally had a lot more success than models that operate
               | on spectrograms. But because time-domain models
               | essentially have to learn their own filterbank, they end
               | up being larger and more expensive to train.
        
             | xnzakg wrote:
             | Considering Stable Diffusion generates 3-channel (RGB)
             | images, maybe it would be possible to train it on amplitude
             | and phase data as two different channels?
        
               | antognini wrote:
               | People have tried that, but the model essentially learns
               | to discard the phase channel because it is too hard for
               | it to learn any useful information from it.
        
               | [deleted]
        
               | haykmartiros wrote:
               | We took a look at encoding phase, but it is very chaotic
               | and looks like Gaussian noise. The lack of spatial
               | patterns is very hard for the model to generate. I think
               | there are tons of promising avenues to improve quality
               | though.
        
           | woah wrote:
           | You're probably talking about the artifacts of converting a
           | low resolution spectrogram to audio.
        
             | wdfx wrote:
             | Can the spectrogram image be AI upscaled before
             | transforming back to the time domain?
        
               | malka wrote:
               | Yes it exists:
               | https://ccrma.stanford.edu/~juhan/super_spec.html
               | 
                | But the issue is not that the spectrogram is low quality.
                | 
                | The issue is that the spectrogram only contains the
                | amplitude information. You also need phase information
                | to generate audio from the spectrogram.
        
               | mcbuilder wrote:
               | Interesting, can't you quantize and snap to a phase that
               | makes sense to create the most musical resonance?
        
               | waltbosz wrote:
               | What happens if you run one of the spectrogram pictures
               | through an upscaler for images like ESRGAN ?
        
           | syntheweave wrote:
           | It sounds kind of like the visual artifacts that are
           | generated by resampling in two dimensions. Since the whole
           | model is based on compressing image content, whatever it's
           | doing DSP-wise is more-or-less "baked in", and a probable fix
           | would lie in doing it in a less hacky way.
        
           | crubier wrote:
           | I think this is because the generation is done in the
           | frequency domain. Phase retrieval is based on heuristics and
           | not perfect, so it leads to this "compressed audio" feel. I
           | think it should be improvable
        
       | TechTechTech wrote:
       | I got an actual `HTTP 402: PAYMENT_REQUIRED` response (never seen
       | one of those in the wild, according to Mozilla it is
       | experimental). Someone's credit card stopped scaling?
        
         | haykmartiros wrote:
         | LOL. Yes we had to upgrade our Vercel tier:
         | https://twitter.com/sethforsgren/status/1603425188401467392
        
       | PcChip wrote:
        | The problem is it _sounds_ awful, like a 64kbps MP3 or worse.
        | 
        | Perhaps AI can be trained to create music in different ways than
        | generating spectrograms and converting them to audio?
        
         | w_for_wumbo wrote:
         | It doesn't need to sound good at all for it to be useful. Like
         | with the AI Art creation, it can be a starting point for
         | artists to play around and rapidly try different concepts, and
         | then interpret the concept using high quality tools to create
         | something really quite remarkable.
         | 
         | It's all about empowering artists to explore more
         | possibilities.
        
           | stevehiehn wrote:
           | Exactly! UX/Pipelines/Integrations are the next logical step.
           | It's my belief that samples will essentially be 'free' very
           | soon. We will see DAW plugins/integrations that contextually
           | offer samples to the composer. I'm confident in this because
           | that's what I'm working on.
        
       | joenot443 wrote:
       | Incredible stuff, Seth & Hayk. I've been thinking nonstop about
       | new things to build using Stable Diffusion and this is the first
       | one that really took my breath away.
        
       | esotericsean wrote:
       | Coming at this from a layman's perspective, would it be possible
       | to generate a different sort of spectrogram that's designed for
       | SD to iterate upon even more easily?
        
       | hoschicz wrote:
        | What did you use as training data?
        
       | gedy wrote:
        | This is so good that I wondered if it's fake. Really impressive
        | results from generated spectrograms! Also really interesting
        | that it's not exactly trained on the audio files themselves -
        | wonder if the usual copyright-based objections would even apply
        | here.
        
         | rbn3 wrote:
          | Regarding those usual objections, I'd argue that a spectrogram
          | representation of a given piece of audio is just a different
          | (lossy) encoding of the same content/information, so any
          | hypothetical objections would still apply here.
        
           | Applejinx wrote:
            | You would be absolutely correct. The lossiness is in the
            | resolution of the image (512x512 is pretty terrible), but
            | given enough image resolution it's just an FFT transform, and
            | the only reason that stuff falls short is because people
            | don't give it, in turn, enough resolution. If you used wildly
            | overkill resolution for the FFT transform you could do
            | anything you wanted with no loss of tone quality. If you
            | turned that into visual images and did diffusion with it you
            | could do AI diffusion at convincing audio quality.
            | 
            | In theory the tone quality is not an objection here. When it
            | sounds bad it's because it's 512x512, because the FFT
            | resolution isn't up to the task, etc. People cling to very
            | inadequate audio standards for digital processing, but you
            | don't have to.
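
            The underlying point is easy to verify: a complex STFT, phase
            included, inverts essentially losslessly (librosa sketch):

                import numpy as np
                import librosa

                y, sr = librosa.load(librosa.example("trumpet"))
                spec = librosa.stft(y, n_fft=2048)  # magnitude AND phase
                y_back = librosa.istft(spec, length=len(y))
                print(np.max(np.abs(y - y_back)))   # ~1e-7: inaudible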
        
         | [deleted]
        
         | kgwgk wrote:
         | Why not? Music copyright was not even about audio recordings
         | originally.
        
       | Applejinx wrote:
       | Some of this is really cool! The 20 step interpolations are very
       | special, because they're concepts that are distinct and novel.
       | 
        | It absolutely sucks at cymbals, though. Everything sounds like
        | realaudio :) Composition's lacking, too. It's loop-y.
       | 
       | Set this up to make AI dubtechno or trip-hop. It likes bass and
       | indistinctness and hypnotic repetitiveness. Might also be good at
       | weird atonal stuff, because it doesn't inherently have any notion
       | of what a key or mode is?
       | 
       | As a human musician and producer I'm super interested in the
       | kinds of clarity and sonority we used to get out of classic
       | albums (which the industry has kinda drifted away from for
        | decades), so the way for this to take over for ME would involve
        | a hell of a lot more resolution in the FFT imagery, especially
        | in the highs, plus some way to also do another AI-ification of
        | which different parts of the song exist (like a further layer,
        | but one that controls abrupt switches of prompt).
       | 
       | It could probably do bad modern production fairly well even now
       | :) exaggeration, but not much, when stuff is really overproduced
       | it starts to get way more indistinct, and this can do indistinct.
       | It's realaudio grade, it needs to be more like 128kbps mp3 grade.
        
         | Metus wrote:
         | > composition's lacking, too. It's loop-y.
         | 
         | Well no wonder, it has absolutely no concept of composition
         | beyond a single 5s loop, if I understand correctly.
         | 
         | > It absolutely sucks at cymbals, though. Everything sounds
         | like realaudio :)
         | 
         | > It could probably do bad modern production fairly well even
         | now :) exaggeration, but not much, when stuff is really
         | overproduced it starts to get way more indistinct, and this can
         | do indistinct. It's realaudio grade, it needs to be more like
         | 128kbps mp3 grade.
         | 
          | I haven't sat down yet to calculate it, but is the output of SD
          | at 512x512 px and 24-bit enough to generate audio CD quality in
          | theory?
        
           | TheOtherHobbes wrote:
           | No.
           | 
           | And I suspect this will always have phase smearing, because
           | it's not doing any kind of source separation or individual
           | synthesis. It's effectively a form of frequency domain data
           | compression, so it's always going to be lossy.
           | 
           | It's more like a sophisticated timbral morph, done on a
           | complete short loop instead of an individual line.
           | 
           | It would sound better with a much higher data density. CD
           | quality would be 220500 samples for each five second loop.
           | Realtime FFTs with that resolution aren't practical on the
           | current generation of hardware, but they could be done in
           | non-realtime. But there will always be the issue of timbres
           | being distorted because outside of a certain level of
           | familiarity and expectation our brains start hearing gargly
           | disconnected overtones instead of coherent sound objects.
           | 
           | What this is _not_ doing is extracting or understanding
           | musical semantics and reassembling them in interesting ways.
           | The harmonies in some of these clips are pretty weird and
            | dissonant, and not what you'd get from a human writing
           | accessible music. This matters because outside of TikTok
           | music isn't about 5s loops, and longer structures aren't so
           | amenable to this kind of approach.
           | 
           | This won't be a problem for some applications, but it's a
           | long way short of the musical equivalent of a MidJourney
           | image.
           | 
           | Generally we're a lot more tolerant of visual "bugs" than
           | musical ones.
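
            The raw numbers behind that comparison (rough arithmetic; one
            512x512 RGB image holds a comparable raw bit count, but spent
            on a magnitude-only spectrogram rather than on samples):

                # 5 s of CD audio vs one 512x512 24-bit image, in raw bits.
                cd_bits = 5 * 44100 * 16 * 2  # stereo 16-bit: 7,056,000
                img_bits = 512 * 512 * 24     # one RGB image: 6,291,456
                print(cd_bits, img_bits)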
        
             | TremendousJudge wrote:
             | >and not what you'd get from a human writing accessible
             | music
             | 
             | The timbral qualities of the posted samples remind me of
             | some of the stuff I heard from Aphex Twin, like Alberto
             | Balsalm. Not accessible by a long shot but definitely human
        
             | Metus wrote:
             | I think an approach like this could generate interesting
             | sounds we as humans would never think of. Or meshing two
             | sounds in ways we could barely imagine or implement.
             | 
              | But of course something like this, which only thinks in 5s
              | clips, cannot generate a larger structure, like even a
              | simple song. Maybe another algorithm could seed the notes,
              | and an algorithm like this could generate the sounds via
              | img2img.
        
       | nonima wrote:
       | This is really cool but can someone tell me why we are automating
       | art? Who asked for this? The future seems depressing when I look
       | at all this AI generated art.
        
         | TheRealPomax wrote:
         | Everyone asked for this, including artists. If you make a
         | living off of making art, having the best tools to help you do
         | that is a constant, and the tools are finally starting to get
         | properly good. Will "the job" change because of the tools? Of
         | course. Will the nature of what it means for something to be
         | art change? Also of course. Art isn't some static, untouchable
         | thing. It changes as humanity does.
        
         | diydsp wrote:
         | I would say it's not "generated," but interpolated...
         | 
         | It doesn't make anything new or fresh. It doesn't pull any
         | real-life emotions or experiences into a synthesis that a
         | person can relate to. It's more like asking a teenaged comedian
         | to imitate numerous impressions of music styles. e.g. in Clerks
         | when the Russian guy does "metal":
         | https://youtu.be/7gFoHkkCaRE?t=55
         | 
         | Of course the modern conception of music in the West is as an
         | accompaniment to other, mostly drudging, activities, as opposed
         | to something to be paid singularly attention. Therefore, there
         | are many "valuable"(*) occasions to produce "impressions" of
         | music. E.g. in advertisements and social media flexes where
         | identity and attitude are the purpose of music. For these, a
         | shallow interpretation or reflection of loosely amalgamated
         | sound clips will suffice. But we don't just attend concerts or
         | focus sustained energy on sonic impressions. We listen to
         | lyrics and give over our consciousness to composed works
         | because we want to find secrets others give away in dealing
         | with this crazy thing called life: ideas to succeed,
         | admissions of failure, and what the expected emotional arcs
         | of these trajectories look like. This lofty goal is, to
         | date, not within the scope of AI stunts.
         | 
         | As Solzhenitsyn said, "Too much art is like candy and not
         | bread."
        
         | owlbynight wrote:
         | We're not automating art, we're creating tools that make it
         | easier for humans to create art. These are nothing more than
         | new and exciting tools. The cream will still rise to the top,
         | same as it ever was.
        
         | moonchrome wrote:
         | Because art is the low hanging fruit of "close enough"
         | applications.
        
           | slenocchio wrote:
           | I wonder if this is true for music. Our ears seem to be
           | much more discerning than our eyes when it comes to art.
        
             | moonchrome wrote:
             | I mean, listening to the samples on the link above I'd
             | hardly call it music, so I'd say you're right.
        
         | dangond wrote:
         | Because tons of people want to make art, and a lot of art
         | currently requires years of training to make anything close to
         | "good". Making art more accessible to create is a boon to
         | everyone who's dreamed of being able to make their own
         | paintings and music, but doesn't have the skills required.
        
           | [deleted]
        
           | logarhythmic wrote:
           | That just means there is going to be a whole lot more bad art
           | in the world
        
             | dangond wrote:
             | Not all of this art will be meant to be shared with the
             | whole world though. A lot of it will be people just using
             | it because they enjoy it.
        
         | schwartzworld wrote:
         | You can't automate a live performance or an oil painting with
         | AI in this way. This isn't going to replace musicians and
         | artists. If anything, I think a preponderance of AI art would
         | make people appreciate the real stuff more.
         | 
         | As to why, music is fun to create, and this is just a tool.
        
           | sampo wrote:
           | > You can't automate a live performance or an oil painting
           | with AI in this way.
           | 
           | You'd have to combine it with these guys
           | https://www.youtube.com/watch?v=WqE9zIp0Muk
        
         | hungryforcodes wrote:
         | Actually I agree with you, but HN is not really a place where
         | you will find artists defending themselves. However you will
         | find a lot of people defending the automation of art.
         | Generative art has its place. But ultimately, until humans
         | are extinct, human-generated art is the only thing which
         | really represents the species. Everything else is an advanced
         | form of puppetry or mimicry.
        
         | bulbosaur123 wrote:
         | > Who asked for this?
         | 
         | I did.
         | 
         | > The future seems depressing when I look at all this AI
         | generated art.
         | 
         | You should talk about your concerns with an AI psychotherapist.
        
       | 451mov wrote:
       | Why not use an image of the waveform as input?
        
       | mensetmanusman wrote:
       | This works because songs are images in time. FFT analysis does
       | not care.
        
       | vikp wrote:
       | Producing images of spectrograms is a genius idea. Great
       | implementation!
       | 
       | A couple of ideas that come to mind:
       | 
       | - I wonder if you could separate the audio tracks of each
       | instrument, generate separately, and then combine them. This
       | could give more control over the generation. Alignment might be
       | tough, though.
       | 
       | - If you could at least separate vocals and instrumentals, you
       | could train a separate model for vocals (LLM for text, then text
       | to speech, maybe). The current implementation doesn't seem to
       | handle vocals as well as TTS models.
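       | 
       | For the vocal/instrumental split, an off-the-shelf source
       | separator could produce the training stems. A sketch using
       | Spleeter's pretrained two-stem model (file names are
       | placeholders):
       | 
       |   from spleeter.separator import Separator
       | 
       |   # "2stems" yields vocals + accompaniment; "4stems" or
       |   # "5stems" would give per-instrument tracks instead
       |   separator = Separator("spleeter:2stems")
       |   separator.separate_to_file("song.mp3", "stems/")
       |   # stems/song/vocals.wav and stems/song/accompaniment.wav
       |   # can then become two separate spectrogram datasets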
        
         | btbuildem wrote:
         | I think you'd have to start with separate spectrograms per
         | instrument, then blend the complete track in "post" at the end.
        
       | michpoch wrote:
       | Earlier this year it was graphic designers, last month
       | software engineers, and now musicians are also feeling the
       | effects.
       | 
       | Who else will AI send looking for a new job?
        
         | goostavos wrote:
         | This was the first AI thing to fill me with a feeling of
         | existential dread.
        
           | TaupeRanger wrote:
           | What is with the hyperbole in this thread? This stuff sounds
           | like incoherent noise. It is noticeably worse than AI audio
           | stuff I heard 5 years ago. What is going on with the
           | responses here?
        
             | wcoenen wrote:
             | Usage of an image generator to produce passable music
             | fragments, even if they sound a bit distorted, is very
             | surprising. That type of novelty is why we come here.
        
         | Applejinx wrote:
         | Musicians were made to get a day job long before you were born
         | ;)
        
           | wpietri wrote:
           | Although I do wonder how much an earlier technology, audio
           | reproduction, contributed to that. My grandmother worked for
           | a time as a piano player as part of a nightclub orchestra. It
           | was a stable job back then. I have to wonder how many
           | musician jobs were killed off by the jukebox and related
           | technologies.
        
         | awestroke wrote:
         | If I was a musician, this post would not make me worry for a
         | second
        
           | wcoenen wrote:
           | If a hack based on an image generator already has promising
           | results for music generation, then imagine what will happen
           | if something dedicated to music is built from the ground up.
        
         | logn wrote:
         | The raw outputs of these tools will be best consumed by
         | experts. Until general AI, these are just better tools for the
         | same workers.
        
           | mensetmanusman wrote:
           | They were killed off by the ability to record the data. Every
           | city used to have their own music stars :)
        
         | 323 wrote:
         | Politicians, bureaucracy.
         | 
         | GPT-3, what policy should we apply to increase tax revenue by
         | 5% given these constraints?
         | 
         | GPT-3, please tell me some populist thing to say to win the
         | next election, or how should I deflect these corruption
         | charges.
        
           | antipotoad wrote:
           | Isn't this the plot of Deus Ex?
        
           | kmeisthax wrote:
           | "We should place a tax on all copyright lawyers and use it to
           | fund GPU manufacturing and AI development. At your next stump
           | speech, mention how the entertainment industry is stealing
           | jobs from construction workers. Your corruption charges won't
           | matter because voters only care about corruption when it's
           | not in their favor."
        
         | bawolff wrote:
         | Honestly none of them should. I think the moral panic around
         | these things is way overstated. They are cool but hardly about
         | to replace anyone's job.
        
       | d7y wrote:
        
       | wmwmwm wrote:
       | This is amazing, and scary (as a musician), but it also
       | reliably kills Firefox on iOS!
        
       | EZ-Cheeze wrote:
       | "https://en.wikipedia.org/wiki/Spectrogram - can we already do
       | sound via image? probably soon if not already"
       | 
       | Me in the Stable Diffusion discord, 10/24/2022
       | 
       | The ppl saying this was a genius idea should go check out my
       | other ideas
        
         | mdonahoe wrote:
         | If only we had a diffusional model that could take your ideas
         | and turn them into reality!
        
           | EZ-Cheeze wrote:
           | No I want ppl
        
       | Slow_Hand wrote:
       | As a musician, I'll start worrying once an AI can write at the
       | level of sophistication of a Bill Withers song:
       | 
       | https://www.youtube.com/watch?v=nUXgJkkgWCg
       | 
       | Not simply SOUND like a Bill Withers song, but to have the same
       | depth of meaning and feeling.
       | 
       | At that point, even if we lose we win because we'll all be
       | drowning in amazing music. Then we'll have a different class of
       | problem to contend with.
        
       | rbn3 wrote:
       | Great stuff! While it comes with the usual smeary iFFT
       | artifacts that AI-generated sound tends to have, the results
       | are surprisingly good. I especially love the nonsense vocals it
       | generates in the last example, which remind me of what singing
       | along to foreign songs felt like in my childhood.
        
       | gitfan86 wrote:
       | This is what I've been talking about all year. It is such a
       | relief to see it actually happen.
       | 
       | In summary: The search for AGI is dead. Intelligence was here and
       | more general than we realized this whole time. Humans are not
       | special as far as intelligence goes. Just look at how often people
       | predict that an AI cannot X or Y or Z. And then when an AI does
       | one of those things they say, "well it cannot A or B or C".
       | 
       | What is next: This trend is going to accelerate as people realize
       | that AI's power isn't in replacing human tasks with AI agents,
       | but letting the AI operate in latent spaces and domains that we
       | never even thought about trying.
        
         | visarga wrote:
         | Generated content without filtering/validation is worthless.
         | 
         | I predict some kind of testing, validation, or ranking will
         | be developed to filter out generated content. Each domain has
         | its own rules - you need to implement validation for code and
         | math, fact checks for text, contrasting the results from
         | multiple solutions for problem solving, and aesthetic scoring
         | for art.
         | 
         | But validation is probably harder than learning to generate in
         | the first place, probably a situation similar to closing the
         | last percent in self driving.
        
       | [deleted]
        
       | nixpulvis wrote:
       | Sounds a bit "clowny" to me, for lack of a better word.
        
       | Broge wrote:
       | I wonder if it's possible to fine-tune an image upscaling model
       | on spectrograms, in order to clean up the sound?
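       | 
       | diffusers ships an x4 upscaler pipeline that could be the
       | starting point; fine-tuning it on spectrogram pairs is the
       | open question. A rough sketch (checkpoint and prompt are
       | assumptions):
       | 
       |   import torch
       |   from PIL import Image
       |   from diffusers import StableDiffusionUpscalePipeline
       | 
       |   pipe = StableDiffusionUpscalePipeline.from_pretrained(
       |       "stabilityai/stable-diffusion-x4-upscaler",
       |       torch_dtype=torch.float16,
       |   ).to("cuda")
       | 
       |   low_res = Image.open("riffusion_clip.png").convert("RGB")
       |   hi_res = pipe(prompt="clean mel spectrogram, jazz riff",
       |                 image=low_res).images[0]
       |   hi_res.save("upscaled_spectrogram.png")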
        
       | andy_ppp wrote:
       | I was thinking about this - what if someone trained a stable
       | diffusion type model on all of the worlds commercial music? This
       | model would probably produce quite amazing music given enough
       | prompting and I'm wondering if the music industry would be
       | allowed to claim copyright on works created with such a model.
       | Would it be illegal or is this just like a musician picking up
       | ideas from hearing the world of music? Is it really right to make
       | learning a crime, even if machines are doing it? I'm conflicted
       | after finding out that for sync licensing the music industry
       | wants a percentage of revenue based on your subscription fees,
       | sometimes as high as 15%-20%! I'm surprised such a huge fee
       | isn't considered some kind of protection racket.
        
         | throw78311 wrote:
         | This question has been explored before; see Kolmogorov Music:
         | https://www.youtube.com/watch?v=Qg3XOfioapI
        
       | dreilide wrote:
       | Impressive stuff. Reminds me of when ppl started using image
       | classifier networks on spectrograms in order to classify audio.
       | I would not have thought to apply a similar concept to
       | generative models, but it seems obvious in hindsight.
        
       | quux wrote:
       | The vocals in these tracks are so interesting. They sound like
       | vocals, with the right tone, phonemes, and structure for the
       | different styles and languages, but no meaning.
       | 
       | Reminds me of the soundtrack to Nier Automata which did a similar
       | thing: https://youtu.be/8jpJM6nc6fE
        
         | int_19h wrote:
         | That's glossolalia, and it's not that uncommon in human-created
         | art.
        
         | qayxc wrote:
         | I think AI would be great at generating similar things. Might
         | be very nice for generating fake languages, too.
        
       | mastax wrote:
       | Wow those examples are shockingly good. It's funny that the
       | lyrics are garbled analogously to text in stable diffusion
       | images.
       | 
       | The audio quality is surprisingly good, but does sound like it's
       | being played through an above-average quality phone line. I bet
       | you could tack on an audio-upres model afterwards. Could train it
       | by turning music into comparable-resolution spectrograms.
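       | 
       | Building those training pairs is cheap. A sketch with librosa,
       | simulating the "phone line" by round-tripping the audio through
       | a low sample rate (parameters are illustrative):
       | 
       |   import librosa
       | 
       |   y, sr = librosa.load("track.wav", sr=44100)
       |   # degrade: down- then up-sample through 8 kHz
       |   y_lo = librosa.resample(y, orig_sr=sr, target_sr=8000)
       |   y_lo = librosa.resample(y_lo, orig_sr=8000, target_sr=sr)
       |   # paired spectrograms: degraded input, full-band target
       |   S_lo = librosa.feature.melspectrogram(y=y_lo, sr=sr)
       |   S_hi = librosa.feature.melspectrogram(y=y, sr=sr)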
        
       | pmontra wrote:
       | A musician friend of mine told me that this is (translating
       | freely) a perversion: building in the frequency domain and
       | getting back time. Don't shoot the messenger.
       | 
       | Personally I like the results. I'm totally untrained and couldn't
       | hear any of the issues many comments are pointing out.
       | 
       | I guess that all of lounge/elevator music and probably most ad
       | jingles will be automated soon, if automation costs less than
       | human authors.
        
         | adamsmith143 wrote:
         | "Horse Carriage Driver says horseless carriages are
         | abominations. More at 12!"
        
       | bufferoverflow wrote:
       | A network trained on spectrograms only should do much better.
        
       | lucidrains wrote:
       | personalized RL agents that find aesthetic trajectories through
       | the music latent space... soon, i hope :D
        
         | haykmartiros wrote:
         | Love this idea. If I had more time I'd make a spaceship
         | game where you are flying around the latent space, and model
         | interrogation is used to provide labels to landmarks as you
         | move around.
        
       | XorNot wrote:
       | So this is slightly bending my mind again. Somehow image
       | generators were more comprehensible compared to getting coherent
       | music out. This is incredible.
        
       | leod wrote:
       | Awesome work.
       | 
       | Would you be willing to share details about the fine-tuning
       | procedure, such as the initialization, learning rate schedule,
       | batch size, etc.? I'd love to learn more.
       | 
       | Background: I've been playing around with generating image
       | sequences from sliding windows of audio. The idea roughly works,
       | but the model training gets stuck due to the difficulty of the
       | task.
        
       | raajg wrote:
       | If such unreasonably good music can be created based on
       | information encoded in an image, I'm wondering what other
       | things we could do with this flow:
       | 
       | 1) Write text to describe the problem
       | 
       | 2) Generate an image Y that encodes that information
       | 
       | 3) Parse that image Y to do X
       | 
       | Example: Y = blueprint, X = constructing a building with that
       | blueprint
        
       | newswasboring wrote:
       | If it can do music, can we train better models for different
       | kinds of music? Or would different models for different
       | instruments make more sense? For different instruments we can
       | get better resolution by making the spectrogram cover different
       | frequency ranges. This is terribly exciting; what a time to be
       | alive.
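       | 
       | With a mel filterbank, the frequency range each image covers is
       | just a parameter. A sketch of spending the same bins on a
       | narrow bass range versus the full band (librosa, illustrative
       | values):
       | 
       |   import librosa
       | 
       |   y, sr = librosa.load("bassline.wav", sr=22050)
       |   # full band: 128 mel bins spread over 0 Hz .. sr/2
       |   S_full = librosa.feature.melspectrogram(
       |       y=y, sr=sr, n_mels=128)
       |   # bass-only: the same 128 bins over 30-500 Hz, with a
       |   # longer FFT for the finer frequency resolution
       |   S_bass = librosa.feature.melspectrogram(
       |       y=y, sr=sr, n_mels=128, n_fft=8192,
       |       fmin=30.0, fmax=500.0)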
        
       | superb-owl wrote:
       | The interpolation from keyboard typing to jazz is incredible.
       | This is what AI art should be.
        
       | slenocchio wrote:
       | Do you guys think AI creative tools will completely subsume the
       | possibility space of human-made music? Or does it open up a new
       | dimension of possibilities orthogonal to it? Hard for me to
       | imagine how AI would be able to create something as unique and
       | human as D'Angelo's Voodoo (esp. before he existed) but maybe it
       | could (eventually).
       | 
       | If I understand these AI algorithms at a high level, they're
       | essentially finding patterns in things that already exist and
       | replicate it w some variation quite well. But a good song is
       | perfect/precise in each moment in time. Maybe we'll only ever
       | be able to get asymptotically closer but never _quite_ reach
       | something as perfectly crafted as a human could make? Maybe
       | there will always be a frontier space only humans can explore?
        
         | ElFitz wrote:
         | > Hard for me to imagine how AI would be able to create
         | something as unique and human as D'Angelo's Voodoo (esp. before
         | he existed)
         | 
         | There's always that immortal randomly typing monkey with a
         | typewriter thing [1]. And, in our case, it seems to be better
         | than random.
         | 
         | So, yes, perhaps. But perhaps we could instead build and create
         | things that are yet unimaginable upon it. We'll see.
         | 
         | [1]: https://en.wikipedia.org/wiki/Infinite_monkey_theorem
        
       | simsspoons wrote:
       | this is just great
        
         | zoytek wrote:
         | It's amazing. They've really got something revolutionary here.
        
       | ZiiS wrote:
       | For the 30th anniversary?
       | https://warp.net/gb/artificial-intelligence
        
         | hoherd wrote:
         | Nice reference! I had never seen that site before, but those
         | albums had a significant impact on my musical journey.
         | 
         | There was a purple Victorian house in Colorado Springs where
         | the living room was converted into a record and cd store called
         | Life By Design. I picked up these albums and a ton of other
         | obscure music there. I was so happy to not have to drive all
         | the way up to Wax Trax in Denver to be able to discover new
         | artists.
        
       | xtracto wrote:
       | This looks great and the idea is amazing. I tried with the
       | prompt: "speed metal" and "speed metal with guitar riffs" and got
       | some smooth rock-ballad type music. I guess there was no heavy
       | metal in the learning samples haha.
       | 
       | Great work!
        
         | the_third_wave wrote:
         | Gregorian death metal folk also seems to have lacked seed tunes
         | but the thing is just in its infancy so soon we'll be banging
         | our tonsured heads to the folky beats of ...
         | 
         | ...OK, need to create a band name generator to work in tandem
         | with this thing. Let's see what one of its brethren in ML makes
         | of it...
         | 
         |  _- "Echoes of the Past": This name plays on the idea of
         | Gregorian chanting, which is often associated with the distant
         | past, and combines it with the intense and aggressive sound of
         | death metal.
         | 
         | - "The Order of the Black Chant": This name incorporates
         | elements of both the religious connotations of Gregorian
         | chanting and the dark, heavy sound of death metal, creating a
         | sense of mystery and danger.
         | 
         | - "Foretold in Blood": This name evokes both the ancient,
         | mystical nature of Gregorian chanting and the violent themes of
         | death metal, creating a sense of ancient prophecy coming to
         | pass.
         | 
         | - "Crypt of the Silent Choir": This name brings together the
         | eerie, otherworldly sound of Gregorian chanting with the
         | underground, underground feel of death metal, creating a sense
         | of hidden secrets and forbidden knowledge._
         | 
         | "The Order of the Black Chant" it shall be.
        
       | vintermann wrote:
       | Fun! I tried something similar with DCGAN when it first came out,
       | but that didn't exactly make nice noises. The conversion to and
       | from Mel spectrograms was lossy (to put it mildly), and DCGAN,
       | while impressive in its day, is nothing like the stuff we have
       | today.
       | 
       | Interesting that it gets such good results with just fine-tuning
       | the regular SD model. I assume most of the images it's trained on
       | are useless for learning how to generate Mel spectrograms from
       | text, so a model trained from scratch could potentially do even
       | better.
       | 
       | There's still the issue of reconstructing sound from the
       | spectrograms. I bet it's responsible for the somewhat tinny sound
       | we get from this otherwise very cool demo.
        
       | knicholes wrote:
       | Does anyone have any good guides/tutorials for how to fine-tune
       | Stable Diffusion? I'm not talking about textual inversion or
       | dreambooth.
        
       | bane wrote:
       | I bet a cool riff on this would be to simply sample an ambient
       | microphone in the workplace and use that to generate and slowly
       | introduce matching background music that fits the current tenor
       | of the environment. Done slowly and subtly enough, I'd bet the
       | listener may not even be entirely aware it's happening.
       | 
       | If we could measure certain kinds of productivity it might even
       | be useful as a way to "extend" certain highly productive ambient
       | environments a la "music for coding".
        
         | hammock wrote:
         | >in the workplace
         | 
         | Or at a house party, club or restaurant... as more people
         | arrive or leave and the energy level rises or declines... or
         | human rhythms speed up or slow down...so does the music...
        
           | Def_Os wrote:
           | DJs are getting automated away too!
        
         | chrisfrantz wrote:
         | Reactive generative music would be so cool
        
       | londons_explore wrote:
       | I think there has to be a better way to make long songs...
       | 
       | For example, you could take half the previous spectrogram, shift
       | it to the left, and then use the inpainting algorithm to make the
       | next bit... Do that repeatedly, while smoothly adjusting the
       | prompt, and I think you'd get pretty good results.
       | 
       | And you could improve on this even more by having a non-linear
       | time scale in the spectrograms. Have 75% of the image be linear,
       | but the remaining 25% represent an exponentially downsampled
       | version of history. That way, the model has access to what was
       | happening seconds, minutes, and hours ago (although with less
       | detail the further back you go).
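       | 
       | The first idea maps directly onto an off-the-shelf inpainting
       | pipeline. A minimal sketch (the checkpoint is an assumption,
       | and this sidesteps Riffusion's own interpolation code):
       | 
       |   import torch
       |   from PIL import Image
       |   from diffusers import StableDiffusionInpaintPipeline
       | 
       |   pipe = StableDiffusionInpaintPipeline.from_pretrained(
       |       "runwayml/stable-diffusion-inpainting",
       |       torch_dtype=torch.float16,
       |   ).to("cuda")
       | 
       |   def extend(spec: Image.Image, prompt: str) -> Image.Image:
       |       w, h = spec.size
       |       shifted = Image.new("RGB", (w, h))
       |       # keep the last half of the previous spectrogram
       |       shifted.paste(spec.crop((w // 2, 0, w, h)), (0, 0))
       |       mask = Image.new("L", (w, h), 0)
       |       # regenerate only the (empty) right half
       |       mask.paste(255, (w // 2, 0, w, h))
       |       return pipe(prompt=prompt, image=shifted,
       |                   mask_image=mask).images[0]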
        
         | someguyorother wrote:
         | Perhaps you could do a hierarchical approach somehow, first
         | generating a "zoomed out" structure, then copying parts of it
         | into an otherwise unspecified picture to fill in the details.
         | 
         | But perhaps plain stable diffusion wouldn't work - you might
         | need different neural networks trained on each "zoom level"
         | because the structure would vary: music generally isn't like
         | fractals and doesn't have exact self-similarity.
        
       | talhof8 wrote:
       | Really cool. Can't get this to work on the homepage though.
       | 
       | Might be a traffic thing?
       | 
       | Edit: Works now. A bit laggy but it works. Brilliant!
        
         | scoopertrooper wrote:
         | I'm getting this back when I try to hear cats sing me a rock
         | opera:
         | 
         | {"data":{"success":true,"worklet_output":{"error":"Model
         | version 5qekv1q is not healthy"},"latency_ms":530}}
        
           | Pepe1vo wrote:
           | Same here, servers are overloaded probably. Shame, I was
           | really looking forward to a Wu Tang Clan and Jamiroquai
           | collab
        
         | LoveMortuus wrote:
         | I also don't hear anything, even when my prompt was selected...
        
           | MichaelZuo wrote:
           | Me neither, perhaps the web app is a bit buggy?
        
         | [deleted]
        
         | benplumley wrote:
         | Same earlier, but I can now get it to work very intermittently,
         | with the error "Uh oh! Servers are behind, scaling up..."
        
       | orobinson wrote:
       | I'd been wondering (naively) if we'd reached the point where we
       | won't see any new kinds of music, now that electronic synthesis
       | allows us to make any possible sound. Changes in musical styles
       | throughout history tend to have been brought about by people
       | embracing new instruments or technology.
       | 
       | This is the most exciting thing I've seen in ages as it shows we
       | may be on the verge of the next wave of new technology in music
       | that will allow all sorts of weird and wonderful new styles to
       | emerge. I can't wait to see what these tools can do in the hands
       | of artists as they become more mainstream.
        
         | EamonnMR wrote:
         | 'make any possible sound' is less important than 'make x sound
         | easily' by way of tools and accumulated knowledge. Also, what
         | audiences are receptive to matters a lot - you could have made
         | noise rock in the 40s but I can't imagine it would have sold a
         | lot of records.
        
       | lachlan_gray wrote:
       | Wow, diffusion could be a game changer for audio restoration.
        
       | sampo wrote:
       | GPT-3 has 175 billion parameters (says Wikipedia). What is the
       | size of the neural network used in this riffusion project?
        
         | sebzim4500 wrote:
         | Stable Diffusion 'only' has ~1B parameters IIRC.
        
       | seth_ wrote:
       | Authors here: Fun to wake up to this surprise! We are rushing to
       | add GPUs so you can all experience the app in real-time. Will
       | update asap
        
         | SamPatt wrote:
         | Fascinating stuff.
         | 
         | One of the samples had vocals. Could the approach be used to
         | create solely vocals?
         | 
         | Could it be used for speech? If so, could the speech be
         | directed or would it be random?
        
         | [deleted]
        
         | AMICABoard wrote:
         | Awesome, there is another project out there that does it on
         | the CPU: https://github.com/marcoppasini/musika. Maybe mix the
         | two, i.e. take the initial output of musika, convert it to a
         | spectrogram, and feed it to riffusion to get more variation...
        
       | ElijahLynn wrote:
       | Wow, I just learned so much about spectrograms. I had no idea
       | that one could be reversed back into an audio wave!
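       | 
       | The catch is that a magnitude spectrogram has no phase, so the
       | reconstruction typically estimates it with Griffin-Lim. A
       | round-trip sketch with librosa (file name and parameters are
       | illustrative):
       | 
       |   import librosa
       |   import soundfile as sf
       | 
       |   y, sr = librosa.load("clip.wav", sr=22050)
       |   # forward: magnitude-only mel spectrogram
       |   S = librosa.feature.melspectrogram(
       |       y=y, sr=sr, n_fft=2048, hop_length=512)
       |   # inverse: Griffin-Lim iteratively guesses the phase
       |   y_hat = librosa.feature.inverse.mel_to_audio(
       |       S, sr=sr, n_fft=2048, hop_length=512, n_iter=32)
       |   sf.write("reconstructed.wav", y_hat, sr)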
        
       | fritzschopen wrote:
       | It seems that SD covers everything in terms of generative AI.
       | Speaking of music, very interesting paper and demo. Just
       | wondering, in terms of licensing and commercialization, what
       | kind of mess are we expecting here?
        
       | pea wrote:
       | This is amazing! Would it be possible to use this to
       | interpolate between two existing songs (i.e. generate
       | spectrograms from audio and transition between them)?
        
       | CrypticShift wrote:
       | Things similar to the "interpolation" part (not the generative
       | part) are already used extensively especially for game and movie
       | sound design. Kyma [1] is the absolute leader (it requires
       | expensive hardware though). IMO later iterations on this approach
       | may lead to similar or better results.
       | 
       | FYI, other apps that use more classic but still complex
       | Spectral/Granular algos :
       | 
       | https://www.thecargocult.nz/products/envy
       | 
       | https://transformizer.com/products/
       | 
       | https://www.zynaptiq.com/morph/
       | 
       | [1] https://kyma.symbolicsound.com/
        
       | winReInstall wrote:
       | Can't wait to see this in karaoke: you just sing the lyrics and
       | it jams along with the music.
        
       | serverholic wrote:
       | I'm curious about the limitations of using spectrograms and
       | transient-heavy sounds like drums.
       | 
       | It seems like you'd need very high resolution spectrograms to get
       | a consistently snappy drum sound.
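       | 
       | The back-of-envelope math backs this up: each spectrogram
       | column spans hop_length samples, so with common settings a drum
       | attack is smeared across just one or two columns:
       | 
       |   sr = 44100        # sample rate in Hz
       |   hop_length = 512  # samples between spectrogram columns
       |   ms_per_column = 1000 * hop_length / sr
       |   print(f"{ms_per_column:.1f} ms per column")  # ~11.6 ms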
        
         | genewitch wrote:
         | 8GB is enough to do 1080p resolution. the UI i use for SD maxes
         | out at 2048x2048. however, it takes a lot longer than 512x512
         | to generate: 1m40s versus 1.97s.
         | 
         | I'm guessing if one had access to one of those nvidia backplane
         | rackmount devices one could generate 8k or larger resolution
         | images.
        
           | astrange wrote:
           | SD can't generate coherent images if you increase the
           | output size. They're basically always unusable unless you
           | don't need any global structure in them.
        
       | bawolff wrote:
       | I wonder if this would be applicable to video game music. Being
       | able to make stuff that's less repetitive but also transitions
       | smoothly to specific things with in-game events.
        
       | zone411 wrote:
       | Interesting. I experimented a bit with the approach of using
       | diffusion on whole audio files, but I ultimately discarded it in
       | favor of generating various elements of music separately. I'm
       | happy with the results of my project of composing melodies
       | (https://www.youtube.com/playlist?list=PLoCzMRqh5SkFPG0-RIAR8...)
       | and I still think this is the way to go and but that was before
       | Stable Diffusion came out. These are interesting results though,
       | maybe it can lead to something more.
        
       | ricopags wrote:
       | This is so completely wild. Love the novelty and inventiveness.
       | 
       | Could anyone help me understand whether using SVG instead of a
       | bitmap image would be possible? I realize that probably wouldn't
       | be taking advantage of the current diffusion part of Stable-
       | Diffusion, but my intuition is maybe it would be less noisy or
       | offer a cleaner/more compressible path to parsing transitions in
       | the latent space.
       | 
       | Great idea? Off base entirely? Would love some insight either way
       | :D
        
       | jsat wrote:
       | Today's music generation is putting my Pop Ballad Generator to
       | shame: http://jsat.io/blog/2015/03/26/pop-ballad-generator/
        
       | owlbynight wrote:
       | If copyright laws don't catch up, the sampling industry is
       | cooked.
       | 
       | Made this: https://soundcloud.com/obnmusic/ai-sampling-riffusion-
       | waves-...
        
       | fowlkes wrote:
       | Multiple folks have asked here and in other forums but I'm going
       | to reiterate, what data set of paired music-captions was this
       | trained on? It seems strange to put up a splashy demo and repo
       | with model checkpoints but not explain where the model came
       | from... is there something fishy going on?
        
       | needz wrote:
       | This website crashes Firefox on iOS
        
       | Abecid wrote:
       | This is one of the most ingenious things I've seen in my life
        
       | senko wrote:
       | @haykmartiros, @seth_, thank you for open sourcing this!
       | 
       | Played a bit with the very impressive demos; now waiting in
       | queue for my very own riff to get generated.
       | 
       | Great as this is, I'm imagining what it could do for song
       | crossfades (actual mixing instead of plain crossfade even with
       | beat matching).
        
       | soperj wrote:
       | > https://www.riffusion.com/?&prompt=punk+rock+in+11/8
       | 
       | Tried getting something in an odd time signature, but it still
       | comes out 4/4.
        
       | Moosdijk wrote:
       | Wow this is awesome!
        
       | tomrod wrote:
       | This is huge.
       | 
       | This shows me that Stable Diffusion can create anything with the
       | following conditions:
       | 
       | 1. Can be represented as a static item in two dimensions (their
       | weaving together notwithstanding, it is still piece-by-piece
       | statically built)
       | 
       | 2. Acceptable with a certain amount of lossiness in the
       | encoding/decoding
       | 
       | 3. Can be presented through a medium that at some point in
       | creation is digitally encoded somewhere.
       | 
       | This presents a lot of very interesting changes for the near
       | term. ID.me and similar security approaches are basically dead.
       | Chain of custody proof will become more and more important.
       | 
       | Can stable diffusion work across more than two dimensions?
        
         | marviel wrote:
         | I would argue that its high-fidelity representations of 3d
         | space, imply that the model's weights are capable of pattern-
         | matching in multiple dimensions, provided the input is embedded
         | into 2d space appropriately.
        
         | Pxtl wrote:
         | Now I'm wondering about feeding Stable Diffusion 2D landscape
         | data with heightmaps and letting it generate maps for RTS
         | videogames. I mean, the only wrinkle there is an extra channel
         | or two.
        
       | [deleted]
        
       | naillo wrote:
       | It would be interesting to see if this could be used for longer
       | tracks by inpainting the right half of the spectrogram.
        
       | spyder wrote:
       | Another related audio diffusion model (but without text
       | prompting) here: https://github.com/teticio/audio-diffusion
        
         | kanwisher wrote:
         | oh wow this one works really well
        
       | r3trohack3r wrote:
       | I can't help but see parallels to synesthesia. It's amazing how
       | capable these models are at encoding arbitrary domain knowledge
       | as long as you can represent it visually w/ reasonable noise
       | margins.
        
       | Animats wrote:
       | "Uh oh! Servers are behind, scaling up..." - havent' been able to
       | get past that yet. Anyone getting new output?
       | 
       | This is already better than most techno. I can see DJs using
       | this, typing away.
        
       | up2isomorphism wrote:
       | This is horrible music, but of course there is nothing to be
       | ashamed of.
        
       | Aardwolf wrote:
       | How come the Stable Diffusion model helps here? Does the fact
       | that it knows what an astronaut on a horse looks like have effect
       | on the audio? Would starting the training from an empty model
       | work too?
        
       | xcambar wrote:
       | I will try it, but for the name alone it deserves praise.
        
       | kingcai wrote:
       | Absolutely brilliant!
        
       | NHQ wrote:
       | In the end there was the word.
        
       | dylan604 wrote:
       | The results of this are similar to my nitpicks of AI generated
       | images (well, duh!). There's definitely something recognizable
       | there, but something's just not quite right about it.
       | 
       | I'm quite impressed that there was enough training data within SD
       | to know what a spectrogram looks like for the different sounds.
        
       | minaguib wrote:
       | Absolutely incredible - from idea to implementation to output.
        
       | jansan wrote:
       | Very impressive. I am quite confident that next year's number
       | one Christmas hit will start like "church bells to electronic
       | beats".
        
         | quakeguy wrote:
         | Xmd5a is already a real track.
         | 
         | https://www.youtube.com/watch?v=crcqADcAusg
        
       | valdiorn wrote:
       | This really is unreasonably effective. Spectrograms are a lot
       | less forgiving of minor errors than a painting. Move a brush
       | stroke up or down a few pixels, you probably won't notice. Move a
       | spectral element up or down a bit and you have a completely
       | different sound. I don't understand how this can possibly be
       | precise enough to generate anything close to a cohesive output.
       | 
       | Absolutely blows my mind.
        
         | seth_ wrote:
         | Author here: We were blown away too. This project started with
         | a question in our minds about whether it was even possible for
         | the stable diffusion model architecture to output something
         | with the level of fidelity needed for the resulting audio to
         | sound reasonable.
        
         | TaupeRanger wrote:
         | It's...not effective though. Am I listening to the wrong thing
         | here? Everything I hear from the web app is jumbled nonsense.
        
           | itronitron wrote:
           | I think we're at the point, with these AI generative model
           | thingies, where the practitioners are mesmerized by the
           | mechatronic aspect like a clock maker who wants to recreate
           | the world with gears, so they make a mechanized puppet or
           | diorama and revel in their ingenuity.
        
           | jefftk wrote:
           | I think the progression from church bells to electronic
           | beats is especially good:
           | https://www.riffusion.com/about/church_bells_to_electronic_b...
        
         | hyperbovine wrote:
         | Wasn't this Fraunhofer's big insight that led to the
         | development of MP3? Human perception actually is pretty
         | forgiving of perturbations in the Fourier domain.
        
           | w-m wrote:
           | You probably mean Karlheinz Brandenburg, the developer of
           | MP3, who worked on psychoacoustics. Not completely off
           | though, as he did the research at a Fraunhofer research
           | institute, which takes its name from Joseph von Fraunhofer,
           | the inventor of the spectroscope.
        
             | th0ma5 wrote:
             | Does the institute not also claim that work?
        
               | w-m wrote:
               | Fair enough. But for me, when talking about `having an
               | insight`, I don't imagine a non-human entity doing that.
               | And to be pedantic (talking about Germans doing research,
               | I hope everyone would expect me to be), the institute is
               | called Fraunhofer IIS. `Fraunhofer` would colloquially
               | refer to the society, which is an organization with 76
               | institutes total. Although, of course, the society will
               | also claim the work...
        
               | humanistbot wrote:
               | It's an interesting question, one I hadn't thought of
               | before. But in common language, it sometimes makes sense
                | to credit the institution, other times just the individuals. I
                | think it may be based more on how much the institution
               | collectively presents itself as the author and speaks on
               | behalf of the project versus the individuals involved.
               | Here is my own general intuition for a few contrasting
               | cases:
               | 
                | Random forests: Ho and Breiman, not really Bell Labs
                | and UC Berkeley
               | 
               | Transistors: Bardeen, Brattain, and Shockley, not really
               | Bell Labs (thank the Nobel Prize for that)
               | 
               | UNIX: Primarily Bell Labs, but also Ken Thompson and
                | Dennis Ritchie (this is a hard one)
               | 
               | GPT-n: OpenAI, not really any individual, and I can't
               | seem to even recall any named individual from memory
        
               | WanderPanda wrote:
               | Bringing the right people together and having the right
               | environment that gives rise to ,,having an insight" can
               | be a big part as well.
        
               | killerpopiller wrote:
                | btw, it is a publicly funded non-profit organisation
        
           | ComplexSystems wrote:
           | In very limited situations. You can move a frequency around
           | (or drop it entirely) if it's being masked by a nearby loud
           | frequency. Otherwise, you would be amazed at the sensitivity
           | of pitch perception.
        
         | 323 wrote:
         | You can also add another neural-network to "smooth" the
         | spectrogram, increase the resolution and remove artefacts, just
         | like they do for image generation.
        
           | bckr wrote:
           | Pretty sure that's how RAVE works
        
       | evo_9 wrote:
       | Pretty nice, I was just talking to a friend about needing a music
       | version of chatgpt, so thank you for this.
       | 
       | Wondering if it would be possible to create a version of this
       | that you can point at a person's SoundCloud and have it emulate
       | their style / create more music in the style of the original
       | artist. I have a couple albums worth of downtempo electronic
       | music I would love to point something like this at and see what
       | it comes up with.
        
         | trekkie1024 wrote:
         | https://mubert.com/ might be what you're looking for.
        
       | bheadmaster wrote:
       | This is a genius idea. Using an already-existing and well-
       | performing image model, and just encoding input/output as a
       | spectrogram... It's elegant, it's obvious in retrospect, it's
       | just pure genius.
       | 
       | I can't wait to hear some serious AI music-making a few years
       | from now.
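       | 
       | The encoding itself is only a few lines. A sketch of rendering
       | audio as a log-mel image (parameters are illustrative, not
       | Riffusion's exact ones):
       | 
       |   import numpy as np
       |   import librosa
       |   from PIL import Image
       | 
       |   y, sr = librosa.load("riff.wav", sr=22050)
       |   S = librosa.feature.melspectrogram(
       |       y=y, sr=sr, n_mels=512, n_fft=8192, hop_length=512)
       |   S_db = librosa.power_to_db(S, ref=np.max)  # log amplitude
       |   # normalize to 0..255, flip so low freqs sit at the bottom
       |   img = 255 * (S_db - S_db.min()) / (S_db.max() - S_db.min())
       |   Image.fromarray(img.astype(np.uint8)[::-1]).save("riff.png")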
        
         | dangom wrote:
         | This idea is presented by Jeremy Howard in literally the
         | first Deep Learning for Coders class (most recent edition). A
         | student wanted to classify sounds but only knew how to do
         | vision, so they converted the sounds to spectrograms, fine-
         | tuned the model on the labelled spectra, and the classification
         | worked pretty well on test data. That of course does not take
         | the merit away from the Riffusion authors though.
        
           | Analog24 wrote:
           | The idea of connecting CV to audio via spectrograms predates
           | Jeremy Howard's course by quite a bit. That's not really the
           | interesting part here though. The fact that a simple
           | extension of an image generation pipeline produces such
           | impressive results with generative audio is what is
           | interesting. It really emphasizes how useful the idea of
           | stable diffusion is.
           | 
           | edit: added a bit more to the thought
        
         | superpope99 wrote:
         | I'm super excited about the Audio AI space, as it seems
         | permanently a few years behind image stuff - so I think we're
         | going to see a lot more of this.
         | 
         | If you're interested, the idea of applying Image processing
         | techniques to Spectrograms of audio is explored in brief in the
         | first lesson of one of the most recommended AI courses on HN:
         | Practical Deep Learning for Coders
         | https://youtu.be/8SF_h3xF3cE?t=1632
        
         | rco8786 wrote:
         | > I can't wait to hear some serious AI music-making a few years
         | from now.
         | 
         | I think this will be particularly useful for musical
         | compositions in movies and film, where the producer can
         | "instruct" the AI about what to play, when, and how to
         | transition so that the music matches the scene progression.
        
           | adamhp wrote:
           | Not only that but sampling. I'd say there's at least one
           | sample from something in most modern music. This can
           | essentially create "sounds" that you're looking for as an
           | artist. I need a sort of high pitched drone here... Rather
           | than dig through sample libraries you just generate a few
           | dozen results from a diffusion model with some varying inputs
           | and you'd have a small sample set on the exact thing you're
           | looking for. There's already so much processing of samples
           | after the fact, the actual quality or resolution of the
           | sample is inconsequential. In a lot of music, you're just
           | going after the texture and tonality and timbre of
           | something... This can be seen in some Hans Zimmer videos of
           | how he slows down certain sounds massively to arrive at new
           | sounds... or in granular synthesis... This is going to open
           | up a lot of cool new doors.
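           | 
           | That kind of variation is easy to batch out today. A sketch
           | of the Zimmer-style stretch with librosa (file names and
           | amounts are illustrative):
           | 
           |   import librosa
           |   import soundfile as sf
           | 
           |   y, sr = librosa.load("sample.wav", sr=None)
           |   # 4x slower, for drones and textures
           |   slowed = librosa.effects.time_stretch(y, rate=0.25)
           |   # an octave down, keeping the duration
           |   low = librosa.effects.pitch_shift(y, sr=sr, n_steps=-12)
           |   sf.write("slowed.wav", slowed, sr)
           |   sf.write("low.wav", low, sr)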
        
           | tstrimple wrote:
           | I was thinking gaming where music can and should dynamically
           | shift based on different environmental and player conditions.
        
         | sebzim4500 wrote:
         | I suspect that if you had tried this with previous image models
         | the results would have been terrible. This only works since
         | image models are so good now.
        
         | josalhor wrote:
         | Makes me wonder if we will see a generalization of this idea.
         | Just like in a CPU, where 90%+ of what you want to do can be
         | modeled with very few instructions (mov, add, jmp...), we
         | could see a set of very refined models (Stable Diffusion,
         | GPT, etc.) and all of their abstractions on top (ChatGPT,
         | Riffusion, etc.).
        
           | DarmokJalad1701 wrote:
           | Maybe next up is a model that generates Piet code
           | 
           | https://www.dangermouse.net/esoteric/piet.html
        
           | amelius wrote:
           | Perhaps GPT could run on top of Stable-diffusion, generating
           | output in the form of written text (glyphs).
        
           | danuker wrote:
           | Indeed, I think this would be a cost-effective way to go
           | forward.
        
         | Tenoke wrote:
         | For what it's worth, people were trying the same thing with GANs
         | (I also played with doing it with stylegan a bit) but the
         | results weren't as good.
         | 
         | The amazing thing is that the current diffusion models are so
         | good that the spectrograms are actually reasonable enough
         | despite the small room for error.
        
         | munificent wrote:
         | As someone who loves making music and loves listening to music
         | made by other humans with intention, it just makes me sad.
         | 
         | Sure, AI can do lots of things well. But would you rather live
         | in a world where humans get to do things they love (and are
         | able to afford a comfortable life while doing so) or a world
         | where machines do the things humans love and humans are
         | relegated to the remaining tasks that machines happened to be
         | poorly suited for?
        
           | bheadmaster wrote:
           | As someone who loves making music and loves listening to
           | music (regardless of its origins, in my case), it doesn't
           | make me that sad. Sure, at first, I had an uncomfortable
           | feeling that AI could make this sacred magic thing that only
           | I and other fellow humans know how to do... But then I
           | realized same thing is happening with visual art, so I
           | applied the same counterarguments that've been cooking in my
           | head.
           | 
           | I think that kind of attitude is defeatist - it's implying
           | that humans will be stopped from making music if AI learns
           | how to do it too. I don't think that will happen. Humans will
           | continue making music, as they always have. When Kraftwerk
           | started using computers to make music back in the 70s, people
           | were also scared of what that will do to musicians. To be
           | fair, live music _has_ died out a bit (in a sense that there
           | aren 't that many god-on-earth-level rockstars), but it's
           | still out there, people are performing, and others who want
           | to listen can go and listen.
           | 
           | Maybe consumers will start consuming more and more AI music,
           | instead of human music [0], but the worst thing that can
           | happen is that music will no longer be a profitable activity.
           | But then again, today's music industry already has some
           | elements of automation - washed-out rhythms, the same
           | sexual themes over and over again, re-hashing the same old
           | songs in different packages... So nothing's gonna change in
           | the grand scheme of things.
           | 
           | [0] https://www.youtube.com/watch?v=S1jWdeRKvvk
        
           | bawolff wrote:
           | I'd rather live in the world where humans do things that are
           | actually unique and interesting, and aren't essentially being
           | artificially propped up by limiting competition.
           | 
           | I don't see this as a threat to human ingenuity in the
           | slightest.
        
           | int_19h wrote:
           | I would rather live in a world where humans get to do things
           | they love _because they can_ (and not because they have to
           | earn their bread), and machines get to do basically
           | everything that _needs_ to be done but no human is willing to
           | do it.
           | 
           | Advancing AI capabilities in no way detracts from this. You
           | talk about humans being "relegated to the remaining tasks" -
           | but that's a consequence of our socioeconomic system, not of
           | our technology.
        
         | hackernewds wrote:
         | You already hear a ton of them. The lofi music on these
         | massively popular channels is basically auto-generated
         | "music" + auto-generated artwork.
        
           | cjtrowbridge wrote:
           | Do you have any sources for more information about this?
        
       | farmin wrote:
       | That church bell one is amazing. Very creative transition.
        
       | flaviuspopan wrote:
       | I'm floored, the typing to jazz demo is WILD! Please keep pushing
       | this space, you've got something real special here.
        
       | bulbosaur123 wrote:
       | Anyone interested in joining an unofficial Riffusion Discord,
       | let's organize here: https://discord.gg/DevkvXMJaa
       | 
       | Would be nice to have a channel where people can share Riffs they
       | come up with.
        
       | ElijahLynn wrote:
       | I was confused because I must not have read carefully that the
       | working web app is at https://www.riffusion.com/. Go to
       | https://www.riffusion.com/ and press the play button to see it
       | in action!
        
       | rmetzler wrote:
       | "Jamaican rap" - usually the genre (e.g. Sean Paul) is called
       | Dancehall.
        
       | woeirua wrote:
       | Very cool, but the music still has a very "rough", almost
       | abrasive tinge to it. My hunch is that it has to do with the
       | phase estimates being off.
       | 
       | Who's going to be first to take this approach and use it to
       | generate human speech instead?
        
       | ubj wrote:
       | This happened earlier than I expected, and using a much different
       | technique than I expected.
       | 
       | Bracing myself for when major record labels enter the copyright
       | brawl that diffusion technology is sparking.
        
       | adzm wrote:
       | Really fascinating. I'd be interested to know more about how it
       | was trained, with what data exactly.
        
       | rslice wrote:
       | deleted
        
         | dangond wrote:
         | I think you'll find plenty of people who find that DAWs and
         | music theory help them better find self-expression and
         | celebrate life through their music. Any tool or framework that
         | opens up new modes of achieving that self-expression should be
         | celebrated, not shunned because it isn't as "pure" as more time
         | and labor intensive methods. Would you rather someone be forced
         | to dedicate a significant amount of time to studying music and
         | art creation just to be able to find that self-expression?
        
           | [deleted]
        
       | phneutral26 wrote:
       | Right now it still seems to lack the horsepower for this many
       | users. Hope it gets into a better state soon, but I am bookmarking
       | this right now!
        
       | bluebit wrote:
       | And we broke it.
        
       | logn wrote:
       | Congratulations this is an amazing application of technology and
       | truly innovative. This could be leveraged by a wide range of
       | applications that I hope you'll capitalize on.
        
       | bogwog wrote:
       | Damn this is insane. I wonder what other things can be encoded as
       | images and generated with SD?
        
       | Pepe1vo wrote:
       | I find it really cool that the "uncanny valley" that's audible on
       | nearly every sample is exactly as I would imagine that the visual
       | artifacts would sound that crop up in most generated art. Not
       | really surprising I guess, but still cool that there's such a
       | direct correlation between completely different mediums!
        
       | isoprophlex wrote:
       | I wonder how they got their training data! The spectrogram
       | trick is genius, but not of much use without high-quality,
       | diverse data to train on.
        
       | m3kw9 wrote:
       | They've got a looooong way to go man
        
         | ihatepython wrote:
         | I agree but it's better than listening to Ed Sheeran
         | 
         | Edit: To be honest, I find something like 'Band In A Box' to be
         | more impressive and actually useful, I don't understand how I
         | would ever use this or listen to this. To me, it's further
         | proof that Stable Diffusion really just doesn't work that well
        
       | GaggiX wrote:
       | You can train/fine-tune a Stable Diffusion model at an
       | arbitrary aspect ratio/resolution, and the model then starts
       | creating coherent images. It would be cool to try fine-
       | tuning/training this model on entire songs by extending the
       | time dimension (the attention layer at the usual 64x64
       | resolution should also be removed, or it would eat too much
       | memory).
        
       | londons_explore wrote:
       | I propose that while you are GPU limited, you make these changes:
       | 
       | * Don't do the alpha fade to the next prompt - just jump straight
       | to alpha=1.0.
       | 
       | * Pause the playback if the server hasn't responded in time,
       | rather than looping.
        
       | birdyrooster wrote:
       | I know it sounds like I am going to be sarcastic, but I mean all
       | of this in earnest and with good intention. Everything this
       | generates is somehow worse than the thing it generated before it.
       | Like the uncanny valley of audio had never been traversed in such
       | high fidelity. Great work!
        
       | fernandohur wrote:
       | I found this awesome podcast that goes into several AI & music
       | related topics
       | https://open.spotify.com/show/2wwpj4AacVoL4hmxdsNLIo?si=IAaJ...
       | 
       | They even talk specifically about applying Stable Diffusion to
       | spectrograms.
        
       | stevehiehn wrote:
       | Really great! I've been using diffusion as well to create sample
       | libraries. My angle is to train models strictly on chord
       | progression annotated data as opposed to the human descriptions
       | so they can be integrated into a DAW plugin. Check it out:
       | https://signalsandsorcery.org/
        
       | 2devnull wrote:
       | Was just watching an interview of Billy Corgan (smashing
       | pumpkins) on Rick Beato's YouTube[1] last night where Billy was
       | lamenting the inevitable future where the "psychopaths" in the
       | music biz will use ai and auto tune to churn out three chord non-
       | music mumble rap for the youth of tomorrow, or something to that
       | effect. It was funny because it's the sad truth. It's already
       | here but new tech will allow them to cut costs even more, and
       | increase their margins. No need for musicians. Really cool on one
       | hand, in the same way fentanyl is cool -- or the cotton gin, but
       | a bit depressing on the other, if you care about musicians. I and
       | a few others will always pay to go to the symphony, so good
       | players will find a way to get paid, but this is what kids will
       | listen to, because of the profit margin alone.
       | 
       | [1] https://m.youtube.com/watch?v=nAfkxHcqWKI
        
         | visarga wrote:
         | > new tech will allow them to cut costs even more, and increase
         | their margins
         | 
         | How, when everyone and their dog can generate such music? It's
         | gonna be like stock photography in the age of SD.
        
       | gardenhedge wrote:
       | Impressive. And this is a hobby project... amazing.
        
       | aquanext wrote:
       | Someone please train it on John Coltrane.
        
       | MagicMoonlight wrote:
       | Plug this into a video game and you could have GTA 6 where the
       | NPCs have full dialogue with the appearance of sentience,
       | concerts where imaginary bands play their imaginary catalogue
       | live to you and all kinds of other dynamically generated content.
        
       | wwarner wrote:
       | BOOM! Yes!
        
       | lftl wrote:
       | I just wanted to say you guys did an amazing job packaging this
       | up. I managed to get a local instance up and running against my
       | local GPU in less than 10 minutes.
        
       | Raed667 wrote:
       | Seems to be a victim of its own success:
       | 
       | - No new result available, looping previous clip
       | 
       | - Uh oh! Servers are behind, scaling up
       | 
       | I hope Vercel people can give you some free credits to scale it
       | up.
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2022-12-15 23:00 UTC)