[HN Gopher] Expressive text-to-image generation with rich text
       ___________________________________________________________________
        
       Expressive text-to-image generation with rich text
        
       Author : plurby
       Score  : 39 points
       Date   : 2023-10-04 19:21 UTC (3 hours ago)
        
 (HTM) web link (rich-text-to-image.github.io)
 (TXT) w3m dump (rich-text-to-image.github.io)
        
       | pugworthy wrote:
       | I would love to experiment with the idea of font interpretation.
       | People can and do anthropomorphize fonts, but then they have
       | names with meanings which might or might not be useful.
       | 
       | For example, I'm wondering if a prompt written in Comic Sans
       | should be turned into a comic-style illustration, or does it come
       | out as a simplistic and childish drawing? Is a gothic font meant
       | to imply a style of architecture, old Germanic peoples, or goth
       | music and style?
        
       | gorenb wrote:
       | my god, i think midjourney and dalle should do this now
        
       | 90-00-09 wrote:
       | I like this idea. It could be handy to be able to focus on
       | individual descriptions in complex prompts. Is this then mostly a
       | "UI" feature that is being translated to a traditional prompt?
       | 
       | (As a side note: using decorative typefaces was an unconvincing
       | example.)
        
       | minimaxir wrote:
       | A relatively functionally similar approach is prompt term
       | weighting with libraries such as compel:
       | https://github.com/damian0815/compel
       | 
       | Prompt weighting alone can fix undesired aspects of an output,
       | especially with SDXL and its dual text encoders.
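As a rough illustration of what prompt term weighting means, the sketch below parses a suffix notation in which trailing "+" or "-" marks raise or lower a term's weight. The function name and the 1.1/0.9 factors are assumptions made for this example, loosely modeled on the kind of syntax compel exposes; it is not compel's implementation, and the library's documentation is the reference for its real syntax and behavior.

```python
import re

# Assumed convention for this sketch: each "+" multiplies a term's weight
# by 1.1 and each "-" multiplies it by 0.9. These factors are illustrative.
UP, DOWN = 1.1, 0.9

def parse_weighted_prompt(prompt: str) -> list[tuple[str, float]]:
    """Split a prompt into (term, weight) pairs (hypothetical helper)."""
    pairs = []
    for token in prompt.split():
        # Group 1 is the term, group 2 the optional run of "+" or "-" marks.
        match = re.fullmatch(r"(.+?)(\++|-+)?", token)
        term, marks = match.group(1), match.group(2)
        if marks is None:
            weight = 1.0
        elif marks[0] == "+":
            weight = UP ** len(marks)
        else:
            weight = DOWN ** len(marks)
        pairs.append((term, weight))
    return pairs

if __name__ == "__main__":
    for term, weight in parse_weighted_prompt("a cat-- playing with a ball++"):
        print(f"{term}: {weight:.2f}")
```

In a real pipeline the weights would be applied to the text encoder's output for each term rather than printed; that step depends on the model and library in use and is left out here.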
        
       | Der_Einzige wrote:
       | I LOVE this.
       | 
       | All of the techniques that they are showing have already existed
       | for a while in places like Automatic1111/ComfyUI or their extensions
       | (i.e. regional prompting, attention weights). Having it connect
       | so seamlessly with rich text is awesome and is a cool UI trick
       | that might make normies notice it.
       | 
       | Also, related, but NLP is extremely undertooled on the prompt
       | engineering side. Most of the techniques here would work just
       | fine on any LLM. If you don't believe me, read this:
       | https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...
        
       | LASR wrote:
       | How well does this work with LLMs? Anyone tried this? I am
       | curious about the references and footnotes approach the most.
        
       | simbolit wrote:
       | I looked at this, and thought about it, and then I waited for an
       | hour, and now I looked at it again, and I can't help but think
       | this is useless.
       | 
       | We can already weight parts of prompts, we can already specify
       | colors or styles for parts of the images. And even if we could
       | not, none of this needs rich text.
       | 
       | To begin with, I even think their comparisons are dishonest.
       | They compare "plaintext" prompts with "rich text" prompts, but
       | the rich text prompts contain more information. What? Like,
       | seriously, who is surprised the following two prompts give
       | different images?
       | 
       | (1) "A girl with long hair sitting in a cafe, by a table with
       | coffee on it, best quality, ultra detailed, dynamic pose."
       | 
       | (2) "A girl with long [Richtext:orange] hair sitting in a cafe,
       | by a table with coffee on it, best quality, ultra detailed,
       | dynamic pose. [Footnote:The ceramic coffee cup with intricate
       | design, a dance of earthy browns and delicate gold accents. The
       | dark, velvety latte is in it.]"
       | 
       | The worst part is "Font style indicates the styles of local
       | regions". In the comparison with other methods section they
       | actually have to specify in parentheses what each font means
       | style-wise, because nobody knows and (let's be frank) nobody
       | wants to learn.
       | 
       | So why not just use these plaintext parentheses in the prompt?
       | 
       | I really stopped myself from immediately posting my (rather
       | negative) opinion, but after over an hour, it hasn't changed. As
       | far as I can see, this isn't useful, rich text prompts are a
       | gimmick.
        
         | aenvoker wrote:
         | The rich text presentation is merely cute. But, the underlying
         | feature is very nice. Being able to focus details on a specific
         | aspect of an image without worrying about it leaking into other
         | aspects would be greatly appreciated.
         | 
         | How about a plain-text interface like this?
         | 
         | > A girl with [long hair](orange) sitting in a cafe, by a table
         | with [coffee](^1) on it, best quality, ultra detailed, dynamic
         | pose. [^1](Ceramic coffee cup with intricate design, a dance of
         | earthy browns and delicate gold accents. The dark, velvety
         | latte is in it.)
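The plain-text markup proposed above could be parsed with a couple of regular-expression passes. The following is a minimal sketch, assuming the `[span](annotation)` and `[^n](footnote text)` forms exactly as written in the comment; the function name is hypothetical, and it only separates the base prompt from the per-span details without doing any image generation.

```python
import re

# Matches [label](body); assumes neither part contains nested brackets
# or parentheses, which holds for the example prompt.
SPAN = re.compile(r"\[([^\]]+)\]\(([^)]*)\)")

def parse_annotated_prompt(text: str) -> tuple[str, list[tuple[str, str]]]:
    """Return (plain prompt, [(span, detail), ...]) (hypothetical helper)."""
    footnotes: dict[str, str] = {}

    # Pass 1: collect footnote definitions such as [^1](...) and drop them.
    def collect(match: re.Match) -> str:
        label, body = match.group(1), match.group(2)
        if label.startswith("^"):
            footnotes[label] = body
            return ""
        return match.group(0)

    without_defs = SPAN.sub(collect, text)

    # Pass 2: strip the markup from annotated spans and record each detail.
    annotations: list[tuple[str, str]] = []

    def strip(match: re.Match) -> str:
        span, note = match.group(1), match.group(2)
        # A note like "^1" resolves to its footnote; anything else
        # (or an undefined reference) is kept as-is.
        annotations.append((span, footnotes.get(note, note)))
        return span

    plain = SPAN.sub(strip, without_defs)
    return " ".join(plain.split()), annotations

if __name__ == "__main__":
    prompt = (
        "A girl with [long hair](orange) sitting in a cafe, by a table "
        "with [coffee](^1) on it. [^1](Ceramic coffee cup with intricate design.)"
    )
    plain, details = parse_annotated_prompt(prompt)
    print(plain)
    print(details)
```

The returned pairs are what a region-aware generator would need: the base prompt describes the whole image, and each (span, detail) pair says which words a detail should be confined to.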
        
           | phil-martin wrote:
           | It feels like that is where the real value is. Imagine
           | describing all the assets of a game, story, or something
           | larger than just a single image as mainly "what"
           | descriptions, referring to broad styles of things. And then a
           | second body of text describing those styles in detail.
           | 
           | It could be a text description of a fighter or noble wearing
           | coats or armour. And then substitute in different style
           | descriptions of coats and armour depending on the family,
           | class, race or other attributes suitable for the world you're
           | trying to generate.
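The split between "what" descriptions and a separate body of style text can be sketched with ordinary template substitution. Everything below (the slot names, the families, and the style wording) is invented for illustration; it only shows how one scene description could be reused with different style details.

```python
from string import Template

# The "what": a scene description that names slots instead of details.
SCENE = Template("A $role wearing $coat and $armour, standing in a courtyard")

# The "how": per-family style descriptions (all entries are made up).
STYLES = {
    "northern": {
        "coat": "a heavy grey fur-lined coat",
        "armour": "dark riveted plate armour",
    },
    "coastal": {
        "coat": "a light blue linen coat",
        "armour": "polished bronze scale armour",
    },
}

def render_prompt(role: str, family: str) -> str:
    """Fill the scene's slots with the chosen family's style text."""
    return SCENE.substitute(role=role, **STYLES[family])

if __name__ == "__main__":
    print(render_prompt("fighter", "northern"))
    print(render_prompt("noble", "coastal"))
```

Swapping the family changes only the style text, so the same scene list could be rendered once per family, class, or other attribute of the world.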
        
         | EL_Loco wrote:
         | I had the same thought. The gothic church one, for example. Why
         | wouldn't I just write "A pink gothic church in the sunset"
         | instead of writing "A gothic church" and then having to do the
         | extra steps to turn the word "church" pink? Of course, I'm
         | very ignorant of the uses of such tech, so there's probably
         | some usefulness in this.
        
           | 90-00-09 wrote:
           | The value I see is in constructing more complex prompts.
           | Agree with your example but could see myself using this
           | feature for prompts with multiple objects/aspects that
           | require specific details. Probably not much different from
           | inlining all details, just a nice separation of concerns: you
           | can describe the high level requirement first, and then add
           | and tweak individual details.
        
       ___________________________________________________________________
       (page generated 2023-10-04 23:00 UTC)