[HN Gopher] Expressive text-to-image generation with rich text
___________________________________________________________________

Expressive text-to-image generation with rich text

Author : plurby
Score  : 39 points
Date   : 2023-10-04 19:21 UTC (3 hours ago)

(HTM) web link (rich-text-to-image.github.io)
(TXT) w3m dump (rich-text-to-image.github.io)

| pugworthy wrote:
| I would love to experiment with the idea of font interpretation.
| People can and do anthropomorphize fonts, but fonts also have
| names with meanings which might or might not be useful.
|
| For example, I'm wondering whether a prompt written in Comic Sans
| should be turned into a comic-style illustration, or whether it
| comes out as a simplistic and childish drawing. Is a gothic font
| meant to imply a style of architecture, old Germanic peoples, or
| goth music and style?
| gorenb wrote:
| My god, I think Midjourney and DALL-E should do this now.
| 90-00-09 wrote:
| I like this idea. It could be handy to be able to focus on
| individual descriptions in complex prompts. Is this then mostly a
| "UI" feature that is being translated to a traditional prompt?
|
| (As a side note: using decorative typefaces was an unconvincing
| example.)
| minimaxir wrote:
| A functionally similar approach is prompt term weighting with
| libraries such as compel:
| https://github.com/damian0815/compel
|
| Prompt weighting alone can fix undesired aspects of an output,
| especially with SDXL and its dual text encoders.
| Der_Einzige wrote:
| I LOVE this.
|
| All of the techniques they are showing have existed for a while
| in places like Automatic1111/ComfyUI or their extensions (i.e.
| regional prompting, attention weights). Having it connect so
| seamlessly with rich text is awesome, and it's a cool UI trick
| that might make normies notice it.
|
| Also, related: NLP is extremely undertooled on the prompt
| engineering side. Most of the techniques here would work just
| fine on any LLM.
| If you don't believe me, read this:
| https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...
| LASR wrote:
| How well does this work with LLMs? Has anyone tried it? I am
| most curious about the references and footnotes approach.
| simbolit wrote:
| I looked at this and thought about it, then I waited for an hour
| and looked at it again, and I can't help but think this is
| useless.
|
| We can already weight parts of prompts, and we can already
| specify colors or styles for parts of the images. And even if we
| could not, none of this needs rich text.
|
| To begin with, I think their comparisons are dishonest. They
| compare "plaintext" prompts with "rich text" prompts, but the
| rich text prompts contain more information. What? Like,
| seriously, who is surprised that the following two prompts give
| different images?
|
| (1) "A girl with long hair sitting in a cafe, by a table with
| coffee on it, best quality, ultra detailed, dynamic pose."
|
| (2) "A girl with long [Richtext:orange] hair sitting in a cafe,
| by a table with coffee on it, best quality, ultra detailed,
| dynamic pose. [Footnote: The ceramic coffee cup with intricate
| design, a dance of earthy browns and delicate gold accents. The
| dark, velvety latte is in it.]"
|
| The worst part is "font style indicates the styles of local
| regions". In the comparison-with-other-methods section they
| actually have to specify in parentheses what each font means
| style-wise, because nobody knows and (let's be frank) nobody
| wants to learn.
|
| So why not just use those plaintext parentheses in the prompt?
|
| I really stopped myself from immediately posting my (rather
| negative) opinion, but after over an hour it hasn't changed. As
| far as I can see, this isn't useful; rich text prompts are a
| gimmick.
| aenvoker wrote:
| The rich text presentation is merely cute, but the underlying
| feature is very nice.
| Being able to focus details on a specific aspect of an image
| without worrying about them leaking into other aspects would be
| greatly appreciated.
|
| How about a plain-text interface like this?
|
| > A girl with [long hair](orange) sitting in a cafe, by a table
| with [coffee](^1) on it, best quality, ultra detailed, dynamic
| pose. [^1](Ceramic coffee cup with intricate design, a dance of
| earthy browns and delicate gold accents. The dark, velvety latte
| is in it.)
| phil-martin wrote:
| It feels like that is where the real value is. Imagine describing
| all the assets of a game, story, or something larger than a
| single image as mainly "what" descriptions referring to broad
| styles of things, and then a second body of text detailing those
| styles in depth.
|
| It could be a text description of a fighter or noble wearing a
| coat or armour, and then you could substitute in different style
| descriptions of coats and armour depending on the family, class,
| race, or other attributes suitable for the world you're trying to
| generate.
| EL_Loco wrote:
| I had the same thought. Take the gothic church one, for example.
| Why wouldn't I just write "A pink gothic church in the sunset"
| instead of writing "A gothic church" and then having to do the
| extra steps to turn the word "church" pink? Of course, I'm very
| ignorant of the uses of such tech, so there's probably some
| usefulness in this.
| 90-00-09 wrote:
| The value I see is in constructing more complex prompts. I agree
| with your example, but I could see myself using this feature for
| prompts with multiple objects/aspects that require specific
| details. It's probably not much different from inlining all the
| details, just a nice separation of concerns: you can describe the
| high-level requirement first, and then add and tweak individual
| details.
___________________________________________________________________
(page generated 2023-10-04 23:00 UTC)
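[Editor's sketch] The plain-text interface aenvoker proposes in the thread
(`[span](attribute)` for annotated spans and `[^1](text)` for footnotes) is
straightforward to mechanize. Below is a minimal, hypothetical parser showing
how such a prompt could be split into a plain prompt, annotated spans, and a
footnote table; the syntax and the `parse_prompt` helper are this thread's
proposal and an illustration, not part of the rich-text-to-image project's
actual code.

```python
import re

# Matches both annotated spans like [long hair](orange) and
# footnote definitions like [^1](Ceramic coffee cup ...).
SPAN_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

def parse_prompt(prompt):
    """Split a marked-up prompt into (plain_prompt, spans, footnotes).

    spans     -- list of (text, attribute) pairs, e.g. ("long hair", "orange");
                 an attribute like "^1" refers to a footnote
    footnotes -- dict mapping footnote label ("1") to its description
    """
    footnotes = {}
    spans = []

    def handle(match):
        text, attr = match.group(1), match.group(2)
        if text.startswith("^"):       # [^1](...) defines a footnote
            footnotes[text[1:]] = attr
            return ""                  # drop the definition from the prompt
        spans.append((text, attr))
        return text                    # keep the span text in the prompt

    plain = SPAN_RE.sub(handle, prompt).strip()
    return plain, spans, footnotes
```

On the thread's example, this yields the base prompt with the markup removed,
the span/attribute pairs (so a downstream model could apply "orange" only to
"long hair"), and the footnote text keyed by "1" for the `[coffee](^1)`
reference.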