Building The World's Best Image Diffusion Model
Suhail Doshi sits down with the hosts to talk about his experience building Playground, and what it takes to make a SOTA model.
Transcript
I think we thought the product was gonna be one way, and then we literally ripped it all up in a month and a half or so before release. We were sort of, like, lost in the jungle for a moment,
like a bit of a panic. There's a lot of unsolved problems, basically. I mean, even this version of it, you know, people are gonna try it, and then
they might be blown away by it, but, like, the next one's gonna be even crazier. To get to SOTA, you basically have to be maniacal about, like, every detail. There could be some people that train their models and they get cool text generation, but the kerning is off. Are you the kind of person that will care about the kerning being off? Or are you the kind of person that is okay with it?
Like, or you don't even notice.
Welcome back to another episode of the Light Cone. I'm Gary. This is Jared, Harj, and Diana. And collectively, we have funded companies worth hundreds of billions of dollars, usually just with one or two people just starting out. And we're in the middle of this crazy AI revolution.
And so we thought we would invite our friend Suhail Doshi, founder and CEO of Playground, which is the state of the art image generation model with also a state of the art user experience, and it just launched. So how you feeling, Suhail?
Very under pressure right now.
Good to start like a startup founder. Yes. Which is normal. Yeah. Maybe the best way to start off is to look at some examples of the images that you were able to generate. And this is stuff sort of right off the presses. Mhmm. So at Y Combinator, I also am one of the group partners.
So I fund a number of companies every batch. I funded about 15 for the summer batch. And so what we're looking at here is one of the t-shirt designs I made. As you can see, there's a GPU, and it was based on one of the core templates in your library. I like metal, so this very much spoke to me.
This one was off of a sticker design, and I guess I just really liked that sword, and what I was able to do is add GPU fans. Love it. I love it. And so that's one of the noteworthy things about Playground. You can upload an image, and it'll sort of extract the essence of, like, the aesthetic and some of the features of it. This one feels
like a tattoo. Yeah. Exactly.
Do you remember what you prompted it with to get those? Oh, yeah. I basically
so the cool thing about Playground to create this was, I picked a default template that I liked, and I think it only had the sword and sort of this ribbon. And I said, make it say house tan on the ribbon, and add a GPU with two fans. I was very specific. I wanted a two-fan GPU. And that's one of the things that you'll see in all these designs.
This is actually the t-shirt that house tan itself actually chose.
So,
you know, it's a very summery vibe. I think this was based on something around summer and surfing, and we replaced the surfboard with the GPU. I feel like you used a preset that we had. I did. Yeah. All of these are some presets. They're pretty good.
I think the noteworthy thing is that I didn't have to, like, prompt and re-prompt and re-prompt and sort of keep trying to refine the same text prompt. Like, I actually could just talk to a designer and it would just give me what I wanted. Going from left to right, for instance, by default, I think the template had this yellowish background, and I said, make it on white.
And so that was, like, a very unusual interaction that I'm not used to. Like, usually, you're either used to Discord with Midjourney, or you're used to a chat interface, or, like, prompt and then twiddle things and re-prompt and re-prompt and re-prompt. Whereas this felt much more natural-language.
I could just talk to, you know, a machine designer that would take my feedback into account.
Yeah. Normally when you make these kinds of images, you have to describe all of it. Right? You'd have to say, I want it on this beige background, and I want this orange sunset. And then you'd have to even describe the lines of the sun. Or you don't describe very much, and then every time you try, it's, like, totally different from the other thing.
So usually, you know, you either have to learn, like, a magical incantation of words, versus being able to, like, pick something that you start from. And then also with these images, Gary, did you add this text in post-processing? Or is the model actually, like,
incorporating the text organically? Oh, the model will both take your direction on what should be there and what its size is. You can actually specify where in the design. You could say, I want it in the middle. I want it at the top. Could we use a font that's bigger or smaller, you know, better lettering? Could you kern it a little bit?
Like, you could just speak to it in plain English, and I'd never seen that in any image model before. That's crazy, because the text is flawless. Yep. And anyone who's used DALL-E knows that if you try to get it to write text, the text comes out garbled and zombie-like. Yeah. It's pretty incredible having just accurate text and then being able to position the text exactly where you want.
That is very cool. It is really SOTA in terms of text adherence
and coherence in terms of following prompts, which is really cool. One thing that we think is really cool is it's inventing fonts. Nice. Like, I don't know what font that is. Yeah. It might be a real font, but I think that there are all these circumstances where it's actually just it's like extrapolating from many different kinds of fonts and actually inventing new things,.
which is really, really cool. Okay. And these are just a couple other versions. You know, I saw some old-timey thing and I was like, okay, could you do a vector-y version of a GPU on the left? And then on the right, you know, there was a very sort of Japanese art-house aesthetic. These are great.
And then this one, if left to my own devices, I was gonna print this one, because I really liked it. The right side one? Yeah. And what I could do is actually tell it, like, make it even more sort of prototypically Japanese art. You know? Like, I want more waves. More sun. And, you know, it basically kept doing it. I think I know this preset.
I remember making this preset, like, a month and a half ago. I think it's called, like, Mythic Ink or something like that. That's how the app works. You know, you open the app, you select a preset, or you can upload your own design that you're really into, and then it will seemingly extract the vibe of that particular thing. It's not gonna be a copy.
It will be a remix.
Did you purposely design it to be so good with text? Or is that, like, an emergent property of just how you architected everything? We definitely focused on making text accuracy
really good. I think it's been kind of our number one focus. And part of it is: text, to us, is so interrelated with actually the utility of graphics and design, because a lot of things without text just mostly feel like art. But yeah, text was an extraordinarily high priority, and it was really hard, actually. There was maybe a point there where, like, our text accuracy was 45%.
We were sort of, like, lost in the jungle for a moment.
Like, a bit of a panic, but we figured it out. I think one of the remarkable things on all these designs, and I was playing a lot with it as well, is that a lot of the outputs are very utilitarian and useful. Because I play with Midjourney and all of those, and I think they're fun, but they're more like toys, more like art. Mhmm.
But it's really hard to work with it if you actually wanted to design logos, t shirts, font sizes. I could totally see this replacing Adobe Illustrator.
Right? Right. Yeah. Yeah. I think that, you know, part of it's kinda funny. The reason why I'm partly so excited about graphic design is because, actually, when I was younger, when I was in high school, I used to do logo contests and I would try to win them. I think there's this site called, like, sitepoint.net or something, and I was just trying to make a little bit of money before college. And so I did all these logo designs and did all these tutorials, trying to win them. And so during the training of this model, I tested it for logos, and I started to be like, wow, it's actually way better than anything I could have made.
And then I've also made, like, my own company logos typically, which are also very bad. And so it just feels to me like, if you can get text and you can get these other kinds of use cases, you're probably going to be able to beat, like, at least the midpoint graphic designer that's in Illustrator.
And then I think over time, we should be able to get to the ninetieth-percentile graphic designer. So this is actually a really different
use case that really hasn't been addressed. You know, I haven't seen image models try to design graphics or illustrations. It's less, you know, generating really cool images that would replace stock art or something like that. It's more literally allowing you to create Canva-type things Right. whenever you want. And you don't have to mess around with it. It's plain English.
Just talk to the model, and the model's gonna create what you want. I've never seen anything like that.
Yeah. I think we were just sort of looking at what the use cases for graphic design are. And interestingly, it actually has a lot of real-world, physical impact, because there are, like, bumper stickers and t-shirts.
I was at Outside Lands the other weekend, and I was just looking at everyone's t-shirts, looking at what they have on them, and then I saw a bunch of women at Outside Lands had this t-shirt that said, I feel like 2007 Britney. I just thought that was such a cool shirt, and so we made the template for it and put it in the product.
There's just so much cool real-world impact. And I sometimes think that I'm almost a little disappointed that MySpace doesn't exist, for those that were on MySpace, because it was such an expressive social network, and I feel like humans really deeply care about that form of expression.
And so it's really cool to be able to make a model that's really focused on all those kinds of things. But you're actually building a product. It's not just research, because Yeah. With all these designs in Playground, you can actually go and purchase them, like the stickers, the t-shirts, right? Can you tell us about this marketplace that you're building? Yeah.
So I think that one thing that we learned was that it's kind of hard for people to prompt. And because it's hard to prompt, we found it's also hard to teach people how to prompt. And the truth is that when you make these models, it's not like we even know how it works. We are also discovering with the community how the model kind of works.
And so one of the things that we decided to do was me and our designer, we decided that one core belief was that the product should be visual first, not text first, which is a huge departure from language models and ChatGPT. Because our product is so visual, why should it not be?
And so in order to make it visual first and to make it so that you don't have to learn how to prompt, we decided that we would start from something like a template, which is something people already understand in a tool like Canva. Right? It's not something that we necessarily invented. Like there's templates everywhere.
But I think that if you could start from a template, and then we could make it really easy to modify that template, then it feels like we've already taken you, like, 80% of the journey. If it was, like, I feel like 2007 Britney, but then you wanted to change the celebrity and the year to a different person, then you totally could. We wanted to make that very easy.
But it also required a lot of integration with research, because how do you make these changes? How do you make them coherent? How do you keep things, like, similar? It's not as simple as, you know, just the 75, 77 tokens that you put into Stable Diffusion. The existing open source models aren't really capable of that. So it required, kind of,
yeah, like, the marrying of what a good product should feel like, and then, you know, enabling that with research, which is not always possible. I think that's what Gary was getting at with you building the state-of-the-art UX, the UI, for all these models. Because up to this point, people just get raw access.
It feels kinda like back in the day, you would just SSH to a computer Yeah. and kinda work with it. That's how people interact with these models. But you basically built a whole new browser into it. Nobody has done it, and you've done it really well. Can you talk about this idea of departing
from raw model access? Yeah. I think we just observed the users over eighteen months, like, failing. You know? So AI is a little bit weird right now, because there's such a big novelty factor, I would say. And it's exciting, because we're able to do things we've never been able to do before. And so as a result, you can easily get millions of users using your product.
And that's totally what happened to us. And so it feels almost like, oh, maybe I've got the product. But then when you actually go look at the data and how people are using it, there's just this constant failure of people using the product. And so Yeah. You're talking about the
prior version of Playground. The prior version of Playground, yeah. So it didn't have this type of model. It
used open source models. And then we started training some of our own that are very similar to Stable Diffusion, as a way to ramp up to where we are now. When we watched users prompt this model, obviously the two pieces of feedback were: this is fun, it's cool, I can get, like, a cat drinking beer. And then you post it to Twitter and it's exciting.
But then, but why would people come back is one big question. And then the second part is that people are using our service a lot, but they're not always using our service a lot because it's a useful thing. It's because they're not getting what they want, so they have to keep retrying. Yeah. You know how, like, Google's trying to get you off the website, you know, that sort of feeling?
Like, it's almost bad that people are using it too much, in some sense. And, you know, they just keep re we call it re-rolling. Right? They just keep re-rolling to get a different image, or a slightly better image, or to fix, like, a paw or a tail that's off. You know? And then the other thing that happened was that our model can take an extremely long prompt.
With most of these models, you can only write 75 tokens. But with our model, it's like 8,000. And most people are never really gonna go over a thousand right now. I say that now, but we'll see. A thousand tokens is a lot. And our model lets you be extremely descriptive. And so you can really describe the texture of the table, skin texture.
We have all those, like, puzzle prompts, where it's, like, a green triangle next to an orange cube, you know. And it works. Like, spatial reasoning is all there, actually,
including text. That's totally novel, and really, I'd never seen that before. Yeah. You know, with the first generation of these models, almost immediately what you'd do is say, like, generate me a green sphere on top of a blue, you know, triangle. Right. Yeah. And it just wouldn't do it. There'd be those elements, but it would just be all jumbled up, because Right. it was using CLIP.
It did not have contextual reasoning or understanding.
Yeah. And CLIP was trained with a lot of error, actually, because it's just using kind of the alt tags of the images that are scraped on the internet, which could be, like, anything. We sort of decided that what we were gonna spend our time on was prompt understanding and text generation accuracy, because we also felt like aesthetics were kind of saturating.
Like, they're getting better, but they're also just kinda not getting better at a fast enough rate. And users even vote: even in the Midjourney Discord, you know, they'll poll their users on what they want to be better, and aesthetics is going lower and lower on the rank of things that people care about.
So we wanted to try to leap on something that really mattered to users, which was prompt understanding and text generation accuracy for those kinds of use cases. But when you have a very long prompt, it's not really feasible to ask anybody, like, are you gonna write an essay?
And so we started to realize that actually the prompt is, it's kinda like HTML for graphics, which I think is so cool. I think you've done a lot here, because you have a completely novel architecture
that really gets to magical prompting, because the experience of using Playground feels as if you're talking to a designer. It has that coherence. It listens to you, because with, I don't know, with Midjourney, if you wanna move the text or things like that, it doesn't. Mhmm. The positional awareness is not there.
I guess one of the insights you had when we chatted a bit earlier, one of the problems you learned: to create good designs, you have to have a lot of description Mhmm. for the images. Yeah. And users are basically lazy. Right? Right. They might just tell you, I want a nature scene. And if you input this into Midjourney, what would it give you?
It's like Yeah. It'll give you like this very beautiful, very rich, high contrast nature scene. But you've done something very interesting. We wanna talk a bit about how you've done kinda aiding users.
and expanding upon the prompt to actually build something much better. The first thing to kind of, like, improve our prompt understanding was just, like, making your data better. It's actually just that simple. And so one of the first things we wanted to do was we wanted to have extremely detailed prompts. So when we train the model, we train on very, very detailed prompts.
But we also want users to feel like they could just say nature scene. And so sort of what you see here is just how detailed we can get. And we're actually even more detailed than this these days. When we train the next model, it'll be even more than this. But once you get to this level of detail, I mean, we're just teaching the model to represent all of these concepts correctly.
Whether something is in the center, or whether there's a background blur. One thing that we want to get better at, and I think we're actually already pretty good at this, is emotional expression. Like, we have this image of Elon Musk, and he's, like, disgusted. He's anxious. He's happy. He's sad.
He's confident. Like, trying to see his expression in all these different ways. And so that's just one thing that we want to make sure is represented in these prompts. There's obviously a lot more, like spatial location. And so by doing this, we can ensure that the model could be a good experience if you raw-prompted it as a user, if you just said nothing.
And then most of the time, users are not really writing more than maybe caption three here or something like that. I mean, even that's kind of a lot. That's a lot. I think when I was playing, I was mostly on five and six. Yeah, yeah, exactly. When you're playing around, the normies are kinda doing five and six. Yeah.
And then the, like, hardcore prompters are, like, copying each other's prompts, and then they end up more like one. But they don't even look like one. And one is a very unnatural way of typing. You know? Like, nobody's writing these essays and paragraphs of text. It's too much work. And that was one thing: we knew we were gonna probably fail if we expected users to do this.
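The caption-level idea being described can be sketched as a training-time choice: each image carries captions at several levels of detail, and the trainer samples a level per example so the model learns to serve both terse prompts ("nature scene") and essay-length ones. This is a hypothetical sketch, not Playground's actual pipeline; the level numbering, caption text, and sampling weight are all invented for illustration.

```python
import random

# Hypothetical: one image, captioned at decreasing levels of detail
# (level 1 = most detailed, level 6 = what a lazy user actually types).
captions = {
    1: "A watercolor nature scene at sunset: rolling hills, an orange sun "
       "with radiating line rays, a surfer in silhouette, beige background, "
       "centered serif text, subtle film grain.",
    3: "A nature scene at sunset with a surfer and a stylized orange sun.",
    5: "A nature scene at sunset.",
    6: "nature scene",
}

def sample_caption(captions: dict[int, str], rng: random.Random,
                   terse_weight: float = 0.3) -> str:
    """Pick one caption level for this training example.

    Oversampling the tersest caption teaches the model to invent
    plausible detail when the user says almost nothing, while the
    detailed levels teach it to obey long, specific prompts.
    """
    levels = sorted(captions)
    if rng.random() < terse_weight:
        return captions[levels[-1]]           # tersest caption
    return captions[rng.choice(levels[:-1])]  # one of the more detailed levels

rng = random.Random(0)
picks = [sample_caption(captions, rng) for _ in range(1000)]
```

The design choice worth noting is that the mixing ratio is itself a knob: weight the terse end too heavily and the model ignores detail; too lightly and short prompts produce bland images.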
So this kind of led us to a more visual approach, where you're picking something you already like in the world, that we understand how that's represented in our model. And then we can make those changes and edits and stuff like that. Is the benefit of, like, expanding the prompts this way that you're more likely to get what the user wants at the first go?
Or is it that it just makes it easier for them to iterate on it to get to what they want? I don't even know that we necessarily needed to do this. But I think the reason why we did it was because initially, we didn't know how good the model would be. And so we needed to serve users in the way that they already use the existing models. And so we didn't exactly know the breakthrough interface.
We hadn't gotten there yet. And so in order to make sure that we would work the way everyone is happy with, we wanted to do this segmented-out, almost lossy prompting. And so that's why we do it. But I think, you know, it's not even that necessary anymore.
And then the other reason to do it this way is, once the prompts get extremely detailed, it's hard to have much variation between the images, because you're locking in on your image. Yep. And so by having ambiguity in the prompt, you can get more variation. We call it image diversity.
So that way you say squash dish, but it's really different each time. Yep. I guess the cool thing about your product is you basically remove all of the prompt engineering and the guesswork, because you do it behind the scenes Yeah. by expanding and exploding it into this multi-caption-level system. Mhmm. Right?
I guess what comes to mind is sort of back in the day, if you needed to navigate a website through the command terminal, maybe you'd curl and do GETs and POSTs, literally typing the commands, until you had, like, a browser to actually give you the right UI. Right? And what I was telling my team was, I said, yeah, we should be doing the prompt engineering for users.
It should not be that the users are the prompt engineers, or the prompt graphic designers, if you will. What are we gonna do, write a manual on how to do this? It's just too tricky. Like, 1% of humanity will understand that manual, and the rest will be like, I don't know how to use this.
It's too difficult. So I told my team I think it's very important that we do all of that work. We should have an extremely strong sense of how the model works, rather than putting that on the users, where I think it's infeasible.
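Doing the prompt engineering for users can be pictured as expansion behind the scenes: the template the user picks visually maps to a detailed caption, and their plain-English edits are folded in before the model ever sees the prompt. Everything below, the template names, caption text, and edit format, is invented for illustration; the real system would do this with a model rather than string lookup.

```python
# Hypothetical template library: what the user picks visually, mapped to
# the detailed caption the model is actually conditioned on.
TEMPLATES = {
    "nature scene": (
        "A serene nature scene: rolling green hills under a soft orange "
        "sunset, high detail, gentle background blur, subject centered."
    ),
    "mythic ink sword": (
        "A die-cut sticker design of an ornate sword wrapped in a ribbon, "
        "bold tattoo-style linework, flat colors, white border."
    ),
}

def expand_prompt(selection: str, edits: tuple[str, ...] = ()) -> str:
    """Expand a template selection into a detailed model prompt,
    appending the user's plain-English edits in order.

    Unknown selections fall through unchanged, so power users can
    still write their own full prompt.
    """
    base = TEMPLATES.get(selection.strip().lower(), selection)
    for edit in edits:
        base += f" Edit: {edit}."
    return base

prompt = expand_prompt("nature scene", ("make it on white",))
```

The point of the sketch is the division of labor: the user never writes the long caption; they pick a template and issue edits, and the expansion happens out of sight.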
And then the other thing that we do is we now work with creators to help us construct these different templates and different prompts around these templates and stuff like that. And they might be the one percent of humanity that's willing to learn this on behalf of users. And this is totally normal. That's what YC does.
We, you know, like, build these great companies that billions of people in humanity use as a result of that. Yeah. I guess there's two things that come out of this. One is you might be creating a whole new profession.
Mhmm. Sort of like back in the day with design, you had Behance, where people hire designers. Right. Now people will,
through Playground, hire, like, AI designers that are this Right. top 1%. We're doing it, actually. So we are hiring them. You're hiring them? Yes, we're hiring them. We're going to launch a creator program soon, actually. And the goal is to bring on creators that have good taste.
That still matters. There is this image of a squash dish, but it's not a very beautiful image. And I think taste is still real in the world, and it's also real in design. You know, in LLMs you get to, like, measure how well you did on a biology test, and that's, like, a pretty objective thing. But for design, it's constantly evolving.
Like, design from ten years ago can look dated, unless you're, like, Dieter Rams. But I think, you know, more fundamentally, we wanna bring on creators that are gonna help create graphics that other people can then use,
and we're actually paying them. I guess one thing that's cool, the second thing: because of this, you actually are state of the art on many aspects of this model, and so much of it was driven by the product. Because now, in order to get the good captioning, you probably are beating GPT-4o. Right? In terms of image We are beating it, yeah. We now have a new, like, SOTA captioner. Yeah. To generate these.
And that was not just to be, like, a benchmark, but actually for a very practical purpose in building the model. Yeah. Can you maybe tell us a bit of what's underneath? Because PGv3, Playground v3, right, is all in-house and state of the art in many aspects. Yeah. So for the whole architecture of the model, we had to rip everything out.
So the normal Stable Diffusion architecture that people know about is: there's a variational autoencoder, a VAE, and then there's CLIP, and then there's this U-Net architecture, for people that are in the know. And then since then, it's kind of evolved to using more transformers.
There's this great paper by, I think it was William Peebles, that did DiT, which I think is what people believe Sora is based on as well. And then so there's some new models that are using that. We actually don't use any of those new architectures either. We did something completely from scratch.
But one of the reasons why we had to kind of blow everything up was because you can't really get this kind of prompt understanding using CLIP because there's just so much error in CLIP. And it's also just bounded by just the architecture of that model. And then the second thing is we also needed the text accuracy to be really high.
So you can't just use the off-the-shelf VAE from Stable Diffusion, because it cannot reconstruct small details. I don't know if you guys ever noticed. Like the hands and the logos. Hands, zoomed-out faces. Yeah, you also need a state-of-the-art VAE, or something like a VAE that's better than the existing one. The existing one's, like, four-channel.
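The four-channel point is about reconstruction budget: a Stable-Diffusion-style VAE downsamples 8x spatially and keeps only 4 latent channels, so each latent value has to summarize dozens of input pixels' worth of values, which is why fine detail like letterforms gets lost. A quick arithmetic sketch (the 8x/4-channel figures match the standard SD VAE; the 16-channel variant is just an illustrative alternative):

```python
def latent_shape(h: int, w: int, channels: int = 4, downsample: int = 8):
    """Spatial shape of a Stable-Diffusion-style VAE latent."""
    return (h // downsample, w // downsample, channels)

def compression_ratio(h: int, w: int, channels: int = 4,
                      downsample: int = 8) -> float:
    """How many input values (H x W x RGB) each latent value must summarize."""
    lh, lw, lc = latent_shape(h, w, channels, downsample)
    return (h * w * 3) / (lh * lw * lc)

ratio_4ch = compression_ratio(1024, 1024)                # 48x compression
ratio_16ch = compression_ratio(1024, 1024, channels=16)  # 12x compression
```

More latent channels lower the ratio, giving the decoder more room for the small structures, text, hands, zoomed-out faces, that a 48x squeeze throws away.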
And so there's all these pieces, and they all interact. And they can all bound the overall performance of your model. And so we basically looked at every single piece. And then, I think like four months ago, we were literally at the whiteboard with the research team.
And there was the non-risky architecture, which was kind of more similar to some of the state-of-the-art open source models that are out these days, like Flux and stuff. And then there was this other architecture that shall not be named. And we were like, well, that's the risky one, where we don't even know if it'll work.
And if we tried it for two or three months and it didn't work, we'd, like, waste compute. It might just, like, blow up, and then we'd be behind. And we just, like, put everything in that basket. Nice. We decided that we had no choice. You know? It was like we were just going to fail if we didn't do it anyway. I think what's remarkable, you are an order of magnitude better
on text and on a lot of these aspects. You're basically SOTA. I think that's really impressive. Can we maybe talk a bit about, as much as you can, how you beat the text encoder? I mean, you teased that out a bit. You basically don't use CLIP, because a traditional Stable Diffusion just uses the last layer. Right?
But you guys have done something completely new, where you allow a basically almost infinite context window, because Midjourney is only 256. And the prompt adherence, like, you can actually talk to it like a designer. So tell us what you can about that.
Yeah. As much as you can tell us about it. I think it's fair to ask the question. Share as much as you want. I think that to kind of get here, there's some obvious things that you would do. The most obvious thing that you would do, you know, is not use CLIP, but the second most obvious thing is kind of like using the tailwinds of what's already happening in language.
You know, like, the language models already so deeply understand everything about text. And so there's some models that use this; you know, they use, like, T5-XXL, which is another embedding, but a much richer embedding of language understanding. I kind of feel like language is just the first thing that happened.
And there's a whole bunch of AI companies that are going to come about, whether they train their models or not, that are just going to benefit from everything that's going on in language and in open source language.
And so I think our model is able to have such great prompt understanding in part because of the big boom in language and all of the stuff that, whether it's Google or Meta or what have you, everyone is doing. And so we can be slightly behind in terms of language for our prompt understanding, because the language stuff is already just so good.
And it will just continue to get better, and our models will also continue to get better. So that might be my one small hint.
Maybe the analogy, from playing with a lot of this and from chatting with you: the current state-of-the-art Stable Diffusion models, their language understanding feels like, in NLP land, Word2Vec. Right? Word2Vec was this paper that came out from Google in 2013, and it didn't really understand text per se. It was more of a latent space.
The famous example was that it could take the vector of king. Mhmm. And then you would subtract the vector of man, and then add the vector of woman. Yeah. And the output would be the vector of queen. Right. Yeah. Which is, like, very basic.
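The king minus man plus woman gives queen arithmetic can be reproduced with toy vectors. Real Word2Vec learns hundreds of dimensions from data; the hand-built 2-D vectors below (a royalty axis and a gender axis) are invented purely to make the arithmetic visible.

```python
import math

# Toy 2-D embeddings: (royalty, gender), with gender +1 male / -1 female.
# Hand-built, not learned -- just enough structure for the analogy.
vocab = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a: str, b: str, c: str) -> str:
    """Solve a - b + c ~= ?, excluding the three query words
    (the standard convention in Word2Vec analogy evaluations)."""
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

result = analogy("king", "man", "woman")  # -> "queen"
```

Here king minus man strips the gender component and leaves pure royalty; adding woman puts the female gender back, landing exactly on queen's vector. That linear structure, rather than any real understanding of text, is what the analogy in the conversation is pointing at.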
I mean, it's still very cool, which I think is kind of where the current Stable Diffusion models before you are. But playing with your model, for the audience, the leap is that you basically got a GPT level of understanding.
It's sort of like the Word2Vec-to-GPT leap, which was, I don't know, like Yeah. I would say it's, like, ten years later. Yeah. I would say it's like a GPT-3-level image model, sort of, in prompt understanding now. Yeah. And I think there's another leap to go. Many more, actually, I would say.
And that's impressive. It's safe to say that this is the worst the model will ever be. For sure. Yeah, for sure. I mean, there's small things that we already want to fix, like we wish that the model understood concepts like film grain. I mean, it could still be better at spatial positioning. The model even has issues with the idea of left and right. Like, put the bear to the left.
What is left? Is it your left? Is it the bear's left? So there's still lots of interesting problems that I think are really fun, that we're probably gonna have to figure out. But what we hear from users is that they feel a strong sense of control now. Like, it has really good prompt adherence.
And actually there was this really funny thing, I think like a week or two ago, that we realized about the model, which is when we started to do evals for aesthetics. And the way we do this is we just show, like, two it's an A/B test. We show users two images, one from maybe a rival of ours and then another image from our model.
And we're constantly doing evals and constantly asking our users what they think so that we can get better. And anyway, one of the things that we realized was that there's this new thing that I don't think has been talked about, but I apologize to the audience if this has been talked about.
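The eval being described is pairwise preference counting: raters see two images for the same prompt, blind to which model made each, and every vote is a head-to-head win. A minimal sketch with invented vote data (the model names and prompts are hypothetical):

```python
from collections import Counter

# Hypothetical rater logs from a blind A/B aesthetics eval:
# (prompt, model whose image the rater preferred).
votes = [
    ("split-pane portrait", "model_b"),
    ("split-pane portrait", "model_b"),
    ("split-pane portrait", "model_a"),
    ("hand-painted palms",  "model_b"),
    ("hand-painted palms",  "model_b"),
    ("surf poster",         "model_a"),
]

def win_rates(votes: list[tuple[str, str]]) -> dict[str, float]:
    """Fraction of head-to-head comparisons each model won."""
    wins = Counter(model for _, model in votes)
    return {model: count / len(votes) for model, count in wins.items()}

rates = win_rates(votes)  # model_a: 1/3, model_b: 2/3
```

The entanglement problem discussed next is exactly what this number hides: a vote against model_a may mean its image was uglier, or that it faithfully rendered a prompt (a split pane, a hand-painted look) the rater simply didn't find pretty, which is why aesthetics and prompt adherence need separate evals.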
But there's a problem: we have entanglement issues, which is that if the model adheres too well to the prompt, it can have an effect on aesthetics. So when we compare ourselves to, say, something like Midjourney, which we've actually evaled, and it has great aesthetics, best in the world at that, one of the problems is that we will get dinged because our model is adhering more.
So I'll give an example. We have an image of a woman, and it's kind of like a split pane, like she's on this side and on this side. So it's a composite. And Midjourney doesn't respect that. It just shows the woman in one frame. The users will always pick that because it's more aesthetically pleasing compositionally versus this split-pane thing.
But our model is adhering to that prompt. Right? And so the users ding us and then we get a lower aesthetic score, even though the other model's not listening. And so there's this entanglement problem. Like, what do you do? We had another image that was like hand-painted palm trees or something.
And the users chose the other model because its images were less hand-painted looking. And the hand-painted ones do look less aesthetic, but our model is adhering. So we have this entanglement problem, and we don't know how to measure ourselves for aesthetics now. If anyone has any literature on this, please send it to me.
But I'm not aware of any literature on this, and so we don't know what to do. I think what it sounds like to me is basically your model is too SOTA; the current evals don't work because it's actually following the rules. Yeah. We're trying to figure out what to do. We have to make a new eval, basically. You're too advanced. It's like it broke the test. Yeah. You kinda broke the test.
And so now it's a little weird externally. Obviously we want to portray to the world, hey, we have this great thing, and okay, we lose here, but not really. So I think we're gonna... But it does what you want. Yeah. But it does what you want.
And so I think we're gonna talk about this in more detail, this kind of entanglement problem, because it's actually, like, a very interesting, more fundamental insight. Yeah. It sounds like you're just building a completely different kind of company.
Like, the thread that comes up hearing everyone here is that using Playground feels like you're talking to a graphic designer Mhmm. Which then in my head actually buckets you into... the companies in YC that are really taking off are the ones that are replacing some form of labor Mhmm.
Which is just different from how people talk about Midjourney. Right? It's not just, like, a tool to play around with. This is actually going to be the replacement for hiring a graphic design team potentially, which is way more commercial.
Right. Yeah. Yeah. I mean, we've been searching for where the utility is. How are people using things like Midjourney? And I think that for me it's actually even simpler. It's just that I think we're enabling the person to have more control over the whole thing.
I always feel bothered when, like, I produce music, and so if I make a song I have to go to a designer and say, can you make me album art? And then I only get like four variations of it, and then I feel bad asking for a fifth if I don't like any of the four.
But the more you just put the person that's actually making the thing in control, the more they'll be able to connect exactly the thing that they're looking for with, you know, the core product or
song or whatever they're making. So at YC, we're always telling founders, hey, you should talk to your users more. And what you did was you had so many users, you couldn't just talk to them. You needed to look at how they were actually using it. And at some point you realized, somewhat uncomfortably,
that they were generating near porn. Near porn, yeah. We get a lot of near porn.
And I think people who are exploring a space often run into that situation. Like, what happens when the users that you're getting aren't the users you actually want? Yeah.
Me and my COO talked about this. We're like, if we listened to what the users wanted, we would have to build a porn company essentially, which is not something that I think my wife would be happy with or my mother. It was kind of this tricky thing where you're like, listen to your users, talk to your users. And look, I'm not saying everybody does that with image models. For sure they don't.
But a lot of them do. And so we had to kind of go ask ourselves, then what can you do with these things? And the answer was, like, not much else. Nothing big and commercial enough. We could make a cool website that people use. And the problem is all the image generator sites are plagued with this problem, and we all know it. We all know. And there are huge safety problems.
And it turned out to be just, like, a business we didn't like. And that's a hard thing after, you know, twelve, eighteen months of working on something. And you're just like, well, I don't really like this that much. And now what? And when we looked around for use cases, we were like, oh, all the use cases have text. All the big ones. Practically all of them.
Logos, posters, t-shirts, bumper stickers, everything. Everything has text, because text is also a way to communicate with humans. That's why it became, like, the number one priority.
I mean, this isn't the first time that you've confronted this issue. You know, in your prior startup Mixpanel, which you built into a company that makes hundreds of millions of dollars a year, one of the leaders in analytics, from a really young age. You know, I think you started it when you were 19, and I remember because I met you when you first started it.
That was another moment where here's this brand new technology, and there are sort of very commercial use cases that you could build a real business on. And then there were other use cases. In that case, I think it was sort of fly-by-night gaming operations that would come and sort of pop up on Facebook, steal a bunch of users, and then disappear.
And you had to make some choices about who you wanted your users to be. Like, do you want it to be people who can actually pay you money for a real product over the long haul? Or sort of, oh yeah, they're here and they're gone and we can make our graph go up. Like it's sort of a quandary that a lot of founders are facing. How did you approach that?
Yeah. I mean, that one's burnt into my memory actually. The simple story was just that we got all these gaming companies back in the gaming heyday of Zynga and RockYou and Slide and all this stuff. And we were making so much money off of them. But then they would die because they had bad retention, or games just have a decay factor.
You could tell that they were going to die because of the retention. Oh, we knew. We
saw it happening. We had, like, all this real-time data on it. And so one day I go to one of my mentors, Max Levchin; I had interned for him at his other company. And I was just like, hey, you know, this thing is happening. And we have all these competitors that are building gaming analytics tools or products, and I don't really know how to compete.
It feels a little weird to just go after gaming when it's this weird thing that's churning. And he just looked at me and was like, what do you think is the biggest market? And I was like, well, probably not gaming, probably, like, the rest of the Internet, you know? And mobile was just starting.
And we didn't really know; the top free app in the App Store on mobile was a mirror app. So it was kind of like, is mobile gonna matter? Maybe it'll be there next year, I hope. But anyway, the rest of the internet.
And he named our competitor and was just like, if your competitor gets bought for $100,000,000 tomorrow, don't cry about it then. Just go after the biggest market. And that's what we did, and then mobile went huge.
It went so big, and we got rid of all the gaming stuff, and that was 100% the right decision. So I think it's just being kind of, like, ruthless almost about where the value is with your own users, with what you're doing. All those things, I think, are very, very important.
I mean, it sounds like you had to close a door, and then God came and opened a window for you.
Yeah. Yeah. I mean, I think we kind of had a similar problem, where serving the current user base that we have is not exactly an exciting thing we want to do as a team. And so then we're hunting for where the rest of the value is.
Yeah. That's a really important lesson. I guess the super big lesson here is you can choose your users or your customers. Yep. You know, often your customers or users choose you. And if you don't want them, it is a choice that you can make. And sometimes it actually allows you to find the, you know, global maximum instead of just the local maximum.
Yeah. You know, we're kinda faced with the same decision. It's like a real-time decision almost. It's fun to talk about things when they work, when your decisions are right. So we'll see years later if this is right.
But I think it's tough, because Midjourney is doing two or three hundred million dollars. But the biggest market in graphic design is probably Canva, doing $2,300,000,000. And so we're just kind of like, well, forget it. Let's go after the biggest, most valuable thing in the world. And not a lot of people know a lot about Canva, I find, in Silicon Valley.
Most people know about Figma, but Canva makes vastly more money than Figma. So by enabling everyone, if you have this amazing, you know, AI graphic designer of sorts, you're enabling, like, so much more of humanity. I think a lot of people believe this, and I believe it too: it feels like AI is certainly expanding the pie of all these markets.
They're not the same size, I think, most of the time. Right? Like, you're enabling more people to write code that otherwise couldn't, you know, that kind of thing. I guess the interesting thing about
Playground is it was also a more radical pivot, because you had gone through YC twice. Yeah. So you went through with Mixpanel, which became this successful company making hundreds of millions of dollars. Then you went through it with Mighty. Mighty. Mhmm. Can you tell us about that second time going through YC? And then what was it?
And then you pivoted into... Yeah. So I did this
company called Mighty, where our goal was to try to, like, stream a browser. And the real goal was to try to make a new kind of computer. And we basically did it, but the problem was that, you know, we hit this wall where I didn't believe that it was gonna be a new kind of computer anymore.
I just couldn't make it more than two times faster, and I just didn't feel like, if I couldn't get a 10x or 5x on this thing, or at least see that it could get a 10x, that it was a company that I wanted to work on anymore. I remember. I had invested before you came back Yeah. Of course. to YC.
And one of the big ideas that really got me was that our MacBook Pros were really sucking at the time. Yeah. They were. Yeah. There was no M1 at the time. Yeah. And I don't think we even knew that Apple was going to release its silicon yet. We had no idea.
I mean, it's interesting. I think that in Silicon Valley, we maybe underestimate how valuable strategy actually is, mainly because strategy is so fun and so interesting. And the MBAs who come into our sector, like, immediately seize on that. It's like, you need a strategy person as, like, a cofounder. And it's like, no. No. No.
We don't actually need that. But that's not to say that strategy is not necessary. In this particular case, I think that we were trying to solve a real problem, which was that our browsers really sucked, and the cloud was getting very, very good. And then suddenly, you know, the maze changed when Apple released its silicon. Well, Apple clearly thought so too. So, you know,
strategy was right in some sense. Like, with the overarching problem of trying to make our computers faster, they were able to make a chip. Yeah. But still, you know, even in the face of the M1, we had kind of convinced ourselves, like, well, it doesn't really matter. The Mac only has, like, 8.3% desktop market share. The rest is Windows.
And I even met the prior CEO of Intel, Bob Swan, and talked to him about why Intel is behind here and all that. And I was trying to figure out, why are AMD and Intel behind? Where is it going? Is anyone even gonna get close to the M1 or not? And so I think one problem is that, like, wanting them to be behind is, like, non-ideal for your company. Right? Like... Don't bet against the macro is the problem.
Yeah. You definitely don't want to bet against that. And then I think the second piece was I sat down with one of the engineers that works on V8, the JavaScript engine behind Chrome. And I gave him every imaginable idea that I or the team had on figuring out how to speed up the browser. And he had an answer to all of them.
Once I realized that the team is basically focused on 1% improvements and they had already tried everything, that was a very depressing moment. Like, I was out of ideas. You know, people say, when is the right time to pivot or change or whatever? And I had just run out of ideas. I really wanted to stick with it, but I just couldn't figure out another way to get there.
We went so far as building a computer in the data center. And we had figured out how to use, like, consumer CPUs in a data center legally with the right architecture. And, like, I think PG came over once, and there was just the sprawl of all these components when we were kind of building physical hardware.
I learned major lessons at Mixpanel, but the one major lesson that I learned at Mighty was that it's so valuable to have a tailwind for your company as opposed to a headwind. There were just so many obstacles in our way, you know, whether it was the M1 or, you know, there's no real way to, like, change the fundamental architecture of the browser. You know?
Like, JavaScript just innately runs single-threaded in a tab. We can't change that. With Playground, it feels like it's all tailwinds all the time. You know? We just, like, wait and things get better. Things get faster, cheaper, better, easier.
The thing that's remarkable is you had this really impressive career building a standard SaaS business with Mixpanel. You gave it a try with the browser, and then you kind of retooled yourself and built this SOTA diffusion model. What was that journey like? How do you retool yourself? That was, like, one of those things that is, like, so impressive.
I just started learning. I don't know. I took whatever AI courses were out there that I could take.
Unfortunately, the Karpathy courses didn't exist back then. But I think, you know, at first, I was trying to actually build, like, a better AI address bar in the browser, which now exists. Google just released that, I think. And this was before GPT-4. I think we were talking to OpenAI. They were very helpful. ChatGPT didn't exist yet.
And we were trying to figure out how to get that integrated in the address bar at low latency. And so I was learning AI, doing AI, learning how to do AI research and train models before all of that happened. But something weird happened, which was, in doing that, in getting kind of connected with the folks at OpenAI and learning these things, I ended up just getting to see it happen.
I knew it was about to happen earlier than other people. I got kind of lucky, I guess. And a lot of people probably remember the DALL-E 2 moment. That was a crazy moment where image generation really was exciting. And so I just kept learning. And then I think Stable Diffusion came out, and maybe I got access two weeks before it came out.
And so just by being in the mix of this thing, I got to see everything about to start. And so I think we were the first AI image generation website that you could, like, go to and sign up, and you didn't have to, like, run it manually on some GPU. And I think our website just took off because of that. That was, like, the easiest thing.
I think Midjourney was still in Discord. It's really, like, what if you make a website?
I didn't actually know that story. I mean, that's a great lesson for any technical founder. Essentially, you stumbled on the biggest tailwind of our generation
by just following technical things you found interesting. That's great. Yeah. It was a little weird after Mixpanel. I actually tried to do, like, an internship at companies, because, I don't know, I wanted to do something but wasn't ready to start a company. And I was only trying to talk to AI companies.
And I interviewed at OpenAI, and they wanted me to come work five days a week, but I only wanted to do three. Then somehow at the end of that I made this huge mistake in, I guess, 2018, where I had decided that there was nothing interesting going on in AI. And even then I was, like, training my own models.
I was, like, trying to help a scooter company detect if the scooters were on the sidewalk or on the road, because the regulation in SF required them to do that. And I learned all this AI stuff, went to all these AI events, and then I just, yeah, concluded there was nothing, and I started Mighty. And I was just off by three months.
And so I kind of feel almost redeemed in some sense. I don't know. It's so hard to time these things. How do you know whether you're early or late? Yeah. And then for a long time you were behind on the model, right, for Playground. And I've just felt continuously behind.
But I've now kind of come to realize after like learning the history of like Microsoft and microprocessors, like, I don't know, it might just be like year two. This all still might be really, really early. We really don't know where it's going. How does it feel to run Playground, which is sort of part startup, part research lab versus just pure startup?
Well, one thing we try to do is we try to differentiate by not trying to go after AGI. That's one thing we say we're not doing, because there's lots of people doing that. The research feels really tractable, I guess, where it's not always clear whether research will be like that. I've kind of learned that you can't do research in a rush.
So one big problem is that when you're building a startup, you wanna ship everything. You wanna ship it today. You wanna fix the bug. You wanna ship the feature. You're just trying to move at such a fast pace, but that's not tenable with research in the same way. Research is moving fast,
but it's not like you can ship your new model. You can't build and ship your model in a week. And so I think that's been, like, really challenging, and I've had to kind of adjust my brain for one team versus the other. Yeah.
One thing I think is interesting about successful research labs in the past, if you look at Bell Labs, for example, is that the lab CEO's main responsibility is almost shielding the lab from, like, the commercial interests that are pushing for, like, things now. Yeah.
But as CEO of Playground, you're kind of both, like, protector of the researchers, but you're also the commercial interest. Like, how do you juggle those competing forces? Yeah. I don't know that I've mastered it yet by any means. But
I think I asked Sam Altman once, you know, to what degree he allowed the researchers at OpenAI to, like, wander, I guess. I just wasn't really sure. You know? Usually, there's, like, a task and you do it. But what about wandering? How does wandering make sense in a research or engineering team?
And he said there's quite a bit of wandering. So I took that to heart, and so I let the research team kind of wander and get to a point where they are able to show an impressive result. And then we kind of start to really accelerate that. But until then, there's not much to do. Well,
not all who wander are lost.
I love that. That should be a t-shirt. That's right. We'll add that as a... Yep. We can link it below in the video. It'll be a creation in the Playground marketplace. Love it. You were asking, like, how do these two teams integrate in a startup?
And I think that we just have this channel now where we see so much feedback that the researchers can actually look into the failures. And they can decide for themselves while wandering, do I wanna fix that? That's surprising, why did that happen? And so I want to try to integrate these two, because I think that that's a more differentiating factor these days.
I think that the research labs are very lab-based, and they're not always deeply looking into real user behavior. What are users really trying to do? But sometimes it's just like, we need to get a high score on this eval, and we gotta put it in the paper. And then we gotta, like, get a really good score on LLM Arena.
And then there's, like, some KPI, you know, to do that. But then, you know, does that thing matter? Does it correlate? Does the eval that we see out in the world strongly correlate to usefulness to users? Like, I still want the LLMs to, like, help me make rap lyrics, but there's no eval for that. So, you know, who will do that? How will that happen? It's certainly possible to do that.
But if you notice, I always pick on this rap lyrics thing, because to me it reveals a fundamental problem with how people are evaluating the models: the models should be extremely good at it, but they're not. Maybe the problem is
there's a gap between the commercialization of research, because all these public evals are academic and a very different
use case than if you wanted to go beat Canva, let's say. Yeah. I mean, I may be talking out of turn here. Sorry to the LLM folks. But if you go look at the evals for the language models, they're all, like, you know, math, biology, legal questions. It's no wonder that the biggest use case of ChatGPT is homework.
Because all the models, you know, basically hit these numbers, right, initially. And maybe they're different now. They're probably more sophisticated now. But it's no wonder that the models are good at homework, and that's a huge category. So you made it to SOTA. People are watching right now and they're just asking, like, how do I do it? What's your answer to that?
There's this feeling that all you need is a lot of data and a lot of compute, and you just train these models and you'll get there. You know? They'll just generalize, and suddenly everything will be great.
I think there are a lot of smart software engineers, and so they fundamentally understand that these are the core ingredients to make, like, this great model. But it's vastly more complex than this. And at least what I've experienced is that to get to SOTA, you basically have to be maniacal about, like, every detail of, you know, the model's capability.
For example, like, you can, like, look at text generation. There are gonna be some people that train their models and they get cool text generation, but the kerning is off. Are you the kind of person that will care about the kerning being off? Or are you the kind of person that is okay with it? Or you don't even notice it? Do you have this maniacal sense? We look at skin texture.
My eyes feel burnt out, basically, from looking at the smallest little skin texture, smoothing it. We talk about these things as a research team day in and day out. We argue about it. To build these SOTA models, you have to care so much about, in our world, image quality.
And, you know, we even look at little small things, like if there's even a slight film grain missing, we go, oh, the captioning model is bad. Not good enough. We need to be better at this. And I think this maniacal mindset, if you do this a hundred times, allows the model to extrapolate even more.
I think people don't quite internalize the extrapolation of all of these dimensions together and how they work together to make everything better. Like, you don't know how making one thing better here will impact, like, another thing over there. We can't. It's hard to understand that. But I think that that's what's required
to get to a SOTA model. And it is possible. It is possible. It is possible. It's not easy though.
It's really hard. Yeah.
Well, Suhail, thanks a lot for coming on the Light Cone. That's all we have time for, but you can try Playground right now at playground.com, or in the app stores on Android and iOS. And this is actually the biggest flex: there wasn't any waitlist. It was just available on day one. So go try it out right now, and we'll see you guys next time.