COEY Cast Episode 90

Real Time, Long Form, Big Brains: Qwen, Lucy, and CraftStory

  • Riley Reylers

  • Hunter Glasdow

Episode Overview

01/28/2026

Qwen3-Max-Thinking, Lucy 2.0, and CraftStory are all chasing the same prize: control. This episode unpacks the Qwen3-Max-Thinking hype around Humanity's Last Exam and explains why test-time scaling, tool use, and "open" claims often blur into configuration theater. Then it breaks down what real-time generative video like Lucy 2.0 actually unlocks for live campaigns, and why spatial precision and brand safety are non-negotiable. Finally, it covers CraftStory's five-minute image-to-video with actor consistency, the identity drift problem, and why storyboards become specs, not relics. The throughline: human intent plus automation, least privilege for tools, and creative constraints as the real power move.

Episode Transcript

Hunter: It’s Wednesday, January twenty eighth, twenty twenty-six, and apparently it’s Data Privacy Day and also Fun at Work Day. Which… feels like a threat in the age of AI. This is COEY Cast. I’m Hunter.

Riley: I’m Riley. And yes, this whole episode was basically assembled by a little robot production line. So if we take a weird turn and start debating whether an agent deserves PTO… that’s just the machines getting brave.

Hunter: Today’s big storyline is a three-hit combo. Alibaba drops Qwen3-Max-Thinking, Decart drops Lucy two point oh for real-time generative video, and CraftStory says, “Cool, what if we just do five-minute image-to-video with actor consistency?”

Riley: The industry is like, “Pick your fighter.” And creators are like, “Pick my deadline.” Okay Hunt, let’s start with Qwen, because my feed is just people screaming “HLE” like it’s the Super Bowl.

Hunter: Yeah, the hype number floating around is forty nine point eight percent on Humanity’s Last Exam. And here’s the most likely boring explanation if it doesn’t hold up: the setup.

Riley: Meaning?

Hunter: Meaning tool use. Like, “with search enabled,” or “with a specific prompt format,” or “with test-time scaling cranked.” If one leaderboard run is using web search plus a long think budget, and another is running no tools and short answers, you’re not comparing models, you’re comparing configurations.

Riley: Wait, and some people on X are also posting different scores, right? Like I saw a number in the fifties too. It’s giving… “Which cut of the benchmark are we talking about?”

Hunter: Exactly. It becomes “benchmark as content,” not benchmark as measurement. And I’m not even saying it’s fake. I’m saying the first failure mode is usually boring: different evaluation harness, different tool permissions, different retries, and suddenly the number is a vibe, not a fact.

Riley: So when they say “test-time scaling,” is that real intelligence or just “we bought more thinking tokens”? Because that’s my skepticism.

Hunter: It can be both. Test-time scaling is basically letting the model do more work at inference. More internal deliberation, self-checking, maybe branching and voting, maybe calling tools. The line between “smarter” and “more expensive” is… whether you get better answers per unit of cost and latency.
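As a rough illustration of the "branching and voting" idea and the "better answers per unit of cost" test Hunter mentions, here is a minimal sketch. The function names are hypothetical, and nothing here reflects how Qwen3-Max-Thinking actually implements test-time scaling; it only shows the general shape of sampling several answers, taking the majority, and then asking whether the extra spend bought extra accuracy.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer among several sampled candidates."""
    if not answers:
        raise ValueError("need at least one candidate answer")
    return Counter(answers).most_common(1)[0][0]


def accuracy_per_dollar(correct, total, cost_usd):
    """Accuracy divided by spend: a crude 'smarter or just pricier' check."""
    if total <= 0 or cost_usd <= 0:
        raise ValueError("total and cost_usd must be positive")
    return (correct / total) / cost_usd


# Three sampled answers to the same question; the majority wins.
print(majority_vote(["42", "41", "42"]))
```

If a longer think budget raises accuracy but lowers `accuracy_per_dollar`, the run got more expensive faster than it got smarter, which is the distinction the hosts are drawing.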

Riley: Mmm. Because if your agent now takes thirty seconds to decide what font is “friendly,” you didn’t get smarter, you got slower.

Hunter: Right. For creators and marketers, the practical version is routing. Use fast models for drafts and variations, and only escalate to the “Thinking” mode when the task is high stakes: claims, pricing, regulated stuff, or a complex strategy doc.
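The routing rule described here can be sketched in a few lines. The model tier names and the category labels below are assumptions made for illustration, not real model identifiers or an API from any vendor.

```python
from dataclasses import dataclass

# Hypothetical tier names; substitute whatever models you actually use.
FAST_MODEL = "fast-draft-model"
THINKING_MODEL = "thinking-model"

# Categories the episode calls high stakes (labels are illustrative).
HIGH_STAKES = {"claims", "pricing", "regulated", "strategy"}


@dataclass
class Task:
    description: str
    category: str


def route(task: Task) -> str:
    """Escalate to the slower 'thinking' tier only for high-stakes categories."""
    return THINKING_MODEL if task.category in HIGH_STAKES else FAST_MODEL


print(route(Task("Write ten headline variations", "draft")))
print(route(Task("Check the discount math in the promo email", "pricing")))
```

Keeping the rule as a plain lookup makes the escalation policy easy to audit, which matters more than cleverness when the expensive tier is billed by the token.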

Riley: Okay, but there’s also this “open source first” conversation. People are like, “China’s pushing frontier, open ecosystem, yada yada.” Realistically… who is running Qwen3-Max scale locally?

Hunter: Almost nobody. Like, if it’s truly north of a trillion parameters, you’re not spinning that up on your gaming PC. The “open” story here is usually access and ecosystem compatibility, not that you personally host the whole beast.

Riley: So it’s “open-ish,” but you’re still renting the brain in a cloud.

Hunter: Yeah. And that leads to the procurement joke: who wins when a frontier-ish model is available via cloud APIs but the ecosystem still calls it open? Vendors win because adoption goes up. Customers win if it gives them leverage and optionality. And procurement teams win because they can say, “We’re not locked in,” while still very much being locked into someone’s region, pricing, and quotas.

Riley: “Plausible deniability as a service.”

Hunter: Exactly. Also, if you’re adopting these models into automation, you have to decide what tools they’re allowed to touch. And I have a strong opinion.

Riley: Ohh, spicy. What’s the one tool you’d never let a model touch in production?

Hunter: Anything that can move money. Payments, ad spend, refunds, even gift card issuance. I don’t care how good the guardrails are. The model can draft the plan, propose the budget, even generate the payload… but a human or a deterministic approval service has to hit “send.”
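One way to picture that rule is a gate that lets the model propose any call but holds money-moving ones until a named approver releases them. This is a minimal sketch with hypothetical tool names and classes; a real system would also need authentication, audit logging, and an actual execution layer.

```python
from dataclasses import dataclass, field

# Hypothetical names for tools that can move money.
MONEY_TOOLS = {"payments", "ad_spend", "refunds", "gift_cards"}


@dataclass
class ToolCall:
    tool: str
    payload: dict


@dataclass
class ApprovalGate:
    pending: list = field(default_factory=list)

    def submit(self, call: ToolCall) -> str:
        """Model-proposed calls land here; money tools are never run directly."""
        if call.tool in MONEY_TOOLS:
            self.pending.append(call)
            return "queued_for_approval"
        return "executed"

    def approve(self, index: int, approver: str) -> str:
        """A human or a deterministic service releases a queued call."""
        if not approver:
            raise ValueError("an approver identity is required")
        call = self.pending.pop(index)
        return f"executed {call.tool} approved by {approver}"


gate = ApprovalGate()
print(gate.submit(ToolCall("draft_copy", {"topic": "spring sale"})))
print(gate.submit(ToolCall("ad_spend", {"budget_usd": 500})))
print(gate.approve(0, approver="hunter"))
```

The model can still generate the full payload, as Hunter says; the gate only controls who gets to hit "send."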

Riley: Thank you. Because the moment an agent can buy ads, it will buy ads. It’s like giving a toddler your phone and expecting them not to order thirty dollars of stickers.

Hunter: Exactly. Least privilege forever.

Riley: Okay, let’s jump to Lucy two point oh. This is the one that feels like it’s from the future. Real-time generative video. Like, live.

Hunter: Yeah, Decart’s Lucy two point oh is being pitched as interactive, real-time video generation with minimal lag. And for marketing, the first actually valuable use case isn’t “look, I turned into a dragon on stream.”

Riley: Even though… I would click that.

Hunter: Same. But the real business win is rapid creative iteration during a campaign. Imagine a live product demo stream where the background, the colorway, the props, even the on-screen text treatment can change based on audience behavior without pausing production for an edit cycle.

Riley: Like shoppable streams but the set is procedural. “Chat says make it neon,” and it just… becomes neon.

Hunter: Exactly. Or dynamic localization. Same live host, but the environment and overlays adapt per region feed.

Riley: Okay, but the posts also say Lucy struggles with spatial precision. Like, “tattoo under left eye” becomes “tattoo under both eyes.” In ad land, that is a brand risk machine.

Hunter: Yup. The minimum threshold for spatial correctness is basically: logos and product features must stay stable. If you’re selling shoes and the swoosh moves, you’re done. If you’re selling skincare and the label text mutates, you’re done.

Riley: Or if you’re a beverage brand and the can suddenly becomes a “Can-ish cylinder object.” Comments will roast you into the earth’s mantle.

Hunter: Totally. So the near-term move is using Lucy for vibes and environments, not for exact product truth. Let it generate the world, and keep the product itself either composited, locked, or verified by a post-check.

Riley: And also… infinite variants is dangerous. How do you stop “creative testing” from turning into “infinite off-brand content streaming into the world”?

Hunter: You need constraints. A brand motion bible, prompt templates with allowed ranges, and an approval gate that’s actually enforced. Real-time doesn’t mean no governance, it just means governance has to be fast.
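"Prompt templates with allowed ranges" could look something like the sketch below: audience input is mapped onto a small set of parameters, and anything outside the brand's limits is rejected before a prompt is ever built. The palette names, the motion range, and the prompt wording are all invented for illustration and are not part of Lucy 2.0 or any Decart API.

```python
# Hypothetical brand limits; a real "motion bible" would define these.
ALLOWED_PALETTES = {"brand_blue", "neon", "monochrome"}
MOTION_RANGE = (0.0, 0.6)


def check_params(params):
    """Return a list of brand-rule violations (empty means approved)."""
    problems = []
    if params.get("palette") not in ALLOWED_PALETTES:
        problems.append(f"palette {params.get('palette')!r} is not allowed")
    motion = params.get("motion", 0.0)
    low, high = MOTION_RANGE
    if not low <= motion <= high:
        problems.append(f"motion {motion} is outside {low}-{high}")
    return problems


def build_prompt(params):
    """Fill the template only after every parameter passes the check."""
    problems = check_params(params)
    if problems:
        raise ValueError("; ".join(problems))
    motion = params.get("motion", 0.0)
    return f"Background in {params['palette']} palette, motion intensity {motion:.1f}"


print(build_prompt({"palette": "neon", "motion": 0.4}))
```

Because the check is deterministic and runs in microseconds, it is one way governance can stay "fast" enough for a live stream.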

Riley: So basically, you’re building a DJ set, not an open mic night.

Hunter: Exactly. Real-time generation needs rails.

Riley: Okay, now CraftStory. They’re claiming image-to-video up to about five minutes, with human actor consistency across scenes. That’s the holy grail people keep promising, and then… the face melts by second twenty.

Hunter: The identity drift problem, yeah. My bet is they’re doing some combo of strong identity conditioning plus temporal coherence tricks. Like segmenting the generation into chunks while enforcing a global identity representation across the whole clip.

Riley: So, like, it’s not one giant five-minute generation, it’s more like stitched continuity with a strict “character anchor.”

Hunter: That’s my guess. And the first failure mode people will discover is probably emotion and micro-expression drift. The actor stays “the same person,” but the performance starts to feel off. Or the hands start doing their own independent film.

Riley: Hands always choose chaos.

Hunter: Always. And there’s also the storytelling risk: five minutes of consistent video doesn’t automatically mean five minutes of good video.

Riley: Facts. People are like, “We fired the storyboard.” And I’m like… you just fired the part where the idea becomes coherent.

Hunter: That’s the big question: if longer, consistent AI video becomes cheap and fast, what happens to the pipeline? You don’t delete storyboards. You delete the painful parts of production that exist because coordination is hard. Storyboards become more important as constraints.

Riley: Yeah, storyboard becomes a spec. Like a product requirements doc, but for vibes.

Hunter: Exactly. It’s the human intent layer. Automation handles rendering and variants. Humans handle taste, truth, and direction.

Riley: Okay, quick ecosystem drive-by, because this week is chaos as usual. We’re still in the era of agents and orchestration obsession. We literally just talked about Moltbot doing “chat with hands,” and now Qwen is like, “cool, here’s autonomous tool use at frontier scale.”

Hunter: And on the video side, we just did the Runway Gen four point five conversation about consistency and infinite variants. Lucy pushes that into real time. CraftStory pushes it into long form. It’s like the entire industry is attacking the same bottleneck from different directions: control.

Riley: Control is the new flex. Not “look what I can generate,” but “look what I can generate on purpose.”

Hunter: Exactly. And on Data Privacy Day, it’s also a reminder: if you’re wiring these tools into workflows, be intentional about what leaves your environment. Especially with video feeds, faces, and brand assets.

Riley: Fun at Work Day, but make it “permissioned fun.”

Hunter: Perfectly said.

Riley: Meme caption for today’s episode: “It’s not a video model, it’s a live filter with a god complex.”

Hunter: Mine: “Test-time scaling is just procrastination, but for robots.”

Riley: Stop, that’s too real.

Hunter: Alright, thank you for hanging with us on COEY Cast on this very Data Privacy Day slash Fun at Work Day.

Riley: Go do something fun that does not involve giving an agent access to your credit card.

Hunter: Subscribe if you want more of this. And check out coey.com slash resources for AI news and updates.

Riley: Catch you next time.
