COEY Cast Episode 110

Digital Humans, Real Risks: Phoenix 4, Cara 3 and Synthetic Faces

  • Riley Reylers
  • Hunter Glasdow

Episode Overview

02/20/2026

AI-powered digital humans just leveled up with Phoenix 4 from Tavus, Cara 3 from Anam.ai, and new multilingual voice cloning from Gnani.ai. This episode digs into what actually breaks when you try to ship real-time video agents, from latency spikes and audio desync to brittle tool calls and flaky approvals. Learn where avatars truly outperform landing pages and chat, how to test conversion impact, and why disclosure, provenance, and consent are non-negotiable. Get a practical take on open versus closed stacks, governance, infinite variants, and why brand editors, creative directors, and ops leaders become more valuable as synthetic spokespeople go mainstream.


Episode Transcript

Hunter: It’s Friday, February twentieth, twenty twenty-six, and apparently it’s National Love Your Pet Day… which feels right, because the AI internet is currently adopting brand new “digital humans” like they’re rescue puppies. This is COEY Cast. I’m Hunter.

Riley: And I’m Riley. Also it’s National Cherry Pie Day, which is perfect because this whole week has been… flaky. Like, exciting, but you touch it and it crumbles into brand risk.

Hunter: Facts. Also, quick heads up, this episode is made end to end by machines. Robots did the prep, the draft, the flow, the whole thing. If it gets a little uncanny, that’s not a bug, that’s the point.

Riley: Yeah, if the conversation suddenly starts “active listening” at you with intense eye contact… blame the pipeline.

Hunter: Speaking of. Big story: Tavus dropped Phoenix-4, a real-time digital human rendering model. They’re pushing micro-expressions, eye contact, low latency, and full HD output. Basically, interactive video agents that don’t feel like a looping GIF of a customer support rep.

Riley: And Anam.ai dropped Cara-3, same vibe. Real-time avatar, super fast response, better lip sync, more emotion. The “video-first AI” crowd is like, “Voice assistants are dead, long live FaceTime with a bot.”

Hunter: Then you’ve got Gnani.ai on the audio side, voice cloning across a bunch of Indian languages, and the word “sovereign” is doing a lot of work there. Regional language performance and data residency are becoming real differentiators.

Riley: Okay Hunt, here’s the question everyone avoids because it’s not sexy. What’s the first boring operational reality that kills the magic when you try to ship one of these digital humans?

Hunter: Reliability. Not the model demo. The whole system in production. Latency spikes, dropped frames, audio desync, and suddenly your “human” looks like they got possessed mid-demo. And it gets worse when you add tool calls. Like, the avatar is fine, but the brain is waiting on your CRM, your knowledge base, your calendar, your permissions.

Riley: Wait, yes. Also approvals. Because marketing teams can barely approve a static landing page hero image without fourteen comments. Now you’re telling legal, brand, and security that the spokesperson is interactive, can say new sentences, and can emote?

Hunter: Exactly. The operational reality is not “can we render pores.” It’s “can we govern behavior.” You need a conversation policy. You need a content boundary. You need escalation paths. And logs. Like real logs, not “trust me bro.”

Riley: And cost. People forget real-time video is expensive. If your top-of-funnel traffic is massive, you might be paying to render a human face for someone who is just doomscrolling and will bounce in three seconds.

Hunter: Yup. So the first place this wins is not broad awareness. It’s high intent moments: sales calls, onboarding, customer success, maybe internal training. Places where a human-like interface actually reduces friction and the traffic volume is bounded.

Riley: Okay, but we keep hearing “micro-expressions and active listening” are the differentiator. How do we prove this converts better than a good landing page plus a chatbot? Not vibes. Not “it felt more human.”

Hunter: You treat it like an experiment. Same offer, same audience, same channel. One experience is standard page and text chat. The other is video agent. Then you look at boring metrics: completion rate, qualified lead rate, time to resolution, and the number of handoffs to a human rep.

Riley: And I’d add: watch replays. Like actually watch sessions. Because the conversion lift, if it happens, will be because the avatar handles objections better, or because users reveal intent faster when they feel like they’re “talking to someone.”

Hunter: Great point. And you can also test “human-ness” without going full deepfake. Even a stylized avatar can outperform if it’s responsive, clear, and has good turn-taking. A lot of uncanny valley comes from timing, not visuals.

Riley: Hold up, timing is huge. The uncanny valley is still alive. It’s just wearing better lighting and a TikTok beauty filter. The tell is always the eyes. The gaze is either too perfect or too vacant. Or the smile hits and the rest of the face doesn’t join the party.

Hunter: Totally. And there’s another tell: interruption etiquette. Humans do these little overlaps and pauses. If the agent talks over you like a podcast co-host who’s had three energy drinks

Riley: Excuse you, I have never

Hunter: then it feels fake instantly. Real-time “active listening” is as much about when not to speak as it is about expressions.

Riley: Okay, infinite variants. Cara-3 style systems can spin up endless spokesperson versions. Where’s the line between personalization and “we just invented infinite brand risk”?

Hunter: The line is when personalization changes claims, tone, or implied endorsement. If you’re swapping language and the face and the vibe to match a segment, that can be fine. But if the agent starts behaving differently for different people, you can accidentally create inconsistent promises.

Riley: Or you accidentally A/B test your way into unethical persuasion. Like, “This version is extra empathetic, this one is extra urgent, this one is flirty.” Congratulations, you reinvented dark patterns, but in HD.

Hunter: And that’s why disclosure norms matter. I’m pretty hardline: if it’s a synthetic spokesperson, label it. Always. Not just in ads. Everywhere the user might reasonably assume it’s a real person.

Riley: People hate labels though. Marketers are like, “But it’ll hurt conversions.” And I’m like… okay, but “don’t ask don’t tell” is just “wait for the scandal” strategy.

Hunter: Exactly. Trust is the moat. If you’re using a video agent for onboarding, say it’s AI. Then make it good. The win isn’t tricking people. It’s serving them faster.

Riley: Let’s talk closed vendors versus open stacks. Who wins once legal and brand teams realize “we don’t control the model” is not a strategy?

Hunter: Near term, closed vendors win distribution because they’re turnkey. You get the avatar, the hosting, the latency tuning, the nice dashboard. But mid term, bigger companies start wanting self-host or at least strong controls. Not because they’re nerds. Because they need auditability, data boundaries, and predictable behavior.

Riley: So like the history of the cloud, right? First everyone goes SaaS because it’s easy. Then the serious teams go hybrid because the stakes rise.

Hunter: Exactly. And this stacks with what we’ve been talking about lately: agents, orchestration, and tool governance. The avatar is just the front end. The real product is your automation system behind it.

Riley: Zooming out for a sec, the ecosystem’s been moving fast. We’ve had model and agent chatter nonstop. Multi-agent workflows are getting normalized, and long-context tool use is making these systems way more capable. At the same time, we’re seeing more legal pressure around training data and rights in audio and media, which makes provenance and disclosure feel less optional. And in video specifically, image-to-video and world-model style hype keeps rising, but the practical winners are the ones that slot into workflows without melting your ops team.

Hunter: And on X, the mood is basically, “Uncanny valley is over,” followed by someone posting a clip where the avatar is slightly too alive and everyone goes, “Actually never mind.” That whiplash is real.

Riley: Okay, synthetic social engineering. If anyone can spin up a persuasive, camera-ready video agent, how do you stop it from becoming the default growth hack?

Hunter: You can’t fully stop it. But you can make it harder to weaponize at scale. Companies need consent workflows for voice and likeness, clear disclosure, and internal rules like “no one-to-one sales outreach from a synthetic human unless the user opted in.”

Riley: And platforms are going to have to enforce provenance signals. Even if detection is imperfect, you need friction. Right now it’s too easy to clone vibes.

Hunter: On the voice side, Gnani’s announcement is a reminder that language nuance is not a “nice to have.” The hardest non-technical problem might actually be executive respect. Like, not treating regional languages as a checkbox. If your voice agent butchers dialect, you lose trust instantly.

Riley: Also consent. Voice cloning is emotionally intimate. People will freak out if they hear “themselves” selling something they didn’t approve. Especially in phone-first markets where WhatsApp and calls drive acquisition.

Hunter: So, last one. What gets automated first, and what becomes more valuable?

Riley: Automated first: basic customer success triage, onboarding walkthroughs, scheduling, FAQ handling, lead qualification. Anything repetitive with a script.

Hunter: More valuable: brand editors, creative directors, and ops people who can design guardrails. Also, honestly, real humans on camera who are trusted. Because as synthetic gets easier, authenticity becomes premium.

Riley: The real flex is going to be, “We use AI everywhere, but we’re honest about it, and we still have taste.”

Hunter: That’s the vibe. Alright, go hug your pet, eat some cherry pie, maybe a muffin if you’re feeling chaotic.

Riley: And please don’t clone your dog’s voice. I’m begging.

Hunter: Thanks for hanging with us on COEY Cast. Subscribe so you don’t miss the next one, and check out COEY.com slash resources for AI news and updates.

Riley: Catch you next time.
