
COEY Cast Episode 125
Open Source Audio Magic and Consistent AI Characters for Marketers
Episode Overview
03/10/2026
AI video and audio just picked up some serious new superpowers. Hear how Kling 3.0 makes character consistency real enough for UGC-style ad series without turning every cut into a glitchy horror film. Get the real use cases for HiAR long video and LTX-2.3 open weights, plus the hidden costs of running models locally. Learn why Meta's open-source SAM Audio changes podcast cleanup, sonic branding, licensing, and ethics. Explore Ming-Omni-TTS for brand voice at scale and what interruption-friendly, real-time AI audio means for interactive ads. Everything is framed around workflows, human approval loops, and using modular media to ship faster.


Episode Transcript
Hunter: It’s Tuesday, March 10th, 2026, and yes, it’s Mario Day. If you hear a faint “it’s-a me” in the background, that’s not your brain melting, that’s our fully automated AI production pipeline doing a little victory lap. I’m Hunter.
Riley: And I’m Riley. Happy Mario Day, besties. If this episode glitches and suddenly we start speaking in a mysterious new codec, no we didn’t get possessed. It’s just… the machines are collaborating.
Hunter: Today’s chaos menu is stacked. AI video updates all over X, people yelling “Kling 3.0 is live” with character consistency, HiAR doing long smooth generations and it’s open source, LTX-2.3 popping up, and then Meta drops SAM Audio, which is basically “Segment Anything” but for sound. And it’s open-sourced.
Riley: I need everyone to understand what that means emotionally. Like, we had “remove background noise,” and now we have “extract literally any sound” from a messy recording with a text prompt. That’s not a plugin, that’s a superpower.
Hunter: Totally. Let’s start with Kling 3.0 being called “live” and “consistent.” Here’s the first real marketing use case where character consistency actually matters: UGC-style ad series. Not one ad. A series. Same “creator,” same vibe, multiple hooks, multiple angles, and you need the face and vibe to stay the same across cuts.
Riley: Yes. Like the classic “I tried this for a week” narrative. If the person’s face drifts every scene, it becomes an accidental horror short.
Hunter: Exactly. Consistency turns it from “cool clip” to “repeatable asset.” You can do a three-scene ad where scene one is problem, scene two is demo, scene three is payoff, and the person doesn’t shape-shift between shots.
Riley: Okay, but Hunter, where does it still fall apart?
Hunter: Two places. First, product truth. Logos, packaging details, on-screen UI, anything that has to be exact. Second, the moment you ask for very specific blocking, like “hold the bottle, turn label to camera, point at the ingredient list,” the model starts improvising like it’s in an acting class.
Riley: It’s like, “I heard you, but what if we did jazz hands instead?”
Hunter: Yeah. So the play is: let the model handle vibe and motion, but keep your truth anchors human-controlled. Real product shots, real screenshots, real overlays.
Riley: And this is where I’m gonna challenge you. Everybody’s like “prompt writer owns the creative direction now.” I think the editor owns it. Because once you can feed text, image, and video references, the job becomes “choose the right references and police the outputs.”
Hunter: I agree with you more than you think. The prompt writer is important, but creative direction becomes “reference orchestration.” The model is a very confident intern. The editor is the one saying, “No, not that take. Use the one where the character didn’t teleport.”
Riley: Wait, but what about the model improvising between shots? Like multi-shot clips where it invents transitions you didn’t ask for.
Hunter: That’s the new reality. Creative direction becomes a negotiation. You set constraints with references, the model fills gaps, and the editor picks the winning branch. It’s co-creation, not dictatorship.
Riley: Speaking of branches, can we talk HiAR? X is hyping “very long, smooth, high-quality generations without quality drop-off,” plus open-source demos. Are we close to “generate the whole campaign narrative in one take”?
Hunter: We’re close to generating a long draft, not a long final. Like, you can absolutely get a minute-scale sequence that feels coherent… but the gotcha nobody puts in the demo reel is drift in meaning. The scene might stay visually stable, but the story can subtly change.
Riley: Yes! Like the character remains consistent, but suddenly the product benefit mutates. “Hydrating” becomes “healing.” And now you’re in claims jail.
Hunter: Exactly. Long generation is great for pre-vis and concepting. But for campaign narrative, you still want modularity. Generate scenes as chunks you can approve independently, then stitch with intent. It’s like editing a movie, not praying to the render gods.
Riley: Also, open source long video means everyone’s about to try running this locally and then realize they accidentally adopted a small data center.
Hunter: Which brings us to LTX-2.3 vibes. Open weights, runnable locally, sounds like every brand’s dream. But the line between “cheap internal studio” and “you reinvented VFX production” is thin.
Riley: Say it louder. People think local means “free.” Local means “congrats, you’re now IT and post-production.”
Hunter: Totally. The hidden costs are asset management, approvals, and compute scheduling. If you don’t have a pipeline, you end up with a folder called “final_final_use_this_one” full of unusable clips.
Riley: And the person who becomes the bottleneck is always the same person. It’s the one editor who understands the settings and now they’re basically the wizard of your whole marketing team.
Hunter: Yup. So if you go open weights, treat it like a system. You need templates, naming conventions, and a human review loop that doesn’t melt.
Riley: Okay, now the spicy one. Meta's SAM Audio. Isolate any sound from complex mixtures using text prompts, visual prompts, or time spans, and it's open-sourced. Best and worst marketing uses. Go.
Hunter: Best use: podcast cleanup and repurposing. Imagine pulling clean dialogue from a noisy remote recording, then making platform-specific edits fast. Another best use is ad audio post-production: extract the voice, swap music beds, create localized versions without re-recording everything.
Riley: Yes. And also pulling sound effects. Like if you have product ASMR moments, the clicks, snaps, and pours, you can isolate those and build a consistent sonic identity. That's branding people actually feel.
Hunter: Worst use is obvious: forging reality. If you can isolate “any sound,” you can also recombine sound to imply something happened that didn’t. That’s where ethics and governance have to show up.
Riley: And it’s not even always malicious. Sometimes it’s just… a social team being too clever. “Let’s make it sound like the crowd went crazy.” And suddenly you’ve got a fake testimonial vibe.
Hunter: Right. Which leads to the licensing and releases question. If everything is separable by a text prompt, marketers need to think differently about music licensing, talent releases, and the old myth of “you can’t unmix that.”
Riley: The “you can’t unmix that” era is so over. It’s like when people used to say “you can’t reverse an image filter.” Babe, we can.
Hunter: Practical checklist: keep contracts and releases explicit about stem extraction and derivative edits. Store provenance: what was extracted, from what source, for what usage. And be careful with music. Even if you can isolate an instrument, that doesn’t mean you own the rights to use it.
Riley: And for talent, especially voice, it gets extra touchy. Because if I can isolate your voice from a video and remix it, that’s basically a new kind of likeness risk.
Hunter: Exactly. Which is a perfect bridge to Ming-Omni-TTS. Fine-grained control over emotion, style, prosody. The promise is “brand voice at scale,” but the danger is uncanny spokesperson energy.
Riley: The AI voice that sounds like it’s smiling too hard.
Hunter: My practical playbook is: don’t aim for a single perfect synthetic host voice first. Build a voice palette. A couple approved tones, like “friendly informative,” “high energy,” “calm premium.” Then you route scripts to the tone that matches the placement.
Riley: Yes, and keep humans in the loop for the first passes. Also, your scripts matter more when the voice gets better. Bad script plus good voice equals… a very convincing cringe.
Hunter: Totally. And keep guardrails: banned phrases, pronunciation rules, brand vocabulary, and a human approval step before anything paid goes live.
Riley: Now I have to ask about the X buzzing thing: rumored bidirectional audio, interruption-tolerant, real-time voice interactions. What’s the first interactive ad experience that doesn’t feel like a phone tree with better PR?
Hunter: It’s not “talk to an ad.” It’s “talk to a product concierge.” Like an interactive shopping experience where you can interrupt: “Wait, does it come in black?” and it responds instantly, and then shows you the clip variant or the product angle that answers that.
Riley: So it’s like, the ad is a choose-your-own-adventure, but not annoying. More like TikTok comments, but the video replies back in real time.
Hunter: Exactly. And the secret is constraint. You don’t let it talk about everything. It’s a tight domain: pricing, sizing, key objections, shipping, and the one thing your product actually wins on.
Riley: Which is funny because this is where open source versus enterprise gets spicy. Open source keeps winning mindshare, but enterprise whispers “risk.” What’s your realistic checklist?
Hunter: Simple: do you need privacy control, do you need repeatability, do you have the team to maintain it, and do you have governance. If any of those is a “no,” hosted might be the move. If it’s “yes,” open weights can be a strategic advantage, because you can customize and you’re not trapped.
Riley: Also, can your organization handle updates without breaking everything? Because open source means you’re the adult in the room.
Hunter: Facts. And this is where workflow glue matters. Whether you’re using open source video like LTX-2.3, long gen like HiAR, or SAM Audio, you need an orchestration layer. Otherwise, you’re just collecting tools like Pokémon.
Riley: I feel attacked.
Hunter: You should. But lovingly.
Riley: Okay last thing, because it’s Mario Day and I need a power-up metaphor. The real power-up here is not just that the models are better. It’s that video, audio, and voice are becoming modular. You can generate, separate, swap, and version everything.
Hunter: Exactly. And the workflow advantage goes to teams who treat assets like building blocks. Generate fast, isolate what you need, remix responsibly, and keep humans steering the story.
Riley: And if your system gets weird, keep it weird, but keep it reviewed.
Hunter: That’s the whole vibe. Alright y’all, thanks for hanging with us on COEY Cast.
Riley: Subscribe so you don’t miss the next episode where the machines probably try to pitch us a “cinematic brand narrative” and it’s just six shots of a frog drinking boba.
Hunter: And check out COEY.com slash resources for AI news and updates.
Riley: Go celebrate Mario Day by jumping over one tedious task in your workflow and letting automation take it.
Hunter: Catch you next time.
