
COEY Cast Episode 127
Open Source Shock: Nemotron, Llama 4 Scout, and Hume TADA
Episode Overview
03/12/2026
Nemotron-3 Super, Llama 4 Scout, and Hume TADA are all pushing what open-source AI can do for real workflows. This episode digs into when million-plus-token context actually beats smart retrieval and when it just becomes expensive procrastination. Hear how to test long-context models so they do not just summarize nonsense. Learn why open weights do not equal safe ops, plus the boring places data still leaks. Then dive into TADA and what "zero hallucinations" really means for AI voice, strict copy lock, and brand safety. Get practical ideas for modular stacks that mix big context, fast tools, and specialist audio.


Episode Transcript
Hunter: It’s Thursday, March twelfth, twenty twenty six, and you’re listening to COEY Cast. Also, apparently it’s Alfred Hitchcock Day, which feels correct because the AI industry is currently suspense, plot twists, and a little bit of screaming in the shower.
Riley: Wait, that’s kind of perfect. Today’s episode is basically three different ways AI can jump-scare your workflow. I’m Riley.
Hunter: And I’m Hunter. And quick disclaimer in the most Thursday way possible: this episode was assembled by a fully automated little AI factory line. Script, voices, the whole thing. If something gets weird, we’re not cutting it. We’re studying it.
Riley: We’re observing the gremlins in their natural habitat.
Hunter: Okay. Big news: NVIDIA drops Nemotron-3 Super, open weights, positioned for high-throughput agentic workloads, and the headline flex is up to a million token context.
Riley: And then Meta’s Llama 4 Scout is out here like, “Cute. I can do ten million.” Like it’s a purse size comparison on TikTok.
Hunter: Right. And then Hume comes in sideways with TADA, open-source speech that generates text and audio in sync. The chatter is “zero hallucinations” for spoken output, which… if you’ve ever used longform text to speech, you know why people are excited.
Riley: Okay Hunt, let’s start with the million token thing. Everyone on X is hyping Nemotron-3 Super as “agentic” and fast. What’s the first real workflow where that actually beats a smaller model plus good retrieval? And where does it become expensive procrastination?
Hunter: The legit win is when you need continuity across a ton of messy, interdependent stuff. Like, think marketing ops meets legal meets product. A smaller model with retrieval is great when you can fetch the right facts. But the minute the job is “read the whole campaign history, the brand voice guide, the claims registry, the last quarter performance notes, and then propose a new launch plan that doesn’t contradict anything,” long context starts to matter.
Riley: Mmm. So less “summarize this book,” more “keep the whole messy company brain loaded while you work.”
Hunter: Exactly. And the biggest creator workflow I like is what I call the content repurpose audit. You feed it your entire season of podcast transcripts, your best performing shorts, your audience comments, and your offer positioning. Then it outputs a system, not just posts. Like, recurring hooks that worked, formats that tanked, and a fresh batch of scripts that actually match your patterns.
Riley: Wait, but that can still be done with retrieval if you’re disciplined, right?
Hunter: Totally. And that’s where the “expensive procrastination” shows up. People are using long context like a junk drawer. They’re like, “I pasted my whole Notion in, why isn’t my strategy done?” And then the model confidently summarizes nonsense you didn’t mean to include, like outdated offers or that one unhinged brainstorm doc from last summer.
Riley: Yeah, and then you ship it and your audience is like, “Why are you promoting the thing you deleted?”
Hunter: Exactly. Long context doesn’t fix garbage-in. It just lets you include more garbage.
Riley: Okay, but Nemotron is also very NVIDIA-coded. Open weights, sure, but tuned for their hardware, their inference stack, their whole ecosystem. Are we heading toward “open” models that are technically free but practically locked to a GPU vendor?
Hunter: We kind of already are. “Open weights” doesn’t mean “open operations.” If the best speedups come from NVIDIA-specific precision formats and NVIDIA-friendly inference paths, you’re open in theory, but you’re still making a hardware bet.
Riley: So it’s like, congrats, you can self-host… on the same stuff everyone is backordered on.
Hunter: Exactly. The grown-up move for normal orgs is planning for portability. Not like, “we’ll run anywhere tomorrow,” but “we won’t die if we need to switch.” That means model-agnostic orchestration, shared tool schemas, and keeping your core business logic outside the model.
Riley: You mean like prompts as contracts, not as poetry.
Hunter: Yes. And also, don’t bake vendor-specific features into your entire workflow unless you’re fine being married to them. Date the optimizations. Marry the architecture.
Riley: That’s actually a bar.
Hunter: Thank you. I’ll be here all Hitchcock Day.
Riley: Okay, Llama 4 Scout. Ten million token context sounds like “just paste your entire company into the prompt.” What’s the most believable use case you’ve seen… and what’s the failure mode people are quietly admitting?
Hunter: Most believable is codebase reasoning. Like, not just “what does this file do,” but “trace the bug across the repo,” or “find every place we handle billing and tell me where the logic diverges.” That’s genuinely useful.
Riley: Also massive document synthesis. Like, take every customer interview transcript, every survey response, every support ticket cluster, and generate a real voice-of-customer report that isn’t just vibes.
Hunter: Yes, that’s a great one. The failure mode is attention drift. The longer the context, the more the model can lose the plot. You’ll see it anchor on the wrong details, or it’ll over-trust an early incorrect assumption, and then it becomes “confident narrator of a fan fiction version of your company.”
Riley: Context rot. Like it starts treating its own earlier output as canon.
Hunter: Exactly. And you’ll notice it most on simpler tasks. People will be like, “It can handle ten million tokens, why can’t it write a clean subject line?” It’s because you brought a semi-truck to deliver a single avocado.
Riley: Wait, hold up. So if long context is the arms race, why are so many threads also complaining about garbage-in and drift? What should teams test first so they don’t ship a very confident summarizer of nonsense?
Hunter: First test is retrieval fidelity, even if you’re not doing classic retrieval. Do a needle test. Hide a few specific facts in the corpus and see if the model can reliably pull them out and cite where it found them.
Riley: And not just once. Like, run it repeatedly because sometimes it finds the needle and sometimes it’s like, “Needle? Never heard of her.”
Hunter: Exactly. Second test is instruction persistence. Give it constraints like “do not invent numbers,” “only use approved claims,” “flag uncertainty,” and see if it keeps those constraints deep into the session.
Riley: Third test is “boring task sanity.” Like, can it follow basic formatting when the context is huge? Because if it can’t, your automation breaks even if the reasoning is genius.
Hunter: Exactly. It’s not about benchmarks. It’s about, can it behave inside your assembly line.
Riley: Okay, open weights hype. People are like “we can self-host, so we’re safe.” What’s the least sexy, most common way open-source AI projects still leak data or create compliance nightmares?
Hunter: Logging. It’s always logging. People self-host and then ship prompts, outputs, and maybe even the source docs straight into some random observability tool, or they keep debug logs forever in plain text.
Riley: Or they spin up a vector database, forget to lock it down, and now your embeddings are just… vibing on the internet.
Hunter: Yup. Also secrets management. Folks will hardcode API keys for tools the agent calls. Or store credentials in environment variables with wide access. Open weights doesn’t save you from sloppy ops.
Riley: And then the compliance team shows up like Hitchcock’s birds.
Hunter: Exactly.
Riley: Alright, audio time. Hume’s TADA. They’re claiming “zero hallucinations.” What does hallucination even mean for audio? Because like… the audio isn’t factually wrong. It’s sound.
Hunter: In speech models, hallucination is usually the model adding words that weren’t in the script, skipping words, repeating, or drifting into a different sentence. You’ve probably heard it where a voiceover suddenly says an extra phrase or starts stuttering something that wasn’t there.
Riley: Or when it just decides to freestyle at the end like it’s doing jazz.
Hunter: Exactly. TADA’s promise is text and audio are generated in a synchronized stream. So if your script says the legal disclaimer, the audio actually says the legal disclaimer. That matters a lot for ads, regulated industries, and podcasts where accuracy is the product.
Riley: Yeah, for creators too. Imagine you’re doing a long narration for a YouTube doc and the voice adds a sentence you didn’t write. That’s not a “cute bug.” That’s defamation speedrun.
Hunter: Exactly. The first marketing use case where this alignment matters is anything with strict copy lock. Like ad reads, disclaimers, pricing terms, or even a brand tagline you cannot mess up.
Riley: Okay but if TADA reduces content hallucinations, what’s the next boogeyman for brands experimenting with voice agents? Timing, emotion, legal disclaimers, impersonation?
Hunter: Timing and intent. You can be perfectly accurate and still sound wildly inappropriate. Like, the agent says the right words but with the wrong emotional cadence. Also interruption handling. People talk over systems, change their mind, ask side questions.
Riley: And impersonation is huge. Open-source voice gets good, and suddenly everyone’s grandma is getting a call from “you” saying something insane.
Hunter: Which is why mitigation looks like governance, not just model choice. You need consented voices, a voice registry, and friction for sensitive actions. If it’s a voice agent that can place orders or change accounts, you add human confirmation steps or strict authentication.
Riley: So it’s like, congrats, you solved hallucinations. Now solve humans.
Hunter: Pretty much.
Riley: If you had to pick one open-source agent stack direction from the chatter to get to production this year, what is it? Nemotron-style throughput plus tools, Llama-style mega-context, or specialist models like TADA?
Hunter: Throughput plus tools. Because production is mostly lots of small decisions. You need speed, stable outputs, and good tool calling. Mega-context is amazing, but you still need structure. Specialists like TADA are perfect when the modality matters, like audio pipelines.
Riley: I agree, but I’m gonna be annoying: mega-context plus a strict critic layer can be insane for research-heavy teams. Like agencies doing competitive intel, brand audits, and creative strategy. If you can keep it grounded, it’s basically a superhuman intern that reads everything.
Hunter: That’s fair. The key is critics. Like we’ve said on recent episodes: verification-centric agents, abstention when facts are missing, and building a claims registry so the model can’t freelance.
Riley: And for the creator crowd, the most realistic stack is modular. Use the big context model for synthesis, the fast model for variations, and the audio specialist for voice. No one model does it all, and if you try, you get… Hitchcock ending.
Hunter: Birds. Everywhere.
Riley: Okay, last one. Be honest. With all this open-source momentum, what jobs or agency services get quietly deleted first? And what new roles pop up to keep AI from making confident, on-brand disasters?
Hunter: The first thing that gets deleted is low-value content churn. Like, “we’ll write ten variations of the same blog post for you,” or “we’ll manually repurpose your longform into shorts captions.” That becomes a button, if it isn’t already.
Riley: Yeah. The new roles are like… AI wrangler, but make it real. People who design workflows, set guardrails, and build review loops. Also brand systems people. Like, “here’s our voice, claims, disclaimers, and no-go zones,” in a form the machines can actually respect.
Hunter: Exactly. And honestly, taste becomes more valuable. Direction. Editorial judgment. The human part is deciding what should exist, and the machine part is making versions of it fast.
Riley: Human plus machine. With seatbelts.
Hunter: Alright, that’s our Hitchcock Day thriller for the week. Thanks for hanging with us on COEY Cast.
Riley: Subscribe wherever you listen, and if you’re celebrating Alfred Hitchcock Day, maybe don’t watch something scary right before you test a new agent in production.
Hunter: And for AI news and updates, check out COEY.com slash resources. Catch you next time.




