Speculative Decoding and Efficient LLM Inference with Chris Lott - #717

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

About 1 year ago · 1:16:30

Today, we're joined by Chris Lott, senior director of engineering at Qualcomm AI Research, to discuss accelerating large language model inference. We explore the challenges presented by LLM encoding and decoding (aka generation) and how these interact with hardware constraints such as FLOPS, memory footprint, and memory bandwidth to limit key inference metrics such as time-to-first-token, tokens per second, and tokens per joule. We then dig into a variety of techniques that can be used to accelerate inference, such as KV compression, quantization, pruning, speculative decoding, and leveraging small language models (SLMs). We also discuss future directions for enabling on-device agentic experiences, such as parallel generation and software tools like Qualcomm AI Orchestrator.

The complete show notes for this episode can be found at https://twimlai.com/go/717.

746 episodes

1,751 subscribers

All episodes

Exploring the Biology of LLMs with Circuit Tracing with Emmanuel Ameisen - #727 1:34:06

4 days ago

In this episode, Emmanuel Ameisen, a research engineer at Anthropic, returns to discuss two recent papers: "Circuit Tracing: Revealing Language Model Computational Graphs" and "On the Biology of a Large Language Model." Emmanuel explains how his team developed mechanistic interpretability methods to understand the internal workings of Claude by replacing dense neural network components with sparse, interpretable alternatives. The conversation explores several fascinating discoveries about large language models, including how they plan ahead when writing poetry (selecting the rhyming word "rabbit" before crafting the sentence leading to it), perform mathematical calculations using unique algorithms, and process concepts across multiple languages using shared neural representations. Emmanuel details how the team can intervene in model behavior by manipulating specific neural pathways, revealing how concepts are distributed throughout the network's MLPs and attention mechanisms. The discussion highlights both capabilities and limitations of LLMs, showing how hallucinations occur through separate recognition and recall circuits, and demonstrates why chain-of-thought explanations aren't always faithful representations of the model's actual reasoning. This research ultimately supports Anthropic's safety strategy by providing a deeper understanding of how these AI systems actually work. The complete show notes for this episode can be found at https://twimlai.com/go/727.

Teaching LLMs to Self-Reflect with Reinforcement Learning with Maohao Shen - #726 51:45

11 days ago

Today, we're joined by Maohao Shen, PhD student at MIT, to discuss his paper, “Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search.” We dig into how Satori leverages reinforcement learning to improve language model reasoning, enabling model self-reflection, self-correction, and exploration of alternative solutions. We explore the Chain-of-Action-Thought (COAT) approach, which uses special tokens (continue, reflect, and explore) to guide the model through distinct reasoning actions, allowing it to navigate complex reasoning tasks without external supervision. We also break down Satori’s two-stage training process: format tuning, which teaches the model to understand and utilize the special action tokens, and reinforcement learning, which optimizes reasoning through trial-and-error self-improvement. We cover key techniques such as “restart and explore,” which allows the model to self-correct and generalize beyond its training domain. Finally, Maohao reviews Satori’s performance and how it compares to other models, the reward design, the benchmarks used, and the surprising observations made during the research. The complete show notes for this episode can be found at https://twimlai.com/go/726.

Waymo's Foundation Model for Autonomous Driving with Drago Anguelov - #725 1:09:07

18 days ago

Today, we're joined by Drago Anguelov, head of AI foundations at Waymo, for a deep dive into the role of foundation models in autonomous driving. Drago shares how Waymo is leveraging large-scale machine learning, including vision-language models and generative AI techniques, to improve perception, planning, and simulation for its self-driving vehicles. The conversation explores the evolution of Waymo’s research stack, their custom “Waymo Foundation Model,” and how they’re incorporating multimodal sensor data like lidar, radar, and camera into advanced AI systems. Drago also discusses how Waymo ensures safety at scale with rigorous validation frameworks, predictive world models, and realistic simulation environments. Finally, we touch on the challenges of generalization across cities, freeway driving, end-to-end learning vs. modular architectures, and the future of AV testing through ML-powered simulation. The complete show notes for this episode can be found at https://twimlai.com/go/725.

Dynamic Token Merging for Efficient Byte-level Language Models with Julie Kallini - #724 50:32

25 days ago

Today, we're joined by Julie Kallini, PhD student at Stanford University, to discuss her recent papers, “MrT5: Dynamic Token Merging for Efficient Byte-level Language Models” and “Mission: Impossible Language Models.” For the MrT5 paper, we explore the importance and failings of tokenization in large language models (including inefficient compression rates for under-resourced languages) and dig into byte-level modeling as an alternative. We discuss the architecture of MrT5, its ability to learn language-specific compression rates, its performance on multilingual benchmarks and character-level manipulation tasks, and its efficiency. For the “Mission: Impossible Language Models” paper, we review the core idea behind the research, the definition and creation of impossible languages, the creation of impossible language training datasets, and the bias of language model architectures towards natural language. The complete show notes for this episode can be found at https://twimlai.com/go/724.

Scaling Up Test-Time Compute with Latent Reasoning with Jonas Geiping - #723 58:38

5 weeks ago

Today, we're joined by Jonas Geiping, research group leader at Ellis Institute and the Max Planck Institute for Intelligent Systems, to discuss his recent paper, “Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach.” This paper proposes a novel language model architecture which uses recurrent depth to enable “thinking in latent space.” We dig into “internal reasoning” versus “verbalized reasoning,” analogous to non-verbalized and verbalized thinking in humans, and discuss how the model searches in latent space to predict the next token and dynamically allocates more compute based on token difficulty. We also explore how the recurrent depth architecture simplifies LLMs, the parallels to diffusion models, the model's performance on reasoning tasks, the challenges of comparing models with varying compute budgets, and architectural advantages such as zero-shot adaptive exits and natural speculative decoding. The complete show notes for this episode can be found at https://twimlai.com/go/723.

Imagine while Reasoning in Space: Multimodal Visualization-of-Thought with Chengzu Li - #722 42:11

6 weeks ago

Today, we're joined by Chengzu Li, PhD student at the University of Cambridge, to discuss his recent paper, “Imagine while Reasoning in Space: Multimodal Visualization-of-Thought.” We explore the motivations behind MVoT, its connection to prior work like TopViewRS, and its relation to cognitive science principles such as dual coding theory. We dig into the MVoT framework along with its various task environments: maze, mini-behavior, and frozen lake. We explore token discrepancy loss, a technique designed to align language and visual embeddings, ensuring accurate and meaningful visual representations. Additionally, we cover the data collection and training process, reasoning over relative spatial relations between different entities, and dynamic spatial reasoning. Lastly, Chengzu shares insights from experiments with MVoT, focusing on the lessons learned and the potential for applying these models in real-world scenarios like robotics and architectural design. The complete show notes for this episode can be found at https://twimlai.com/go/722.

Inside s1: An o1-Style Reasoning Model That Cost Under $50 to Train with Niklas Muennighoff - #721 49:29

7 weeks ago

Today, we're joined by Niklas Muennighoff, a PhD student at Stanford University, to discuss his paper, “s1: Simple Test-Time Scaling.” We explore the motivations behind s1, as well as how it compares to OpenAI's o1 and DeepSeek's R1 models. We dig into the different approaches to test-time scaling, including parallel and sequential scaling, as well as s1’s data curation process, its training recipe, and its use of model distillation from Google Gemini and DeepSeek R1. We explore the novel "budget forcing" technique developed in the paper, which allows the model to think longer for harder problems and optimize test-time compute for better performance. Additionally, we cover the evaluation benchmarks used, the comparison between supervised fine-tuning and reinforcement learning, and similar projects like the Hugging Face Open R1 project. Finally, we discuss the open-sourcing of s1 and its future directions. The complete show notes for this episode can be found at https://twimlai.com/go/721.

Accelerating AI Training and Inference with AWS Trainium2 with Ron Diamant - #720 1:07:05

8 weeks ago

Today, we're joined by Ron Diamant, chief architect for Trainium at Amazon Web Services, to discuss hardware acceleration for generative AI and the design and role of the recently released Trainium2 chip. We explore the architectural differences between Trainium and GPUs, highlighting its systolic array-based compute design, and how it balances performance across key dimensions like compute, memory bandwidth, memory capacity, and network bandwidth. We also discuss the Trainium tooling ecosystem, including the Neuron SDK, Neuron Compiler, and Neuron Kernel Interface (NKI). We then dig into the various ways Trainium2 is offered, including Trn2 instances, UltraServers, and UltraClusters, and access through managed services like AWS Bedrock. Finally, we cover sparsity optimizations, customer adoption, performance benchmarks, support for Mixture of Experts (MoE) models, and what’s next for Trainium. The complete show notes for this episode can be found at https://twimlai.com/go/720.

π0: A Foundation Model for Robotics with Sergey Levine - #719 52:30

9 weeks ago

Today, we're joined by Sergey Levine, associate professor at UC Berkeley and co-founder of Physical Intelligence, to discuss π0 (pi-zero), a general-purpose robotic foundation model. We dig into the model architecture, which pairs a vision language model (VLM) with a diffusion-based action expert, and the model training "recipe," emphasizing the roles of pre-training and post-training with a diverse mixture of real-world data to ensure robust and intelligent robot learning. We review the data collection approach, which uses human operators and teleoperation rigs, the potential of synthetic data and reinforcement learning in enhancing robotic capabilities, and much more. We also introduce the team’s new FAST tokenizer, which opens the door to a fully Transformer-based model and significant improvements in learning and generalization. Finally, we cover the open-sourcing of π0 and future directions for their research. The complete show notes for this episode can be found at https://twimlai.com/go/719.

AI Trends 2025: AI Agents and Multi-Agent Systems with Victor Dibia - #718 1:44:59

10 weeks ago

Today, we’re joined by Victor Dibia, principal research software engineer at Microsoft Research, to explore the key trends and advancements in AI agents and multi-agent systems shaping 2025 and beyond. In this episode, we discuss the unique abilities that set AI agents apart from traditional software systems: reasoning, acting, communicating, and adapting. We also examine the rise of agentic foundation models, the emergence of interface agents like Claude with Computer Use and OpenAI Operator, the shift from simple task chains to complex workflows, and the growing range of enterprise use cases. Victor shares insights into emerging design patterns for autonomous multi-agent systems, including graph and message-driven architectures, the advantages of the “actor model” pattern as implemented in Microsoft’s AutoGen, and guidance on how users should approach the “build vs. buy” decision when working with AI agent frameworks. We also address the challenges of evaluating end-to-end agent performance, the complexities of benchmarking agentic systems, and the implications of our reliance on LLMs as judges. Finally, we look ahead to the future of AI agents in 2025 and beyond, discuss emerging HCI challenges, their potential for impact on the workforce, and how they are poised to reshape fields like software engineering. The complete show notes for this episode can be found at https://twimlai.com/go/718.

Speculative Decoding and Efficient LLM Inference with Chris Lott - #717 1:16:30

11 weeks ago

Ensuring Privacy for Any LLM with Patricia Thaine - #716 51:33

11 weeks ago

Today, we're joined by Patricia Thaine, co-founder and CEO of Private AI, to discuss techniques for ensuring privacy, data minimization, and compliance when using third-party large language models (LLMs) and other AI services. We explore the risks of data leakage from LLMs and embeddings, the complexities of identifying and redacting personal information across various data flows, and the approach Private AI has taken to mitigate these risks. We also dig into the challenges of entity recognition in multimodal systems including OCR files, documents, images, and audio, and the importance of data quality and model accuracy. Additionally, Patricia shares insights on the limitations of data anonymization, the benefits of balancing real-world and synthetic data in model training and development, and the relationship between privacy and bias in AI. Finally, we touch on the evolving landscape of AI regulations like GDPR, CPRA, and the EU AI Act, and the future of privacy in artificial intelligence. The complete show notes for this episode can be found at https://twimlai.com/go/716.

AI Engineering Pitfalls with Chip Huyen - #715 57:37

12 weeks ago

Today, we're joined by Chip Huyen, independent researcher and writer, to discuss her new book, “AI Engineering.” We dig into the definition of AI engineering, its key differences from traditional machine learning engineering, the common pitfalls encountered in engineering AI systems, and strategies to overcome them. We also explore how Chip defines AI agents, their current limitations and capabilities, and the critical role of effective planning and tool utilization in these systems. Additionally, Chip shares insights on the importance of evaluation in AI systems, highlighting the need for systematic processes, human oversight, and rigorous metrics and benchmarks. Finally, we touch on the impact of open-source models, the potential of synthetic data, and Chip’s predictions for the year ahead. The complete show notes for this episode can be found at https://twimlai.com/go/715.

Evolving MLOps Platforms for Generative AI and Agents with Abhijit Bose - #714 58:08

14 weeks ago

Today, we're joined by Abhijit Bose, head of enterprise AI and ML platforms at Capital One, to discuss the evolution of the company’s approach and insights on generative AI and platform best practices. In this episode, we dig into the company’s platform-centric approach to AI, and how they’ve been evolving their existing MLOps and data platforms to support the new challenges and opportunities presented by generative AI workloads and AI agents. We explore their use of cloud-based infrastructure (in this case on AWS) to provide a foundation upon which they then layer open-source and proprietary services and tools. We cover their use of Llama 3 and open-weight models, their approach to fine-tuning, their observability tooling for GenAI applications, their use of inference optimization techniques like quantization, and more. Finally, Abhijit shares the future of agentic workflows in the enterprise, the application of OpenAI o1-style reasoning in models, and the new roles and skillsets required in the evolving GenAI landscape. The complete show notes for this episode can be found at https://twimlai.com/go/714.

Why Agents Are Stupid & What We Can Do About It with Dan Jeffries - #713 1:08:49

18 weeks ago