Deep Papers podcast

#Science #Tech #Math #Business #Arize AI

33:29

Bespoken Spirits isn’t your typical whiskey distillery. Yes, they’re based in the American bourbon heartland of Lexington, Kentucky, and yes, they often make private label whiskeys for clients. But everything from how Bespoken Spirits distills their whiskey to how they market it is done with the help of AI. Jordan Spitzer, their head of flavor, can finish a whiskey in days instead of years—while precisely crafting its taste—using their machine-learning backed approach. And Wane Lindsey, their director of marketing, credits AI tools with helping his tiny team punch way above their weight. The result is a whiskey that may not be traditional, but still tastes great—and in a fraction of the time it would otherwise take. That’s time they can spend on the creative side of their craft and the work that has the most meaning: building brands and bespoke spirits that people will want to drink. On this episode, Jordan and Wane share how AI has helped them explore creative new ways to make and market whiskey—and why, no matter how smart our tools get, there’s still no substitute for human taste. You can learn more about Bespoken Spirits at bespokenspirits.com ~ ~ ~ Working Smarter is brought to you by Dropbox Dash—the AI universal search and knowledge management tool from Dropbox. Learn more at workingsmarter.ai/dash You can listen to more episodes of Working Smarter on Apple Podcasts , Spotify , YouTube Music , Amazon Music , or wherever you get your podcasts. To read more stories and past interviews, visit workingsmarter.ai This show would not be possible without the talented team at Cosmic Standard : producer Dominic Girard , sound engineer Aja Simpson, technical director Jacob Winik, and executive producer Eliza Smith. Special thanks to our illustrators Justin Tran and Fanny Luor , marketing consultant Meggan Ellingboe , and editorial support from Catie Keck. Our theme song was composed by Doug Stuart . Working Smarter is hosted by Matthew Braga. Thanks for listening!…

Deep Papers

系列首页•Feed

Deep Papers is a podcast series featuring deep dives on today’s most important AI papers and research. Hosted by Arize AI founders and engineers, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning.

57集单集

Deep Papers

33 subscribers

updated 9天之前

系列首页•Feed

57集单集

#Science #Tech #Math #Business #Arize AI

All episodes

Deep Papers

1
Georgia Tech's Santosh Vempala Explains Why Language Models Hallucinate, His Research With OpenAI 31:24

9天之前31:24

31:24

Santosh Vempala, Frederick Storey II Chair of Computing and Distinguished Professor in the School of Computer Science at Georgia Tech, explains his paper co-authored by OpenAI's Adam Tauman Kalai, Ofir Nachum, and Edwin Zhang. Read the paper: Sign up for future AI research paper readings and author office hours. See LLM hallucination examples here for context. Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Atropos Health’s Arjun Mukerji, PhD, Explains RWESummary: A Framework and Test for Choosing LLMs to Summarize Real-World Evidence (RWE) Studies 26:22

4 weeks之前26:22

26:22

Large language models are increasingly used to turn complex study output into plain-English summaries. But how do we know which models are safest and most reliable for healthcare? In this most recent community AI research paper reading, Arjun Mukerji, PhD – Staff Data Scientist at Atropos Health – walks us through RWESummary, a new benchmark designed to evaluate LLMs on summarizing real-world evidence from structured study output — an important but often under-tested scenario compared to the typical “summarize this PDF” task. Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Stan Miasnikov, Distinguished Engineer, AI/ML Architecture, Consumer Experience at Verizon Walks Us Through His New Paper 48:11

7 weeks之前48:11

48:11

This episode dives into " Category-Theoretic Analysis of Inter-Agent Communication and Mutual Understanding Metric in Recursive Consciousness ." The paper presents an extension of the Recursive Consciousness framework to analyze communication between agents and the inevitable loss of meaning in translation. We're thrilled to feature the paper's author, Stan Miasnikov , Distinguished Engineer, AI/ML Architecture, Consumer Experience at Verizon, to walk us through the research and its implications. Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Small Language Models are the Future of Agentic AI 31:15

7 weeks之前31:15

31:15

We had the privilege of hosting Peter Belcak – an AI Researcher working on the reliability and efficiency of agentic systems at NVIDIA – who walked us through his new paper making the rounds in AI circles titled “ Small Language Models are the Future of Agentic AI .” The paper posits that small language models (SLMs) are sufficiently powerful, inherently more suitable, and necessarily more economical for many invocations in agentic systems, and are therefore the future of agentic AI. The authors’ argumentation is grounded in the current level of capabilities exhibited by SLMs, the common architectures of agentic systems, and the economy of LM deployment. The authors further argue that in situations where general-purpose conversational abilities are essential, heterogeneous agentic systems (i.e., agents invoking multiple different models) are the natural choice. They discuss the potential barriers for the adoption of SLMs in agentic systems and outline a general LLM-to-SLM agent conversion algorithm. Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Watermarking for LLMs and Image Models 42:56

12 weeks之前42:56

42:56

In this AI research paper reading, we dive into "A Watermark for Large Language Models" with the paper's author John Kirchenbauer. This paper is a timely exploration of techniques for embedding invisible but detectable signals in AI-generated text. These watermarking strategies aim to help mitigate misuse of large language models by making machine-generated content distinguishable from human writing, without sacrificing text quality or requiring access to the model’s internals. Learn more about the A Watermark for Large Language Models paper . Learn more about agent observability and LLM observability , join the Arize AI Slack community or get the latest on LinkedIn and X . Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Self-Adapting Language Models: Paper Authors Discuss Implications 31:26

15 weeks之前31:26

31:26

The authors of the new paper *Self-Adapting Language Models (SEAL)* shared a behind-the-scenes look at their work, motivations, results, and future directions. The paper introduces a novel method for enabling large language models (LLMs) to adapt their own weights using self-generated data and training directives — “self-edits.” Learn more about the Self-Adapting Language Models paper . Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
The Illusion of Thinking: What the Apple AI Paper Says About LLM Reasoning 30:35

18 weeks之前30:35

30:35

This week we discuss The Illusion of Thinking, a new paper from researchers at Apple that challenges today’s evaluation methods and introduces a new benchmark: synthetic puzzles with controllable complexity and clean logic. Their findings? Large Reasoning Models (LRMs) show surprising failure modes, including a complete collapse on high-complexity tasks and a decline in reasoning effort as problems get harder. Dylan and Parth dive into the paper's findings as well as the debate around it, including a response paper aptly titled "The Illusion of the Illusion of Thinking." Read the paper: The Illusion of Thinking Read the response: The Illusion of the Illusion of Thinking Explore more AI research and sign up for future readings Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Accurate KV Cache Quantization with Outlier Tokens Tracing 25:11

20 weeks之前25:11

25:11

We discuss Accurate KV Cache Quantization with Outlier Tokens Tracing, a deep dive into improving the efficiency of LLM inference. The authors enhance KV Cache quantization, a technique for reducing memory and compute costs during inference, by introducing a method to identify and exclude outlier tokens that hurt quantization accuracy, striking a better balance between efficiency and performance. Read the paper Access the slides Read the blog Join us for Arize Observe Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Scalable Chain of Thoughts via Elastic Reasoning 28:54

23 weeks之前28:54

28:54

In this week's episode, we talk about Elastic Reasoning, a novel framework designed to enhance the efficiency and scalability of large reasoning models by explicitly separating the reasoning process into two distinct phases: thinking and solution . This separation allows for independent allocation of computational budgets, addressing challenges related to uncontrolled output lengths in real-world deployments with strict resource constraints. Our discussion explores how Elastic Reasoning contributes to more concise and efficient reasoning, even in unconstrained settings, and its implications for deploying LRMs in resource-limited environments. Read the paper Join us live Read the blog Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Sleep-time Compute: Beyond Inference Scaling at Test-time 30:24

25 weeks之前30:24

30:24

What if your LLM could think ahead —preparing answers before questions are even asked? In this week's paper read , we dive into a groundbreaking new paper from researchers at Letta, introducing sleep-time compute: a novel technique that lets models do their heavy lifting offline , well before the user query arrives. By predicting likely questions and precomputing key reasoning steps, sleep-time compute dramatically reduces test-time latency and cost—without sacrificing performance. We explore new benchmarks—Stateful GSM-Symbolic, Stateful AIME, and the multi-query extension of GSM—that show up to 5x lower compute at inference, 2.5x lower cost per query, and up to 18% higher accuracy when scaled. You’ll also see how this method applies to realistic agent use cases and what makes it most effective.If you care about LLM efficiency, scalability, or cutting-edge research. Explore more AI research, or sign up to hear the next session live . Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
LibreEval: The Largest Open Source Benchmark for RAG Hallucination Detection 27:19

27 weeks之前27:19

27:19

For this week's paper read, we dive into our own research. We wanted to create a replicable, evolving dataset that can keep pace with model training so that you always know you're testing with data your model has never seen before. We also saw the prohibitively high cost of running LLM evals at scale, and have used our data to fine-tune a series of SLMs that perform just as well as their base LLM counterparts, but at 1/10 the cost. So, over the past few weeks, the Arize team generated the largest public dataset of hallucinations, as well as a series of fine-tuned evaluation models. We talk about what we built, the process we took, and the bottom line results. You can read the recap of LibreEval here. Dive into the research , or sign up to join us next time. Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
AI Benchmark Deep Dive: Gemini 2.5 and Humanity's Last Exam 26:11

29 weeks之前26:11

26:11

This week we talk about modern AI benchmarks, taking a close look at Google's recent Gemini 2.5 release and its performance on key evaluations, notably Humanity's Last Exam (HLE). In the session we covered Gemini 2.5's architecture, its advancements in reasoning and multimodality, and its impressive context window. We also talked about how benchmarks like HLE and ARC AGI 2 help us understand the current state and future direction of AI. Join us for the next live recording , or check out the latest AI research . Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Model Context Protocol (MCP) 15:03

30 weeks之前15:03

15:03

We cover Anthropic’s groundbreaking Model Context Protocol (MCP) . Though it was released in November 2024, we've been seeing a lot of hype around it lately, and thought it was well worth digging into. Learn how this open standard is revolutionizing AI by enabling seamless integration between LLMs and external data sources, fundamentally transforming them into capable, context-aware agents. We explore the key benefits of MCP, including enhanced context retention across interactions, improved interoperability for agentic workflows, and the development of more capable AI agents that can execute complex tasks in real-world environments. Read our analysis of MCP on the blog, or dive into the latest AI researc h . Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
AI Roundup: DeepSeek’s Big Moves, Claude 3.7, and the Latest Breakthroughs 30:23

34 weeks之前30:23

30:23

This week, we're mixing things up a little bit. Instead of diving deep into a single research paper, we cover the biggest AI developments from the past few weeks. We break down key announcements, including: DeepSeek’s Big Launch Week: A look at FlashMLA (DeepSeek’s new approach to efficient inference) and DeepEP (their enhanced pretraining method). Claude 3.7 & Claude Code: What’s new with Anthropic’s latest model, and what Claude Code brings to the AI coding assistant space. Stay ahead of the curve with this fast-paced recap of the most important AI updates. We'll be back next time with our regularly scheduled programming. Dive into the latest AI research Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
How DeepSeek is Pushing the Boundaries of AI Development 29:54

35 weeks之前29:54

29:54

This week, we dive into DeepSeek. SallyAnn DeLucia, Product Manager at Arize, and Nick Luzio, a Solutions Engineer, break down key insights on a model that have dominating headlines for its significant breakthrough in inference speed over other models. What’s next for AI (and open source)? From training strategies to real-world performance, here’s what you need to know. Read our analysis of DeepSeek , or dive into the latest AI research . Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Multiagent Finetuning: A Conversation with Researcher Yilun Du 30:03

37 weeks之前30:03

30:03

We talk to Google DeepMind Senior Research Scientist (and incoming Assistant Professor at Harvard), Yilun Du, about his latest paper, "Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains." This paper introduces a multiagent finetuning framework that enhances the performance and diversity of language models by employing a society of agents with distinct roles, improving feedback mechanisms and overall output quality. The method enables autonomous self-improvement through iterative finetuning, achieving significant performance gains across various reasoning tasks. It's versatile, applicable to both open-source and proprietary LLMs, and can integrate with human-feedback-based methods like RLHF or DPO, paving the way for future advancements in language model development. Read an overview on the blog , watch the full discussion , or join us live for future paper readings . Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Training Large Language Models to Reason in Continuous Latent Space 24:58

40 weeks之前24:58

24:58

LLMs have typically been restricted to reason in the "language space," where chain-of-thought (CoT) is used to solve complex reasoning problems. But a new paper argues that language space may not always be the best for reasoning. In this paper read, we cover an exciting new technique from a team at Meta called Chain of Continuous Thought—also known as "Coconut." In the paper, "Training Large Language Models to Reason in a Continuous Latent Space" explores the potential of allowing LLMs to reason in an unrestricted latent space instead of being constrained by natural language tokens. Read a full breakdown of Coconut on our blog , or join us live for the next paper reading . Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
LLMs as Judges: A Comprehensive Survey on LLM-Based Evaluation Methods 28:57

43 weeks之前28:57

28:57

We discuss a major survey of work and research on LLM-as-Judge from the last few years. "LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods" systematically examines the LLMs-as-Judge framework across five dimensions: functionality, methodology, applications, meta-evaluation, and limitations. This survey gives us a birds eye view of the advantages, limitations and methods for evaluating its effectiveness. Read a breakdown on our blog: https://arize.com/blog/llm-as-judge-survey-paper/ Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Merge, Ensemble, and Cooperate! A Survey on Collaborative LLM Strategies 28:47

45 weeks之前28:47

28:47

LLMs have revolutionized natural language processing, showcasing remarkable versatility and capabilities. But individual LLMs often exhibit distinct strengths and weaknesses, influenced by differences in their training corpora. This diversity poses a challenge: how can we maximize the efficiency and utility of LLMs? A new paper, "Merge, Ensemble, and Cooperate: A Survey on Collaborative Strategies in the Era of Large Language Models," highlights collaborative strategies to address this challenge. In this week's episode, we summarize key insights from this paper and discuss practical implications of LLM collaboration strategies across three main approaches: merging, ensemble, and cooperation. We also review some new open source models we're excited about. Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Agent-as-a-Judge: Evaluate Agents with Agents 24:54

48 weeks之前24:54

24:54

This week, we break down the “Agent-as-a-Judge” framework—a new agent evaluation paradigm that’s kind of like getting robots to grade each other’s homework. Where typical evaluation methods focus solely on outcomes or demand extensive manual work, this approach uses agent systems to evaluate agent systems, offering intermediate feedback throughout the task-solving process. With the power to unlock scalable self-improvement, Agent-as-a-Judge could redefine how we measure and enhance agent performance. Let's get into it! Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Introduction to OpenAI's Realtime API 29:56

49 weeks之前29:56

29:56

We break down OpenAI’s realtime API. Learn how to seamlessly integrate powerful language models into your applications for instant, context-aware responses that drive user engagement. Whether you’re building chatbots, dynamic content tools, or enhancing real-time collaboration, we walk through the API’s capabilities, potential use cases, and best practices for implementation. Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Swarm: OpenAI's Experimental Approach to Multi-Agent Systems 46:46

51 weeks之前46:46

46:46

As multi-agent systems grow in importance for fields ranging from customer support to autonomous decision-making, OpenAI has introduced Swarm, an experimental framework that simplifies the process of building and managing these systems. Swarm, a lightweight Python library, is designed for educational purposes, stripping away complex abstractions to reveal the foundational concepts of multi-agent architectures. In this podcast, we explore Swarm’s design, its practical applications, and how it stacks up against other frameworks. Whether you’re new to multi-agent systems or looking to deepen your understanding, Swarm offers a straightforward, hands-on way to get started. Read a Summary on the Blog Watch on YouTube Sign up for Upcoming Paper Readings Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
KV Cache Explained 4:19

52 weeks之前4:19

4:19

In this episode, we dive into the intriguing mechanics behind why chat experiences with models like GPT often start slow but then rapidly pick up speed. The key? The KV cache. This essential but under-discussed component enables the seamless and snappy interactions we expect from modern AI systems. Harrison Chu breaks down how the KV cache works, how it relates to the transformer architecture, and why it's crucial for efficient AI responses. By the end of the episode, you'll have a clearer understanding of how top AI products leverage this technology to deliver fast, high-quality user experiences. Tune in for a simplified explanation of attention heads, KQV matrices, and the computational complexities they present. Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
The Shrek Sampler: How Entropy-Based Sampling is Revolutionizing LLMs 3:31

1 year之前3:31

3:31

In this byte-sized podcast, Harrison Chu, Director of Engineering at Arize, breaks down the Shrek Sampler. This innovative Entropy-Based Sampling technique--nicknamed the 'Shrek Sampler--is transforming LLMs. Harrison talks about how this method improves upon traditional sampling strategies by leveraging entropy and varentropy to produce more dynamic and intelligent responses. Explore its potential to enhance open-source AI models and enable human-like reasoning in smaller language models. Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Google's NotebookLM and the Future of AI-Generated Audio 43:28

1 year之前43:28

43:28

This week, Aman Khan and Harrison Chu explore NotebookLM’s unique features, including its ability to generate realistic-sounding podcast episodes from text (but this podcast is very real!). They dive into some technical underpinnings of the product, specifically the SoundStorm model used for generating high-quality audio, and how it leverages a hierarchical vector quantization approach (RVQ) to maintain consistency in speaker voice and tone throughout long audio durations. The discussion also touches on ethical implications of such technology, particularly the potential for hallucinations and the need to balance creative freedom with factual accuracy. We close out with a few hot takes, and speculate on the future of AI-generated audio. Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Exploring OpenAI's o1-preview and o1-mini 42:02

1 year之前42:02

42:02

OpenAI recently released its o1-preview, which they claim outperforms GPT-4o on a number of benchmarks. These models are designed to think more before answering and handle complex tasks better than their other models, especially science and math questions. We take a closer look at their latest crop of o1 models, and we also highlight some research our team did to see how they stack up against Claude Sonnet 3.5--using a real world use case. Read it on our blog: https://arize.com/blog/exploring-openai-o1-preview-and-o1-mini Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Breaking Down Reflection Tuning: Enhancing LLM Performance with Self-Learning 26:54

1 year之前26:54

26:54

A recent announcement on X boasted a tuned model with pretty outstanding performance, and claimed these results were achieved through Reflection Tuning. However, people were unable to reproduce the results. We dive into some recent drama in the AI community as a jumping off point for a discussion about Reflection 70B. In 2023, there was a paper written about Reflection Tuning that this new model (Reflection 70B) draws concepts from. Reflection tuning is an optimization technique where models learn to improve their decision-making processes by “reflecting” on past actions or predictions. This method enables models to iteratively refine their performance by analyzing mistakes and successes, thus improving both accuracy and adaptability over time. By incorporating a feedback loop, reflection tuning can address model weaknesses more dynamically, helping AI systems become more robust in real-world applications where uncertainty or changing environments are prevalent. Dat Ngo (AI Solutions Architect at Arize), talks to Rohan Pandey (Founding Engineer at Reworkd) about Reflection 70B, Reflection Tuning, the recent drama, and the importance of double checking your research. Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Composable Interventions for Language Models 42:35

1 year之前42:35

42:35

This week, we're excited to be joined by Kyle O'Brien, Applied Scientist at Microsoft, to discuss his most recent paper, Composable Interventions for Language Models. Kyle and his team present a new framework, composable interventions, that allows for the study of multiple interventions applied sequentially to the same language model. The discussion will cover their key findings from extensive experiments, revealing how different interventions—such as knowledge editing, model compression, and machine unlearning—interact with each other. Read it on the blog: https://arize.com/blog/composable-interventions-for-language-models/ Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges 39:05

1 year之前39:05

39:05

This week’s paper presents a comprehensive study of the performance of various LLMs acting as judges. The researchers leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate them alongside human annotations which they find to have a high inter-annotator agreement. The study includes nine judge models and nine exam-taker models – both base and instruction-tuned. They assess the judge models’ alignment across different model sizes, families, and judge prompts to answer questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold. Read it on the blog: https://arize.com/blog/judging-the-judges-llm-as-a-judge/ Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X .…

Deep Papers

1
Breaking Down Meta's Llama 3 Herd of Models 44:40

1 year之前44:40