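Content provided by Machine Learning Street Talk (MLST). All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Machine Learning Street Talk (MLST) or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here: https://zh.player.fm/legal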
ARC Prize v2 Launch! (Francois Chollet and Mike Knoop)

54:15
 

We are joined by Francois Chollet and Mike Knoop to launch the new version of the ARC Prize! In version 2, the challenges have been calibrated with humans so that at least 2 humans could solve each task in a reasonable time, and adversarially selected so that frontier reasoning models can't solve them. The best LLMs today get negligible performance on this challenge.

https://arcprize.org/

SPONSOR MESSAGES:

***

Tufa AI Labs is a brand new research lab in Zurich started by Benjamin Crouzier, focussed on o-series style reasoning and AGI. They are hiring a Chief Engineer and ML engineers, and host events in Zurich.

Go to https://tufalabs.ai/

***

TRANSCRIPT:

https://www.dropbox.com/scl/fi/0v9o8xcpppdwnkntj59oi/ARCv2.pdf?rlkey=luqb6f141976vra6zdtptv5uj&dl=0

TOC:

1. ARC v2 Core Design & Objectives

[00:00:00] 1.1 ARC v2 Launch and Benchmark Architecture

[00:03:16] 1.2 Test-Time Optimization and AGI Assessment

[00:06:24] 1.3 Human-AI Capability Analysis

[00:13:02] 1.4 OpenAI o3 Initial Performance Results

2. ARC Technical Evolution

[00:17:20] 2.1 ARC-v1 to ARC-v2 Design Improvements

[00:21:12] 2.2 Human Validation Methodology

[00:26:05] 2.3 Task Design and Gaming Prevention

[00:29:11] 2.4 Intelligence Measurement Framework

3. o3 Performance & Future Challenges

[00:38:50] 3.1 o3 Comprehensive Performance Analysis

[00:43:40] 3.2 System Limitations and Failure Modes

[00:49:30] 3.3 Program Synthesis Applications

[00:53:00] 3.4 Future Development Roadmap

REFS:

[00:00:15] On the Measure of Intelligence, François Chollet

https://arxiv.org/abs/1911.01547

[00:06:45] ARC Prize Foundation, François Chollet, Mike Knoop

https://arcprize.org/

[00:12:50] OpenAI o3 model performance on ARC v1, ARC Prize Team

https://arcprize.org/blog/oai-o3-pub-breakthrough

[00:18:30] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Jason Wei et al.

https://arxiv.org/abs/2201.11903

[00:21:45] ARC-v2 benchmark tasks, Mike Knoop

https://arcprize.org/blog/introducing-arc-agi-public-leaderboard

[00:26:05] ARC Prize 2024: Technical Report, Francois Chollet et al.

https://arxiv.org/html/2412.04604v2

[00:32:45] ARC Prize 2024 Technical Report, Francois Chollet, Mike Knoop, Gregory Kamradt

https://arxiv.org/abs/2412.04604

[00:48:55] The Bitter Lesson, Rich Sutton

http://www.incompleteideas.net/IncIdeas/BitterLesson.html

[00:53:30] Decoding strategies in neural text generation, Sina Zarrieß

https://www.mdpi.com/2078-2489/12/9/355/pdf
