
Content provided by Joe Carlsmith. All podcast content (including episodes, graphics, and podcast descriptions) is uploaded and provided directly by Joe Carlsmith or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here: https://zh.player.fm/legal

Full audio for "Scheming AIs: Will AIs fake alignment during training in order to get power?"

6:13:17
 

This is the full audio for my report "Scheming AIs: Will AIs fake alignment during training in order to get power?"
(I’m also posting audio for individual sections of the report on this podcast, but the ordering was getting messed up on various podcast apps, and I think some people might want one big audio file regardless, so here it is. I’m going to be posting the individual sections one by one, in the right order, over the coming days.)
Full text of the report here: https://arxiv.org/abs/2311.08379
Summary here: https://joecarlsmith.com/2023/11/15/new-report-scheming-ais-will-ais-fake-alignment-during-training-in-order-to-get-power


Chapters

1. Full audio for "Scheming AIs: Will AIs fake alignment during training in order to get power?" (00:00:00)

2. 0. Introduction (00:02:14)

3. 0.1 Preliminaries (00:13:02)

4. 0.2 Summary of the report (00:16:52)

5. 0.2.1 Summary of section 1 (00:17:21)

6. 0.2.2 Summary of section 2 (00:19:42)

7. 0.2.3 Summary of section 3 (00:36:13)

8. 0.2.4 Summary of section 4 (00:40:03)

9. 0.2.5 Summary of section 5 (00:51:42)

10. 0.2.6 Summary of section 6 (00:54:53)

11. 1. Scheming and its significance (00:56:12)

12. 1.1 Varieties of fake alignment (00:57:09)

13. 1.1.1 Alignment fakers (00:58:10)

14. 1.1.2 Training-gamers (01:00:10)

15. 1.1.3 Power-motivated instrumental training-gamers, or “schemers” (01:05:54)

16. 1.1.4 Goal-guarding schemers (01:07:29)

17. 1.2 Other models training might produce (01:13:22)

18. 1.2.1 Terminal training-gamers (or, “reward-on-the-episode seekers”) (01:14:01)

19. 1.2.2 Models that aren’t playing the training game (01:16:57)

20. 1.2.2.1 Training saints (01:17:35)

21. 1.2.2.2 Misgeneralized non-training-gamers (01:19:03)

22. 1.2.3 Contra “internal” vs. “corrigible” alignment (01:22:08)

23. 1.2.4 The overall taxonomy (01:23:01)

24. 1.3 Why focus on schemers in particular? (01:23:51)

25. 1.3.1 The type of misalignment I’m most worried about (01:24:29)

26. 1.3.2 Contrast with reward-on-the-episode seekers (01:27:42)

27. 1.3.2.1 Responsiveness to honest tests (01:28:01)

28. 1.3.2.2 Temporal scope and general “ambition” (01:31:09)

29. 1.3.2.3 Sandbagging and “early undermining” (01:34:31)

30. 1.3.3 Contrast with models that aren’t playing the training game (01:40:28)

31. 1.3.4 Non-schemers with schemer-like traits (01:46:27)

32. 1.3.5 Mixed models (01:48:34)

33. 1.4 Are theoretical arguments about this topic even useful? (01:51:50)

34. 1.5 On “slack” in training (01:54:17)

35. 2. What’s required for scheming? (02:00:48)

36. 2.1 Situational awareness (02:01:37)

37. 2.2 Beyond-episode goals (02:09:35)

38. 2.2.1 Two concepts of an “episode” (02:09:45)

39. 2.2.1.1 The incentivized episode (02:09:53)

40. 2.2.1.2 The intuitive episode (02:14:48)

41. 2.2.2 Two sources of beyond-episode goals (02:21:01)

42. 2.2.2.1 Training-game-independent beyond-episode goals (02:22:05)

43. 2.2.2.1.1 Are beyond-episode goals the default? (02:23:59)

44. 2.2.2.1.2 How will models think about time? (02:25:34)

45. 2.2.2.1.3 The role of “reflection” (02:28:42)

46. 2.2.2.1.4 Pushing back on beyond-episode goals using adversarial training (02:31:28)

47. 2.2.2.2 Training-game-dependent beyond-episode goals (02:33:17)

48. 2.2.2.2.1 Can gradient descent “notice” the benefits of turning a non-schemer into a schemer? (02:35:19)

49. 2.2.2.2.2 Is SGD pulling scheming out of models by any means necessary? (02:39:23)

50. 2.2.3 “Clean” vs. “messy” goal-directedness (02:41:48)

51. 2.2.3.1 Does scheming require a higher standard of goal-directedness? (02:50:50)

52. 2.2.4 What if you intentionally train models to have long-term goals? (02:57:57)

53. 2.2.4.1 Training the model on long episodes (02:58:42)

54. 2.2.4.2 Using short episodes to train a model to pursue long-term goals (03:01:52)

55. 2.2.4.3 How much useful, alignment-relevant cognitive work can be done using AIs with short-term goals? (03:06:06)

56. 2.3 Aiming at reward-on-the-episode as part of a power-motivated instrumental strategy (03:14:44)

57. 2.3.1 The classic goal-guarding story (03:15:42)

58. 2.3.1.1 The goal-guarding hypothesis (03:16:48)

59. 2.3.1.1.1 The crystallization hypothesis (03:17:30)

60. 2.3.1.1.2 Would the goals of a would-be schemer “float around”? (03:22:03)

61. 2.3.1.1.3 What about looser forms of goal-guarding? (03:25:10)

62. 2.3.1.1.4 Introspective goal-guarding methods (03:30:59)

63. 2.3.1.2 Adequate future empowerment (03:33:00)

64. 2.3.1.2.1 When is the “pay off” supposed to happen? (03:33:50)

65. 2.3.1.2.2 Even if the model’s values survive this generation of training, will they survive long enough? (03:36:45)

66. 2.3.1.2.3 Will escape/take-over be suitably likely to succeed? (03:40:45)

67. 2.3.1.2.4 Will the time horizon of the model’s goals extend to cover escape/take-over? (03:42:34)

68. 2.3.1.2.5 Will the model’s values get enough power after escape/takeover? (03:44:24)

69. 2.3.1.2.6 How much does the model stand to gain from not training-gaming? (03:45:52)

70. 2.3.1.2.7 How “ambitious” is the model? (03:49:04)

71. 2.3.1.3 Overall assessment of the classic goal-guarding story (03:54:11)

72. 2.3.2 Non-classic stories (03:55:11)

73. 2.3.2.1 AI coordination (03:55:30)

74. 2.3.2.2 AIs with similar values by default (04:00:31)

75. 2.3.2.3 Terminal values that happen to favor escape/takeover (04:02:26)

76. 2.3.2.4 Models with false beliefs about whether scheming is a good strategy (04:06:33)

77. 2.3.2.5 Self-deception (04:08:08)

78. 2.3.2.6 Goal-uncertainty and haziness (04:10:21)

79. 2.3.2.7 Overall assessment of the non-classic stories (04:12:54)

80. 2.4 Take-aways re: the requirements of scheming (04:14:43)

81. 2.5 Path dependence (04:15:26)

82. 3. Arguments for/against scheming that focus on the path that SGD takes (04:18:57)

83. 3.1 The training-game-independent proxy-goals story (04:21:01)

84. 3.2 The “nearest max-reward goal” story (04:25:37)

85. 3.2.1 Barriers to schemer-like modifications from SGD’s incrementalism (04:30:44)

86. 3.2.2 Which model is “nearest”? (04:32:14)

87. 3.2.2.1 The common-ness of schemer-like goals in goal space (04:32:51)

88. 3.2.2.2 The nearness of non-schemer goals (04:36:06)

89. 3.2.2.3 The relevance of messy goal-directedness to nearness (04:41:15)

90. 3.2.3 Overall take on the “nearest max-reward goal” argument (04:42:53)

91. 3.3 The possible relevance of properties like simplicity and speed to the path SGD takes (04:43:45)

92. 3.4 Overall assessment of arguments that focus on the path SGD takes (04:45:56)

93. 4. Arguments for/against scheming that focus on the final properties of the model (04:47:14)

94. 4.1 Contributors to reward vs. extra criteria (04:47:50)

95. 4.2 The counting argument (04:50:24)

96. 4.3 Simplicity arguments (04:57:11)

97. 4.3.1 What is “simplicity”? (04:57:22)

98. 4.3.2 Does SGD select for simplicity? (05:01:06)

99. 4.3.3 The simplicity advantages of schemer-like goals (05:02:35)

100. 4.3.4 How big are these simplicity advantages? (05:04:39)

101. 4.3.5 Does this sort of simplicity-focused argument make plausible predictions about the sort of goals SGD selects? (05:13:07)

102. 4.3.6 Overall assessment of simplicity arguments (05:15:36)

103. 4.4 Speed arguments (05:16:10)

104. 4.4.1 How big are the absolute costs of this extra reasoning? (05:18:02)

105. 4.4.3 Can we actively shape training to bias towards speed over simplicity? (05:21:01)

106. 4.4.2 How big are the costs of this extra reasoning relative to the simplicity benefits of schemer-like goals? (05:22:45)

107. 4.5 The “not-your-passion” argument (05:26:07)

108. 4.6 The relevance of “slack” to these arguments (05:28:27)

109. 4.7 Takeaways re: arguments that focus on the final properties of the model (05:29:19)

110. 5. Summing up (05:30:49)

111. 6. Empirical work that might shed light on scheming (05:45:50)

112. 6.1 Empirical work on situational awareness (05:50:50)

113. 6.2 Empirical work on beyond-episode goals (05:52:20)

114. 6.3 Empirical work on the viability of scheming as an instrumental strategy (05:55:45)

115. 6.4 The “model organisms” paradigm (05:57:30)

116. 6.5 Traps and honest tests (05:58:46)

117. 6.6 Interpretability and transparency (06:02:06)

118. 6.7 Security, control, and oversight (06:03:51)

119. 6.8 Other possibilities (06:06:24)
