
Neel Nanda (at ICLR)

neelnanda.io
1 Follower
community-curated profile

Mechanistic Interpretability research @DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's do it!

Neel Nanda (at ICLR) @NeelNanda5 · May 12, 2023
  • From Twitter

I don't think I've mastered the skill of interpreting these visualisations, but they're SO PRETTY. Great work by Catherine Yeh!

Paper May 4, 2023
AttentionViz: A Global View of Transformer Attention
by Martin Wattenberg and 4 others
Neel Nanda (at ICLR) @NeelNanda5 · May 10, 2023
  • From Twitter

Great paper and elegant setup! This is another nice illustration of how it is so, so easy to trick yourself when interpreting LLMs. I would love an interpretability project distinguishing faithful from unfaithful chain of thought! Anyone know what the smallest open source model…

Paper May 7, 2023
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
by Miles Turpin
Neel Nanda (at ICLR) @NeelNanda5 · May 4, 2023
  • From Twitter

Great new work from @ch402 on the differences between representing activations with directions for independently varying features vs a dense yet structureless code. I really appreciate the commitment to writing great exposition!

Article May 4, 2023
Distributed Representations: Composition & Superposition
by Chris Olah
Neel Nanda (at ICLR) @NeelNanda5 · Apr 20, 2023
  • From Twitter

Great article from @DavidSKrueger on the various dynamics that make the community underrate AI X-risks. My favourite line: …

Article Apr 19, 2023
Why do some AI researchers dismiss the potential risks to humanity?
by David Krueger
Neel Nanda (at ICLR) @NeelNanda5 · Apr 15, 2023
  • From Twitter

I really enjoyed @yonashav's paper on how we might create a world where all large training runs can be monitored - feels like the best AI governance proposal I can recall seeing! Makes me optimistic there are policies simple enough to be realistic, but useful enough to matter

Paper Mar 20, 2023
What does it take to catch a Chinchilla? Verifying Rules on Large-Scale Neural Network Training via Compute Monitoring
by Yo Shavit