Thread by Subbarao Kambhampati (కంభంపాటి సుబ్బారావు)
- Tweet
- Apr 5, 2023
- #ChatGPT
Afraid of #GPT4 going rogue and killing y'all? Worry not. Planning has got your back. You can ask it to solve any simple few step classical planning problem and snuff that "AGI spark" well and good.
Let me explain.. 🧵 1/
Almost a year back, intrigued by the breathless "LLMs are Zero Shot reasoners" papers, we tested their ability to autonomously come up with simple plans given domain models. The results were *pretty bleak.*👇 2/
Fast forward to last month, when #GPT4 got released and sparks about AGI capabilities were flying all over #AI twitter. So we went back and checked how it does on those planning benchmarks. 3/
The performance on simple blocks world improved a bit--from ~5% to ~30%.
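For concreteness, plan correctness here means the generated action sequence is executable from the initial state and ends in the goal. A minimal sketch of such a checker for blocks world (action names and the state encoding are my own illustration, not the paper's actual evaluation harness):

```python
# Minimal blocks-world plan validator (a sketch; the STRIPS-style action
# names below are illustrative, not the benchmark's exact encoding).

def apply(state, action):
    """Apply one action to a blocks-world state, or return None if its
    preconditions fail. State: {"on": {block: support}, "clear": set of
    blocks with nothing on top, "holding": block held by the arm or None}."""
    op, *args = action
    on, clear, holding = dict(state["on"]), set(state["clear"]), state["holding"]
    if op == "pickup":            # pick up a clear block from the table
        b, = args
        if holding is not None or b not in clear or on.get(b) != "table":
            return None
        del on[b]; clear.discard(b); holding = b
    elif op == "putdown":         # put the held block down on the table
        b, = args
        if holding != b:
            return None
        on[b] = "table"; clear.add(b); holding = None
    elif op == "stack":           # stack the held block b onto clear block c
        b, c = args
        if holding != b or c not in clear:
            return None
        on[b] = c; clear.discard(c); clear.add(b); holding = None
    elif op == "unstack":         # pick up block b from atop block c
        b, c = args
        if holding is not None or b not in clear or on.get(b) != c:
            return None
        del on[b]; clear.discard(b); clear.add(c); holding = b
    else:
        return None
    return {"on": on, "clear": clear, "holding": holding}

def plan_is_valid(state, plan, goal_on):
    """Simulate the plan; succeed iff every step executes and the goal holds."""
    for action in plan:
        state = apply(state, action)
        if state is None:
            return False
    return all(state["on"].get(b) == s for b, s in goal_on.items())
```

A symbolic checker like this (the real studies use VAL-style plan validators) is what makes the correctness numbers objective, with no LLM in the grading loop.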
But is this because the reasoning improved or because our benchmarks on github became fodder for GPT4 training, and GPT4 is still merrily pattern matching? 4/
So @karthikv792 decided to test with an obfuscated BW domain where domain model words were mapped to other meaning-bearing words that hide connections to blocks.
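The renaming step can be sketched as a systematic whole-word substitution over the domain description; the specific word map below is hypothetical, chosen only to show meaning-bearing words that hide the blocks connection:

```python
import re

# Hypothetical obfuscation map: every blocks-world term is replaced by
# another meaning-bearing word, preserving the domain's logical structure
# while breaking surface-level associations with blocks.
WORD_MAP = {
    "block": "creature", "table": "nest",
    "pickup": "lift", "putdown": "calm",
    "stack": "befriend", "unstack": "estrange",
    "on": "riding", "clear": "lonely", "holding": "soothing",
}

def obfuscate(text, mapping=WORD_MAP):
    """Rename domain-model words via whole-word regex substitution."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)
```

Because the mapping is a bijection on terms, a classical planner sees an isomorphic problem and is unaffected; only pattern-matching on familiar surface forms is disrupted.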
The result? The mighty #GPT4 plan correctness fell to ~3% 😱. Meanwhile those GOFAI STRIPS planners get 100%! 5/
This is why I remain skeptical about all the "LLMs can do reasoning" claims, as the performance may be coming from pattern matching. And showing that requires ever harder tests--especially if the tests keep leaking into next-generation training.. 6/
(The limitation is all about LLMs doing planning in autonomous mode. This is not to say that the guesses hazarded by LLMs can't be useful for other complete planners or humans in the loop who have better semantic models of the task) 7/
Oh yes, we are in the process of updating the arXiv paper with the GPT4 results--along with those on the obfuscated domain. (Although this means that GPT{4+k} will have access to these new tests..🙄 ) 8/
As LLMs grow, paraphrasing Prof. Lambeau👇, there may eventually be just a handful of people who can tell the difference between them memorizing vs. reasoning.
The blue pill/red pill qn of this era may well be: Do you want to be in that handful..🤔 9/
youtu.be/HMMEEQy6pDY?t=170