This is a great paper, not only because of its clever control tasks, but also because it connects current evals of LLMs to decades-old debates on psychologism vs. behaviorism.