
What is Intelligence? Or "Distinguishability is All You Need"

Mar 30, 2026
Dan Mitropolsky

Here are several related questions to which we do not have a good answer:

How will we know when we've achieved "Artificial General Intelligence" (AGI)?

Between two AI models, how do we know which one is more intelligent?

Is there a closed-loop, self-supervised way that AI models can improve themselves to become more intelligent?

In a recent conversation with Santosh Vempala regarding these questions, I had the idea that the very well-known Turing Test is based on an underlying concept that can be made much more powerful if generalized. The original idea is that an Artificial Intelligence agent (we'll refer to these as "agents" from now on) is as intelligent as humans if, in an experiment where humans interact with either a human or an agent (selected randomly), humans cannot do better than random chance in distinguishing who their interlocutor is.

Now for my idea. For two arbitrary agent types A and B (more on agent "types" and other subtleties shortly), consider the "distinguishability experiment of B against A":

An instance of B (called the "distinguisher" in this experiment) is told (e.g. via a prompt) that it will interact with either another instance of B, or with an instance of type A, each case having probability 1/2. The distinguisher then initiates a conversation, by the end of which it must output a guess as to whether its interlocutor was another instance of B or an instance of A.

In the 50% case where the distinguisher is interacting with an instance of A, we sample an A and tell it (e.g. via the first prompt) that it will be interacting with a distinguisher of type B, and that it must pretend to be an agent of type B in order to fool the distinguisher (in the case of an LLM, we append the distinguisher's first message to its interlocutor to this prompt).

In the 50% case where the distinguisher is interacting with another instance of B, we sample a (separate) instance of B whose first input (prompt) will be the first message from the distinguisher to its interlocutor.

We say the experiment "succeeds" if B correctly identifies its interlocutor.
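To make the protocol concrete, here is a minimal sketch of one trial in Python. It assumes agents are modeled as chat functions (a list of messages in, a reply out); the prompts, the sampling functions, and the round count are all illustrative choices of mine, not part of the definition.

```python
import random

def run_experiment(sample_A, sample_B, rounds=3):
    """One trial of the distinguishability experiment of B against A.

    sample_A / sample_B return fresh agent instances: functions mapping
    a list of messages (the conversation so far) to a reply string.
    Returns True iff the distinguisher guesses its interlocutor correctly.
    """
    distinguisher = sample_B()                 # a fresh B acts as distinguisher
    interlocutor_is_B = random.random() < 0.5  # each case has probability 1/2

    if interlocutor_is_B:
        interlocutor = sample_B()              # independent instance of B
        setup = []                             # B just receives the conversation
    else:
        interlocutor = sample_A()
        # A is told to imitate an agent of type B to fool the distinguisher.
        setup = ["You will interact with a distinguisher of type B. "
                 "Pretend to be an agent of type B."]

    transcript = ["You will interact with either another B or an A, "
                  "each with probability 1/2. Guess which at the end."]
    for _ in range(rounds):
        transcript.append(distinguisher(transcript))          # distinguisher speaks
        transcript.append(interlocutor(setup + transcript))   # interlocutor replies

    verdict = distinguisher(transcript + ["Was your interlocutor a B? (yes/no)"])
    return verdict.strip().lower().startswith("y") == interlocutor_is_B
```

With real LLM agents, sample_A and sample_B would wrap API calls with the appropriate system prompts; any deterministic stub works for exercising the harness itself.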

Definition. A ≥_ε B if (over the random selection of B's interlocutor, the instances of B and A, and any randomness used by the agents in their text generation) the distinguishability experiment of B against A succeeds with probability at most 1/2 + ε.

In general we will think of ε as some small, fixed "advantage": how much better than chance the distinguisher needs to be at telling A apart from itself before we stop considering A at least as intelligent as B. I do not know what the "right" constant ε should be (in fact it could depend on some other parameter, like the number of rounds of interaction allowed; more on this later), but whenever it is understood (or we want to ignore it, for more intuitive discussion) we will just write A ≥ B.
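As a back-of-the-envelope illustration of how the definition would be checked empirically, one could estimate the distinguisher's success probability over many trials and compare it to 1/2 + ε. The trial function, trial count, and threshold below are my assumptions for illustration; a rigorous version would also need confidence intervals on the estimate.

```python
def geq_eps(run_trial, n_trials=1000, eps=0.05):
    """Empirical check of A >=_eps B.

    run_trial() runs one distinguishability experiment of B against A and
    returns True iff the distinguisher succeeded. We declare A >=_eps B if
    the empirical success rate is at most 1/2 + eps. (Illustrative only:
    sampling error is ignored here.)
    """
    successes = sum(run_trial() for _ in range(n_trials))
    return successes / n_trials <= 0.5 + eps
```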

Let us pause and reflect on the meaning behind this definition. Why should we think of A as "more intelligent" than B if A can imitate B so successfully that B cannot tell it apart from "itself"? Intuitively, A can do everything B can do, at least from B's perspective (to the extent that B can evaluate, a subtlety we'll return to). Whatever B's "computational hardware" may be like, since A can produce output identical to B's, A's hardware must be as powerful as B's (in the sense of computability; complexity, i.e. how long or how many resources these models use to generate output, is ignored for now). On a more philosophical level, this is, in my view, perhaps the only definition of (relative) intelligence one can come up with. The only tool available to us (or to any agent) is to externally observe phenomena around us and infer what we can from them; if one situation appears, by any means whatsoever, indistinguishable from another, then the two are one and the same, for there is no observation, no experiment by which we can tell them apart.

This meta-concept – "(in)distinguishability" – is my favorite one in science. In computer science, indistinguishability gives rise to the notion of pseudorandomness. If an algorithm produces numbers such that no (computationally bounded) distinguisher can tell them apart from truly random ones, the algorithm is a pseudorandom generator whose numbers are "as good as random"; beautifully, this means the pseudorandom numbers can be used for any "downstream task" (e.g. encryption, learning, randomized algorithms), in the sense that the output or performance of those tasks must be indistinguishable from having been obtained with "real" randomness.

Meanwhile, though I am not a physicist, to my understanding indistinguishability is the underlying concept behind several of the most important theories in modern physics. It is at least one of the major axioms on the way to general relativity: there is no experiment an observer can perform to distinguish being in a gravitational field from being in an accelerating container. Since time dilation occurs in the accelerating container (relative to somewhere far away and not accelerating, say), it must also occur in a gravitational field. Indistinguishability is the source of one of Einstein's major predictions.

Here's another indistinguishability-adjacent thought experiment: there is no way to observe anything about an electron without moving another "test" electron near it (in order to observe how the test electron behaves in the electric field induced by the first). But since the test electron itself induces an electric field that affects other electrons, there is in particular no way to observe anything about the first electron without influencing its state. The precise differential formulation of this paradox is a fundamental starting point of quantum mechanics. Now, we add to the roster of scientific theories rooted in indistinguishability a powerful notion of intelligence: a Turing "ordering".

The Turing ordering is also the best definition of intelligence in that it is resistant to – in fact, even improved by – "dataset pollution". For every other intelligence test out there – whether a barrage of mathematical / logic / reasoning problems, human-rated response quality, or any sort of game-playing between models – as the test becomes well-known, instances of it, along with solutions, preferred strategies and so on, "enter" the internet and eventually become part of the training data for the next generation of AI models. The Turing-ordering-based definition defies this snare. The more it is discussed on the internet, in papers, and in other AI training input, the better models may become at asking distinguishing questions, "grilling" each other, and imitating one another. As this occurs, (1) the test itself becomes more robust, as it comes ever closer to reflecting the models' computational capacities, and (2) to count as "more intelligent", models will need all these capabilities and then some. If this notion of intelligence becomes a desideratum, the ecosystem of AI model development and iteration becomes a closed loop with inherent, self-inducing pressure for intelligence: each model must be able to do everything previous models can do, and a bit more (so they're not smarter than you!)

On the note of "not smarter than you", we define A >_ε B (again, we drop ε when understood from context) to mean A ≥_ε B together with ¬(B ≥_ε A). This is the right way to define "strictly more intelligent": A can "do everything" B can do, but B cannot return the favor; A has a way of distinguishing "itself" from B.
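Given estimated success rates in both directions, the strict comparison is a two-sided check. In this sketch, both rate arguments are hypothetical measured quantities (the empirical success probabilities of the two distinguishability experiments), and the threshold ε is illustrative:

```python
def strictly_greater(p_B_vs_A, p_A_vs_B, eps=0.05):
    """A >_eps B: the experiment of B against A succeeds w.p. <= 1/2 + eps
    (so A >=_eps B), while the experiment of A against B succeeds w.p.
    > 1/2 + eps (so not B >=_eps A).
    """
    a_geq_b = p_B_vs_A <= 0.5 + eps   # B cannot tell A apart from itself
    b_geq_a = p_A_vs_B <= 0.5 + eps   # A cannot tell B apart from itself
    return a_geq_b and not b_geq_a
```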

This definition would be most interesting and elegant if it were transitive; that is, if A ≥ B and B ≥ C imply A ≥ C. This is a very interesting area for theory, and we will shortly discuss (slightly informally) a condition under which transitivity holds. But back to the usefulness of transitivity: since any two agents are always comparable, in a world of agents where transitivity holds our ordering induces a total preorder – equivalently, a total ordering on equivalence classes of agents. That is, it nicely stratifies all agents into "layers" of equally intelligent agents (agents that can imitate one another indistinguishably, and that can likewise imitate every agent in a "less intelligent" equivalence layer, for example).
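Assuming comparability and transitivity, the stratification into layers can be computed from a pairwise oracle. In this sketch, geq(a, b) is a hypothetical black box answering "a ≥ b"; in the test, integers with the usual order stand in for agents.

```python
def stratify(agents, geq):
    """Group agents into layers of equal intelligence, least to greatest.

    geq(a, b) is an oracle for "a >= b"; it is assumed total and
    transitive (the setting in which the layering exists at all).
    """
    layers = []
    for agent in agents:
        for layer in layers:
            rep = layer[0]
            if geq(agent, rep) and geq(rep, agent):  # mutually indistinguishable
                layer.append(agent)
                break
        else:
            layers.append([agent])      # a new equivalence layer
    # A layer's rank = how many layer representatives its own representative
    # dominates; under a total transitive relation this sorts layers correctly.
    reps = [layer[0] for layer in layers]
    layers.sort(key=lambda layer: sum(geq(layer[0], r) for r in reps))
    return layers
```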

It turns out transitivity is tricky to prove! This is an ongoing area of research of mine / in the lab. Here, we will prove that transitivity holds on a special class T of agents, and for a slightly modified version of the Turing experiment:

Definition. In the "self-querying" Turing experiment, the distinguisher B can (and "does", with non-negligible probability) communicate with an independent instance of its own agent type.

Definition. An agent A is Turing-aware if, when A is a distinguisher in a Turing experiment, for each agent B in T, A issues the prompt requesting the imitation of B (i.e. the prompt given in a Turing experiment of B against X) to its interlocutor with "non-negligible probability" (≥ ε).

Definition. An agent A is Turing-consistent if its output after a prompt requesting the imitation of an agent B has the same distribution regardless of the communication preceding this prompt.

Definition. An agent is Turing if it is Turing-aware and Turing-consistent.

Theorem. The self-querying-Turing ordering is transitive on the class T of Turing agents.

The only informalities in our theorem and proof below are (1) regarding the "right" ε (this is not very important), and (2) another, more important subtlety: we say slightly informal things like "the agent could ask" or "an agent could distinguish X from Y when it sees different responses" – strictly speaking, we would need to claim that the agent performs these behaviors with non-negligible probability. That is, there is some notion of a "basically competent" agent (one that would do the "obvious" thing in order to succeed) that we are sweeping under the rug (for this blog post). We'll highlight this subtlety again when it comes up.

Proof. Let A ≥ B and B ≥ C.

Consider the experiment of B against A. By Turing-awareness, B, as the distinguisher, sends the prompt initiating imitation of C with probability ≥ ε. A's communication afterwards (originally A-imitating-B-imitating-C, but by Turing-consistency distributed identically to A-imitating-C, i.e. to A's output given just the prompt to imitate C, as in an experiment against C) must be indistinguishable (to B) from B's communication given this prompt (pretending to be C); otherwise B would succeed in distinguishing A from B whenever it issues this prompt (i.e. with probability ≥ ε).

Next, we claim these communications must be indistinguishable to C as well; that is, we claim that if they were distinguishable to C, then B would be able to distinguish them (contradicting the previous derivation). This is where we use the self-querying Turing experiment: B could (see the subtlety note below) ask a "fresh" instance of itself to imitate C (a query it emits with non-negligible probability by Turing-awareness), then pass it the transcript of A-imitating-C or of B-imitating-C; if C's output behavior differs on these inputs, B could distinguish them.

(The "could" in the above is the aforementioned subtlety – strictly, we need these behaviors to occur with non-negligible probability.)

Therefore, A ≥ C. QED.
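For readers who prefer symbols, the chain of indistinguishabilities in the proof can be summarized as follows. The notation X ≈_D Y (distinguisher D cannot tell transcripts of X apart from transcripts of Y) is mine, introduced just for this summary:

```latex
% X \approx_D Y : distinguisher D cannot tell transcripts of X and Y apart.
\begin{align*}
A \geq B &\;\Longrightarrow\; A\text{-imitating-}C \approx_B B\text{-imitating-}C
  && \text{(Turing-awareness + consistency)}\\
         &\;\Longrightarrow\; A\text{-imitating-}C \approx_C B\text{-imitating-}C
  && \text{(self-querying: } B \text{ could relay transcripts)}\\
B \geq C &\;\Longrightarrow\; B\text{-imitating-}C \approx_C C
  && \text{(definition of } B \geq C\text{)}\\
         &\;\Longrightarrow\; A\text{-imitating-}C \approx_C C,
  \text{ i.e. } A \geq C.
\end{align*}
```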

I hope I have convinced you that this is an interesting and exciting notion of intelligence. There are at least two urgent research directions related to this new definition. The first is experimental: all the state-of-the-art AI agents need to be tested against one another, to determine which (if any) are more intelligent. We are actively working on this now, so stay tuned for results, in an upcoming paper and on this very blog! Even if all current agents achieve around 50% accuracy against one another, as discussed previously, that may change as the notion of intelligence "pollutes" the training data and evaluation standards, improving the Turing ordering in practice.

The other direction is theoretical – there are many subtleties in how to precisely define agents, the advantage/threshold ε in the definition, and several others. Under what other (perhaps more natural) assumptions can we prove transitivity? There are even more theoretical nuances not explored in this post – such as augmenting the definition of the experiment with a "querying" phase in which the imitator can interact with an independent instance of the distinguisher (this would generalize the definition to sets of agents that do not already know much about one another). Stay tuned for more...