Right now, there is no AI that can outperform humans at meaningful research and analysis
What is research?
When I say “research”, I mean “analysing information and producing a document”.
Research can have different purposes. Here are three idealised categories:
- Providing a recommendation on a specific real-world decision. For example, I work as a fund manager on the EA Animal Welfare Fund. Here, the question is something like “If we give $X to this applicant to carry out a specific project, will that help animals more than keeping the money and funding a different project later?” The resulting document might be a simple one-sentence recommendation to the fund chair (“yes, fund this application”, “no, do not fund this application”, or “make some modifications to the project then fund it”). This research is often focused on answering cruxes (crucial but unknown pieces of information) that are identified in advance.
- Blue-skies research. This is the “classical” academic research of simply learning stuff about the world for its own sake. My PhD supervisor calls this, somewhat pejoratively, “studying the green tree frog’s left toenail”. The resulting document would likely be a peer-reviewed publication in a more traditional academic journal. Obviously, this research can have immense value, though the value is harder to predict in advance.
- Persuading somebody to do something. For example, there are many think tanks focused not on dispassionately answering a question, but on advocating for a specific position. There’s nothing wrong with this (every successful social movement or political party has somebody doing work like this, as do many for-profit companies), but it’s a different process from more truth-seeking research; it is more mercenary, as the answer is determined in advance and the actual research effort is spent crafting a convincing argument to persuade the reader. Here, the resulting document might be a white paper, a briefing report for a politician, or a media release.
Of course, research can also exist in between categories, and there are many examples of publications that share characteristics of two or three of the above categories.
All three of these categories can have value if done for the right purposes (to help others). I emphasise this because the third category (persuasive research) might seem unscientific, as there is a popular perception that scientific research is and should be objective. However, in practice, all three categories of research involve some degree of truth-seeking and some degree of persuasion; scientists are people acting in the world, just like politicians or advocates are. Even the most dispassionate scientist needs to persuade somebody to fund their work. Certainly, there are for-profit industries funding research in the third category (persuasive) and pretending it’s the second category (learning about stuff just for fun). I don’t think any experienced advocate working at the intersection between science and policy genuinely believes that science is some objective process.
Why do organisations hire researchers?
Notice that all three categories of research that I described above share an important characteristic: they engage with reality.
- If I conduct research to advise a specific decision, then I’m essentially trying to predict, on the basis of our current limited information, what the best decision will actually be in reality.
- If I conduct research trying to determine the chemical composition of a green tree frog’s left toenail, then I’m essentially trying to predict, on the basis of empirical tools and methods (= current limited information, plus information we can readily access by observation), what the actual chemical composition is in reality. (This is why hypothesis testing and falsification have such immense value for society; they tell us stuff about the actual reality we live in.)
- If I conduct research to try to convince the Prime Minister of Australia to change the Australian currency to the “dollarydoo”, then I’m essentially trying to predict, on the basis of our current limited information, what document will have the highest probability of actually convincing the Prime Minister to make this policy decision in reality.
Certainly, there are situations where engaging with reality in this way is a fool’s errand. For example, a beef farmers’ union CEO might think “It would be valuable to convince a supermarket’s procurement department that domestically produced beef is more attractive to customers than imported beef. However, I have reason to believe that there is no document that can convince the procurement department of this. Therefore, there is no point doing the research.” Alternatively, a physicist might think “It does not appear likely that I’ll actually be able to succeed in creating a miniature time machine in the laboratory. Therefore, there is no point in doing the research.”
The fact that research can be pointless is captured by the concept of “value of information”. If an employer, say, is figuring out whether they should invest in research by hiring a researcher to do a specific research task, they need to answer the question: “Is the value of this research higher than the cost?”
If the employer thinks that the answer to this question is “yes”, then they’re essentially betting time and money (e.g. the costs of hiring a researcher) on the hypothesis that the researcher will actually succeed in engaging with reality in a specific way. The employer need not be certain that the researcher will succeed in engaging with reality in this way; there are many examples of long-shot research projects that seem really hard but, due to potentially large payoffs, are still worth having a go at. But the probability of success and the value of success, together, need to outweigh the cost of the research.
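As a toy illustration of this wager (with entirely made-up numbers; real decisions involve far messier estimates), the employer’s calculation can be sketched as a simple expected-value comparison:

```python
# A toy sketch of the employer's wager. All numbers are hypothetical;
# "value" is whatever unit the employer cares about (dollars, animals helped, etc.).

cost_of_research = 20_000      # e.g. a few months of a researcher's time (made up)
p_success = 0.4                # probability the researcher genuinely engages with reality
value_if_success = 100_000     # value of acting on a correct answer rather than guessing

expected_value = p_success * value_if_success

if expected_value > cost_of_research:
    print(f"Worth the wager: expected value {expected_value} exceeds cost {cost_of_research}")
else:
    print(f"Not worth it: expected value {expected_value} does not cover cost {cost_of_research}")
```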
To be more precise, by hiring a researcher, an employer is wagering that the researcher will be more successful than the next best alternative for making the decision and/or writing the resulting document.
- It’s perfectly possible to make a funding decision by wearing a blindfold and throwing a dart at a dartboard labelled with the words “YES” and “NO”.
- It’s perfectly possible for the Australian federal government to dissolve its new space agency and, instead, use the money to lock a dozen lucky volunteers in a warehouse full of various hallucinogens and rocket components.
- It’s perfectly possible for a beef farmers’ union to just download an existing document from some other company and email that document to the supermarket instead.
- It’s perfectly possible to simply wait for divine revelation on the chemical composition of amphibian fingernails.

In all four cases, the proposed alternative is possible; in fact, each has been tried at various points in history. But all four of these alternatives, while certainly possible, are less effective than hiring a researcher to do the job.
To reiterate: the decision to do research represents a wager on whether the researcher will succeed in engaging with reality in a specific way.
Can AIs and large language models engage with reality?
There are lots of different types of artificial intelligences (AIs). Over the past couple of decades, the pace of development of these AIs has been astonishing.
Over the past 3-5 years, much of the attention from the general public has been focused on large language models (LLMs). This category includes ChatGPT, Gemini, Claude, and their friends. A great overview of LLMs (and AI more generally) is available in this article by Our World in Data.
An LLM is essentially an algorithm that takes a sequence of words and predicts the next words in the sequence. The value of LLMs (e.g. chatbots, answering exam questions, writing essays, translating documents) derives from applying this basic functionality in creative ways.
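To make this concrete, here is a minimal sketch of next-word prediction, assuming the Hugging Face transformers library and the small, openly available GPT-2 model (modern chatbots are vastly larger and further fine-tuned, but the underlying mechanic is the same):

```python
# Minimal next-token prediction sketch using the small GPT-2 model.
# Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The chemical composition of a frog's toenail is"  # arbitrary example prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits   # shape: (batch, sequence_length, vocabulary_size)

next_token_scores = logits[0, -1]     # scores for every candidate next token
top5 = torch.topk(next_token_scores, k=5).indices
print([tokenizer.decode([idx]) for idx in top5.tolist()])  # the five most likely next words
```

Everything else an LLM does, from chatting to essay writing, is built by repeating this next-word prediction step over and over.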
LLMs have shown an astonishing ability to, for example, answer standardised test questions. You can find lots of graphs on the internet that show something like “In 2020, LLMs could answer physics questions to the level of a high-school student. In 2024, LLMs can answer physics questions to the level of a PhD student.” (I made those figures up, but this information is all over the internet if you search.)
Can LLMs engage with reality? Sorta.
Certainly, LLMs engage with reality to some extent. LLMs are pre-trained on very large datasets (internet data), which are themselves part of reality. LLMs answer questions posed by plucky young students, who are themselves part of reality. In this sense, LLMs do engage with reality.
LLMs also exhibit behaviour that can be interpreted as complex, abstract reasoning. For example, all modern LLMs are perfectly capable of answering a question like “Explain the relevance of bread to the eventual outcome of the French Revolution” or “Who is most likely to be the president of China in the year 2040?”
However, this is not the same as the type of “engagement with reality” that is required for meaningful research as I’ve defined it above.
LLMs appear to be most successful on mundane questions and, especially, questions where the answer is already known. LLMs are certainly stronger than me at answering questions about basic physics, for example. LLMs can also just regurgitate verifiable information (e.g. estimated number and scope of famines in the decade immediately prior to the French Revolution; previous trends in the transition of power in the Chinese government) and even synthesise this information into plausible-sounding conclusions (e.g. past trends in Chinese government + current Chinese office-holders = a reasonable prediction about the identity of the Chinese president in 2040).
But what about the type of questions that researchers are actually paid to answer on a day-to-day basis? To give some examples from my own work:
- Funding decisions: Should we give a one-million-dollar grant to a large animal advocacy organisation to re-grant for a specific type of cage-free project that appears time-sensitive right now? Should we fund an engineer to develop a slightly cheaper technology for implementing electrical fish stunning aboard harvest vessels?
- Advocacy decisions: Is it better for animals for animal advocates in Brazil to focus their efforts on a) improving water quality on tilapia farms or b) supporting efforts to exterminate the New World screwworm?
- Persuasion: Is the government of Uganda more likely to support a new piece of animal welfare legislation if we emphasise the environmental benefits, the economic benefits, or the human health benefits?
A modern LLM can certainly answer all of the above questions. The answer might even sound plausible.
But an LLM cannot answer any of the above questions in a way that meaningfully engages with reality. All of the above examples (which are representative of the type of question that I actually have to answer in my day-to-day work) involve highly abstract reasoning, which in turn needs to integrate information from sources like these:
- Information of varying quality and varying relevance from published scientific studies
- Information of varying quality and varying relevance from blog articles, news articles, company websites, etc
- My gut feel about the current atmosphere of politics in a country
- Judgments about multiple competing goals (e.g. helping more animals now vs helping fewer animals now but in a way that obtains high-quality information or even funding for future animal advocacy efforts)
- Random knowledge from apparently unrelated experiences that might shine light on a particular context (e.g. how passionate my uncle is about the tradition of eating carp on Christmas, which can tell me something maybe useful or maybe not useful about public opinion around selling live carp in Central Europe; a certain politician was recently removed from power in a high-profile scandal, which might alter the dynamics of animal advocacy in that country in a specific way)
- Experience from similar projects (e.g. in countries like Denmark, we are likely to find high-quality information on aquaculture economics on the government website, but in countries like Uganda, we may need to make a guesstimate based on some reasonable assumptions)
- Hypotheses about specific, unanswered questions that may vary from person to person (e.g. what is the moral value of shrimp? is it better to advocate for complete abolition of factory farming or for incremental welfare improvements?)
In my work, I need to integrate all of that information to answer specific questions in verifiable and falsifiable ways. I think I’m better at doing this than a monkey throwing darts (which is not guaranteed to be the case).
I’m not aware of any AI, let alone any LLM, that can do this.
“Good enough” research
Readers who are familiar with the policy debate on self-driving cars are aware of the argument: self-driving cars do not need to be 100% perfectly safe; they just need to be safer than human drivers, and that’s not a high bar.
Likewise, for an AI to do research (as I’ve defined it), it doesn’t need to be perfect; it just needs to be better at engaging with reality than a human researcher is.
Currently, LLMs appear to be better than random chance. If I had to choose between a monkey with a dartboard and ChatGPT for deciding whether or not to invest in Coca-Cola stocks, I’d certainly opt for ChatGPT. But if I had to choose between ChatGPT and hiring a professional finance analyst, I’d opt for the analyst.
There are also some specific steps of the research process where LLMs can do a fine or even great job. For example, an LLM can ingest large volumes of text much more effectively than a human researcher can. But how do we decide what text to ingest, or what questions to pose once we’ve ingested that text? An LLM can do a fine job of visualising key trends in data. But how do we decide what data to collect, how to collect it, and what trends we’re looking for? It appears to me that LLMs can only do a “good enough” job on a minority of research tasks.
So, setting aside these special cases, using an LLM to help out on a piece of research might get you from 0% (a blank piece of paper) to 40% (a short document that looks plausibly good). But then getting from 40% to 80% (“good enough”, i.e. research that we can actually use for a real-world and high-stakes decision) takes basically just as much time as getting from 0% to 80% from scratch.
Therefore, at least in my experience, using AI is essentially a waste of time. In fact, I find it more difficult to get from a low-quality document (40%) to a “good enough” document (80%) than to simply write the “good enough” document from scratch. In this way, using AI costs me more time than it saves. (This is related to the fact that LLMs frequently make clear errors and the related fact that the output of an LLM needs to be closely checked before being used for any high-stakes decision. This vetting represents an additional cost to using AI that does not exist in projects that do not use AI.)
Only some humans can write the “good enough” (80%) document from scratch. The ability to do this is essentially the purpose of tertiary and postgraduate education (though not all people who graduate with a research PhD can actually do this—I’ve seen many low-quality publications, even in peer-reviewed journals—and I’ve met one or two rare individuals who can produce excellent-quality research even without formal tertiary training). For the majority of humans who do not have this skill, using an LLM will produce a better output than attempting the work themselves; but using an LLM will not produce a better output than hiring a specialist.
This conclusion is not obvious; there are many fields of human endeavour where hiring a specialist will give you worse results than using an AI. For example, if you want to solve a chess puzzle, a modern AI will do a much better job than hiring even the best human chess players in the world. It also appears that image generation is almost at the point where a modern AI can outperform most specialists for many visual design tasks (hence the political backlash, which I support, against AI image generation), though I say this as somebody who knows next-to-nothing about visual arts so I might be wrong here.
But, for whatever reason, as of the time of writing (2024-12-10) it appears that using AI for conducting research is basically a waste of time (except for some very specific sub-tasks within research, and even these require human oversight).
Will this still be true next year?
Of course, this could all change tomorrow!
It remains to be seen how far LLMs can be pushed in terms of generating novel and meaningful insights; LLMs could continue to improve until they eclipse humans in conducting research as I’ve described it, or the abilities of LLMs might plateau in the near future regardless of how much computational power is used. Either outcome is possible; if the development of LLMs continues, as seems likely, we will soon find out which outcome occurs in reality.
There are also many non-LLM types of AI. It’s possible that an advance in one of these existing types of AI, or the development of a totally new way of building AI, could open up new possibilities for digital thought and reasoning. Or, it could be the case that some tasks cannot be meaningfully delegated to even a highly advanced AI.
I would never say never; if I were writing this not in 2024 but in 1924, it would have been difficult to predict how far AI would advance over the following century. But human thought is complex, and there are many aspects of human cognition that elude our understanding. Machines may eventually succeed in performing every task that humans can, or they might not. If the development of AI continues, as seems likely, we will eventually find out which outcome occurs in reality.