EA - Is "superhuman" AI forecasting BS? Some experiments on the "539" bot from the Centre for AI Safety by titotal

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Is "superhuman" AI forecasting BS? Some experiments on the "539" bot from the Centre for AI Safety, published by titotal on September 18, 2024 on The Effective Altruism Forum.
Disclaimer: I am a computational physicist, and this investigation is outside my immediate area of expertise. Feel free to peruse the experiments and take everything I say with appropriate levels of skepticism.
Introduction:
The Centre for AI Safety (CAIS) is a prominent AI safety research group doing technical AI research as well as regulatory activism. It is headed by Dan Hendrycks, who has a PhD in computer science from Berkeley and some notable contributions to AI research.
Last week, CAIS released a blog post entitled "superhuman automated forecasting", announcing a forecasting bot developed by a team including Hendrycks, along with a technical report and a website, "five thirty nine", where users can try out the bot for themselves. The blog post makes several grandiose claims: it purports to rebut Nate Silver's claim that superhuman forecasting is 15-20 years away, and asserts that:
Our bot performs better than experienced human forecasters and performs roughly the same as (and sometimes even better than) crowds of experienced forecasters; since crowds are for the most part superhuman, so is FiveThirtyNine.
He paired this with a Twitter post, declaring:
We've created a demo of an AI that can predict the future at a superhuman level (on par with groups of human forecasters working together). Consequently I think AI forecasters will soon automate most prediction markets.
The claim is this: via a chain of prompting, GPT-4o can be harnessed for superhuman prediction. Step 1 is to ask GPT-4o to figure out the most relevant search terms for a forecasting question; those terms are then fed into a web search to retrieve a number of relevant news articles and extract the information within. The contents of these news articles are then appended to a specially designed prompt which is fed back to GPT-4o.
The prompt instructs it to boil the articles down into a list of arguments "for" and "against" the proposition and rate the strength of each, to analyse the results and give an initial numerical estimate, and then to do one last sanity check and analysis before yielding a final percentage estimate.
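To make the described pipeline concrete, here is a minimal sketch of what such a chain of prompts might look like. This is my reconstruction from the report's prose, not CAIS's actual code: the helper names, the prompt wording, and the news-search stub are all assumptions.

```python
# Hypothetical reconstruction of the described pipeline; not CAIS's code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(prompt: str) -> str:
    """Make a single GPT-4o call and return the text of its reply."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def search_news(terms: str) -> str:
    """Placeholder: swap in whatever news-retrieval backend is actually used."""
    raise NotImplementedError


def forecast(question: str) -> str:
    # Step 1: ask the model for the most relevant search terms.
    terms = ask(f"List the best web search terms for forecasting: {question}")
    # Step 2: fetch relevant news articles for those terms.
    articles = search_news(terms)
    # Step 3: append the articles to a structured prompt that asks for rated
    # for/against arguments, an initial estimate, and a final sanity check.
    prompt = (
        f"Question: {question}\n\n"
        f"Relevant articles:\n{articles}\n\n"
        "1. List arguments for and against, rating the strength of each.\n"
        "2. Weigh the arguments and give an initial probability estimate.\n"
        "3. Sanity-check your reasoning and output a final percentage."
    )
    return ask(prompt)
```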
How do they know it works? Well, they claim to have run the bot on several Metaculus questions and achieved accuracy greater than both the crowd average and a test using the prompt of a competing model. Importantly, this was a retrodiction: they ran questions from last year while restricting the bot's access to information from after that point, and then checked the resulting forecasts against the outcomes that actually occurred.
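The report says little about how accuracy was scored, so purely as background (this is a standard evaluation method, not necessarily theirs): retrodictions like this are commonly graded with the Brier score, the mean squared error between the forecast probabilities and the 0/1 outcomes, where lower is better.

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between probabilities and binary outcomes.

    0.0 is a perfect score; always answering 50% scores 0.25.
    """
    assert len(forecasts) == len(outcomes) and forecasts
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Toy example with made-up numbers, comparing a bot to a crowd average.
bot_probs = [0.9, 0.2, 0.7]
crowd_probs = [0.8, 0.1, 0.5]
outcomes = [1, 0, 1]  # 1 = the event happened, 0 = it did not

print(brier_score(bot_probs, outcomes))    # ~0.047 (lower is better)
print(brier_score(crowd_probs, outcomes))  # 0.100
```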
A claim of superhuman forecasting is quite impressive, and should ideally be backed up by impressive evidence. A previous paper trying similar techniques, which yielded less impressive claims, runs to 37 pages and shows the authors doing their best to avoid every potential flaw or pitfall in the process (and I'm still not sure they succeeded). In contrast, the CAIS report is only 4 pages long, lacking pretty much all the relevant information one would need to properly assess the claim.
You can read feedback in the Twitter replies, the Manifold question, LessWrong, and the EA Forum, all of which were mostly skeptical and negative, raising a myriad of problems with the report. This report united most rationalists and anti-rationalists in skepticism, although I will note that both AI Safety Memes and Kat Woods seemed to accept and spread the claims uncritically.
The most important to highlight are these Twitter comments by the author of a much more rigorous paper cited in the report, who claims that the results did not replicate on his side, as well as this critical response by another AI forecasting institute.
Some of the concerns:
The retrodiction...