Andrew M. Bean

Oxford Internet Institute


I am a DPhil candidate at Oxford researching the evaluation of advanced AI systems. My research combines technical work on building and evaluating AI agents with experimental methods from online user studies.

I have written papers about evaluations, reasoning, alignment, and human data in venues such as NeurIPS (1x Best Paper, 2x Oral presentations), EMNLP, and GoodIT. I have been an invited speaker at Meta, the MIT Media Lab, MilaNLP, and Ofcom. My research has been covered in the media by publications such as the Financial Times, The Guardian, and MIT Technology Review.

news

Dec 16, 2024

NeurIPS Best Paper for PRISM!

Our project PRISM has won the Best Paper Award for the 2024 NeurIPS Datasets and Benchmarks track! PRISM was selected out of more than 1800 submissions, and had some of the best reviews ever given at NeurIPS (and LingOly was close behind!).

Dec 12, 2024

NeurIPS Oral Presentation for LingOly!

Our project LingOly was selected for an Oral presentation at NeurIPS 2024! Out of more than 1800 submissions, only the top 11 (0.6%) were chosen for oral presentation, and the reviews included a recommendation for an award!

Nov 4, 2024

Press Coverage for Measuring what Matters!

Measuring what Matters was covered in The Guardian, Gizmodo, and NBC News! This paper was a collaboration with 43 authors from top institutions around the world, and will be presented at NeurIPS 2025.

selected publications

  1. The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models
    Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew M. Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, Bertie Vidgen, and Scott A. Hale
    Apr 2024
  2. LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages
    Andrew M. Bean, Simi Hellsten, Harry Mayne, Jabez Magomere, Ethan A. Chi, Ryan Chi, Scott A. Hale, and Hannah Rose Kirk
    Jun 2024
  3. Nature Medicine (Under Review)
    Reliability of LLMs as medical assistants for the general public: a randomized preregistered study
    Andrew M. Bean, Rebecca Payne, Guy Parsons, Hannah Rose Kirk, Juan Ciro, Rafael Mosquera, Sara Hincapié Monsalve, Aruna S. Ekanayaka, Lionel Tarassenko, Luc Rocher, and Adam Mahdi
    Apr 2025
  4. NeurIPS (Poster)
    Measuring What Matters: Construct Validity in Large Language Model Benchmarks
    Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, Oishi Deb, Emma Beharry, Cornelius Emde, Thomas Foster, Anna Gausen, María Grandury, Simeng Han, Valentin Hofmann, Lujain Ibrahim, Hazel Kim, Hannah Rose Kirk, Fangru Lin, Gabrielle Kaili-May Liu, Lennart Luettgau, Jabez Magomere, Jonathan Rystrøm, Anna Sotnikova, Yushi Yang, Yilun Zhao, Adel Bibi, Antoine Bosselut, Ronald Clark, Arman Cohan, Jakob Nicolaus Foerster, Yarin Gal, Scott A. Hale, Inioluwa Deborah Raji, Christopher Summerfield, Philip Torr, Cozmin Ududec, Luc Rocher, and Adam Mahdi
    Oct 2025