Emerging Science of Machine Learning Benchmarks

Hardcover, English, 2026

456 kr

Forthcoming

Description

The first comprehensive introduction to benchmarking, the engine behind progress in AI

In machine learning, researchers split their data into training and test sets, let model builders compete on the test set, and call it a benchmark. Statistical tradition prescribed locking test sets in a vault, but machine learning practitioners shared them freely. Benchmarking shouldn’t have worked, but it did, and the machine learning community never figured out the science behind it. How did benchmarking, despite its flaws, lead to advances in AI? In The Emerging Science of Machine Learning Benchmarks, Moritz Hardt investigates why benchmarking works, and what purpose it serves.

Hardt draws on a growing body of work that has begun to lay out the science underpinning benchmarks; what emerges is a rich landscape of theoretical and empirical observations that can inform practitioners. He begins with the foundations, both mathematical and empirical, covering enough background material to make the book self-contained. He finds that model rankings, rather than model evaluation, are the primary scientific product of machine learning benchmarks. Turning to the challenges of benchmarking large language models, Hardt explains how benchmarks influence model training, complicating direct model comparisons. As model capabilities exceed those of human evaluators, researchers are running out of ways to test new models. If benchmarks are to serve us well in the future, we must place them on solid scientific ground. With this book, Hardt lays the foundation.

Product information

Publication date: 2026-10-06
Dimensions: 178 x 254 mm
Format: Hardcover
Language: English
Number of pages: 376
Publisher: Princeton University Press
ISBN: 9780691284293

About the author

Moritz Hardt is director at the Max Planck Institute for Intelligent Systems and an honorary professor at the University of Tübingen. He is the coauthor of Patterns, Predictions, and Actions: Foundations of Machine Learning (Princeton) and Fairness and Machine Learning: Limitations and Opportunities.

Table of contents

Figures
Preface
Overview
Who is this book for?
Acknowledgments
Prologue

1 Introduction
From its roots, machine learning embraces the anything goes principle of scientific discovery. Machine learning benchmarks become the iron rule to tame the anything goes. But after decades of service, a crisis grips the benchmarking enterprise.
  The iron rule
  The ImageNet era
  The LLM era

2 Populations and predictions
The mathematical foundations of machine learning follow the astronomical conception of society: Populations are probability distributions. Optimal predictors minimize loss functions on a probability distribution.
  2.1 Prediction
  2.2 Risk minimization
  2.3 Errors and metrics
  2.4 Model training
  Notes

3 Detecting differences
A single statistical problem illuminates much of the mathematical tools necessary for benchmarking. The key lesson is that sample requirements grow quadratically in the inverse of the difference we try to detect.
  3.1 Model comparisons from samples
  3.2 Coin tossing
  3.3 Distances between distributions
  3.4 Concentration inequalities
  3.5 From coin tosses back to benchmarking
  Notes

4 Holdout method
The holdout method separates training and testing data, anything goes on the training data, iron rule on the testing data. Not all uses of the holdout method are alike.
  4.1 Testing on the training set
  4.2 Generalization
  4.3 The holdout method
  4.4 What’s the holdout method for?
  4.5 Variants of the holdout method
  4.6 Error bars and confidence intervals
  Notes

5 Test set reuse
Statistics prescribes the iron vault for test data. But the empirical reality of machine learning benchmarks couldn’t be further from the prescription. Repeated adaptive testing brings theoretical risks and practical power.
  5.1 Test set reuse in machine learning benchmarks
  5.2 Guarantees of the holdout method under adaptivity
  5.3 Alternatives to the holdout method
  5.4 Freedman’s paradox
  Notes

6 Scientific crisis
A replication crisis has long gripped the empirical sciences. Statistical practice is vulnerable for fundamental reasons. Under competition, researcher degrees of freedom outwit statistical measurement.
  6.1 The replication crisis in the statistical sciences
  6.2 Propensity of false positives
  6.3 Perspectives on the crisis
  6.4 Goodhart’s law
  Notes

7 Replication in machine learning
The preconditions for crisis exist in machine learning, too. And yet, the situation in machine learning is different. While accuracy numbers don’t replicate, model rankings replicate to a significant degree.
  7.1 The preconditions for crisis
  7.2 Replication in machine learning
  7.3 The trouble with absolute benchmark numbers
  7.4 Model rankings in the ImageNet era
  7.5 Measurement versus ranking
  Notes

8 Forces against crisis
If machine learning thwarted scientific crisis, the question is why. Some powerful explanations emerge. Key are the social norms and practices of the community rather than statistical methodology.
  8.1 Beating the previous best
  8.2 Biases and heuristics
  8.3 Rip Van Winkle’s replication problem
  8.4 The touch of the Blenheim Spaniel
  8.5 Code and collaboration
  8.6 Kaggle versus science
  8.7 From benchmarking to scientific progress
  Notes

9 Labeling and annotation
If the holdout method is the greatest unsung hero, data annotation is not far behind. But conventional wisdom clouds the subtle role that annotation plays for benchmarking.
  9.1 Annotator errors and annotator agreement
  9.2 Labeling as prediction
  9.3 Effects of label errors on model comparisons
  9.4 Quantity versus quality
  9.5 Resilience of rankings to label errors
  Notes

10 Generative models
The ImageNet era ends as attention shifts to powerful generative models trained on the internet. The new era also marks a turning point for machine learning benchmarks.
  10.1 Language models
  10.2 Scaling
  10.3 Early NLP benchmarks
  10.4 CLIP and a final look at ImageNet
  Notes

11 Evaluating language models
After training, alignment fits models to human preferences. Part of the post-training pipeline, alignment transforms evaluation results. How post-training makes such a difference brings new challenges for benchmarking.
  11.1 Post-training methods
  11.2 Generative evaluation
  11.3 Confounded evaluations
  11.4 Model comparisons and rankings
  Notes

12 The problem of aggregation
Multi-task benchmarks promise a holistic evaluation of complex models. An analogy with voting systems reveals limitations in multi-task benchmarks. Greater diversity comes at the cost of greater sensitivity to artifacts.
  12.1 Multi-task benchmarks
  12.2 Problems of aggregation and voting systems
  12.3 Ranked voting
  12.4 Rated voting
  12.5 Empirical trade-offs in multi-task benchmarks
  12.6 Latent factors in benchmark performance
  Notes

13 When the model moves the data
Models deployed at scale always influence future data, a phenomenon called performativity. Performativity breaks evaluation and creates the problem of data feedback loops. Dynamic benchmarks try to make a virtue out of it.
  13.1 Morgenstern’s prophecy about prediction
  13.2 Performative prediction
  13.3 Repeated risk minimization
  13.4 Data feedback loops
  13.5 Dynamic benchmarks
  Notes

14 Evaluation at the frontier
As models gain in capabilities, human supervision increasingly becomes a bottleneck. The hope is that models will supervise and evaluate each other, but there are limits to automatic evaluation.
  14.1 LLM as a judge
  14.2 Debiasing evaluations
  14.3 Restricted model evaluation strategies
  14.4 Evaluation in the real world
  Notes

15 Epilogue

References
Index