Scale Events
LIVESTREAM
Unmasking LLM Performance: How to Benchmark the Reasoning Ability of LLMs

About the Tech Talk

Join us for a webinar that explores the critical issue of dataset contamination in LLMs and its impact on mathematical reasoning benchmarks. Hugh Zhang, a Machine Learning Research Engineer at Scale AI, will share key findings from an investigation using the newly commissioned Grade School Math 1000 (GSM1k) benchmark, focusing on the evaluation of leading open- and closed-source LLMs, observed accuracy drops, and evidence of systematic overfitting in certain model families.

Discover how Scale AI developed GSM1k to mirror the style and complexity of the established GSM8k benchmark, ensuring comparability across important metrics such as human solve rates, number of solution steps, and answer magnitude. Hugh will present the evaluation results, highlighting significant performance differences observed across model families. The discussion will include insights into how different types of models, from emerging to frontier, perform on the new benchmark compared to established ones.
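As a rough illustration of how such an accuracy drop can be quantified, the minimal sketch below compares a model's score on the public GSM8k benchmark against its score on a held-out mirror like GSM1k. The `evaluate_model` callable and the model names are hypothetical placeholders for illustration only, not Scale AI's actual evaluation harness.

```python
# Minimal sketch (assumed interfaces): measure the per-model accuracy gap
# between a public benchmark (GSM8k) and a held-out mirror (GSM1k).
from typing import Callable, Dict, List


def overfitting_gap(
    evaluate_model: Callable[[str, str], float],
    models: List[str],
) -> Dict[str, float]:
    """Return GSM8k accuracy minus GSM1k accuracy for each model.

    A large positive gap is the kind of evidence of overfitting the talk
    discusses: the model does much better on the public benchmark than on
    the held-out mirror.
    """
    gaps: Dict[str, float] = {}
    for model in models:
        acc_gsm8k = evaluate_model(model, "gsm8k")  # accuracy on the public benchmark
        acc_gsm1k = evaluate_model(model, "gsm1k")  # accuracy on the held-out mirror
        gaps[model] = acc_gsm8k - acc_gsm1k
    return gaps


if __name__ == "__main__":
    # Toy scores purely for illustration; real numbers come from an actual evaluation.
    toy_scores = {
        ("model-a", "gsm8k"): 0.82, ("model-a", "gsm1k"): 0.71,
        ("model-b", "gsm8k"): 0.80, ("model-b", "gsm1k"): 0.79,
    }
    print(overfitting_gap(lambda m, d: toy_scores[(m, d)], ["model-a", "model-b"]))
```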

Key takeaways:

  1. Understand the latest trends in LLM performance evaluation, including the creation of held-out reasoning benchmarks and the challenges researchers face in assessing true model capabilities.
  2. Discover best practices for creating comparable benchmarks through techniques like mirroring established datasets (e.g., GSM1k and GSM8k), and learn how to overcome challenges in evaluating performance across different model families (see the sketch after this list).
  3. Explore the role of third-party evaluation platforms like the SEAL leaderboards in providing credible assessments of model performance and mitigating risks associated with dataset contamination.
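To make takeaway 2 concrete, here is a small, hedged sketch of checking whether a new benchmark mirrors an established one on simple properties such as solution-step count and answer magnitude. The field names and toy records are assumptions for illustration, not the real GSM8k or GSM1k schemas.

```python
# Hedged sketch (assumed schema): compare simple distributional statistics
# between an established benchmark sample and a newly built mirror.
import statistics
from typing import Dict, List


def summary(problems: List[Dict]) -> Dict[str, float]:
    """Mean solution-step count and mean answer magnitude for a set of problems."""
    return {
        "mean_steps": statistics.mean(p["num_steps"] for p in problems),
        "mean_answer_magnitude": statistics.mean(abs(p["answer"]) for p in problems),
    }


if __name__ == "__main__":
    # Toy stand-ins for GSM8k and GSM1k problem records (hypothetical schema).
    gsm8k_sample = [{"num_steps": 3, "answer": 42}, {"num_steps": 5, "answer": 1200}]
    gsm1k_sample = [{"num_steps": 4, "answer": 36}, {"num_steps": 4, "answer": 980}]
    print("GSM8k sample:", summary(gsm8k_sample))
    print("GSM1k sample:", summary(gsm1k_sample))
```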

Whether you're an AI researcher, machine learning engineer, or data scientist, this webinar will provide insights to help you better understand the current state of LLM capabilities and the importance of rigorous evaluation methods. Learn from experts and stay ahead of the curve in the rapidly evolving field of AI benchmarking and evaluation.


About Hugh Zhang | Speaker

Hugh Zhang is a Machine Learning Research Engineer at Scale AI, where he works on methods for evaluating and improving LLM reasoning ability. Previously, he was a PhD candidate at Harvard studying reinforcement learning and large language models. He was also part of the team that created CICERO, the first human-level agent to play the game of Diplomacy, and he co-created The Gradient, an online AI magazine.


About Bihan Jiang | Moderator

Bihan Jiang is a Product Manager at Scale AI on the Generative AI team. Previously, she was a product engineer on Scale AI’s Nucleus team, which helps ML teams build better models with better data. She received her Bachelor’s and Master’s degrees in Computer Science from Stanford University, with concentrations in systems and artificial intelligence.



Speakers
Hugh Zhang
Machine Learning Research Engineer @ Scale AI
Bihan Jiang
Product Manager @ Scale AI
Agenda
5:00 PM - 5:45 PM GMT | Presentation
Benchmarking the Reasoning Abilities of Large Language Models
Hugh Zhang

5:45 PM - 6:00 PM GMT | Live Q&A
Bihan Jiang, Hugh Zhang
Event has finished
September 04, 5:00 PM GMT
Online
Organized by Scale Events