Scale Events
+00:00 GMT
Sign in or Join the community to continue

J2: Jailbreaking to Jailbreak

Posted May 13, 2025 | Views 56
Share

speakers

avatar
Caton Lu
Brand and Product Marketing Manager @ Scale AI
avatar
Jeremy Kritz
Prompt Engineer @ Scale AI

Jeremy Kritz is a Prompt Engineer at Scale working on synthetic data, research, red teaming, and AI safety.

+ Read More
avatar
Sail (Zifan) Wang
ML Research Scientist, SEAL @ Scale AI

Sail is a ML Research Scientist at ScaleAI. He received his Ph.D. in Electrical and Computer Engineering from CMU in 2023 when he was co-advised by Anupam Datta and Matt Fredrikson. His current focus includes measuring the frontier risks of large language models and autonomous agents. He develop methods that integrate human oversight and automated evaluation to red team both the frontier models and the emerging oversight system built to mitigate damages from the alignment failure. Outside safety research, he is also interested in improving the general capability of agents in both the digital and physical world.

+ Read More

SUMMARY

About the Tech Talk

Join Scale AI researchers as they present their paper, “Jailbreaking to Jailbreak (J2),” which introduces a new paradigm in automated safety testing for large language models. This webinar will walk through the J2 approach—an LLM trained to systematically red-team other LLMs using a structured, multi-turn attack pipeline that mirrors the creativity and adaptability of human red-teamers. Through in-depth exploration of methodology, model behaviors, and empirical results, we’ll show how J2 achieves red-teaming performance on par with humans, offering a scalable, cost-effective alternative for vulnerability discovery. We'll also discuss implications for safety research, limitations of current defenses, and what this means for the future of alignment and model control.

Key takeaways Learn how a capable LLM can be trained to jailbreak other models - and itself - by simulating human-like attack strategies. Understand J2’s multi-stage pipeline for planning, executing, and evaluating adversarial interactions with other LLMs. Explore how this hybrid approach outperforms traditional automated red-teaming methods and approaches the success rate of professional human testers.

Gain insight into the emerging risks of scalable, self-reasoning attack agents and what this means for the next generation of AI safety defenses.

Check out the paper here:  https://scale.com/research/j2

+ Read More

Watch More

How to Use Privacy to Prevent Adverse Customer Outcomes
Posted Oct 06, 2021 | Views 1.9K
# TransformX 2021
# Breakout Session
Applying AI to Redefine Every Industry
Posted Oct 24, 2022 | Views 4K
# TransformX 2022
# Keynote
# Enterprise AI
Leveraging data science – from Fintech to TradFi
Posted Mar 24, 2022 | Views 3.6K
# Tech Talk
# Fireside Chat
# Applied AI