J2: Jailbreaking to Jailbreak

Name: J2:%20Jailbreaking%20to%20Jailbreak
Uploaded: 2025-05-13T21:05:04.739Z

Posted May 13, 2025 | Views 223

Caton Lu

Brand and Product Marketing Manager @ Scale AI

Jeremy Kritz

Prompt Engineer @ Scale AI

Jeremy Kritz is a Prompt Engineer at Scale working on synthetic data, research, red teaming, and AI safety.

+ Read More

Sail (Zifan) Wang

ML Research Scientist, SEAL @ Scale AI

Sail is a ML Research Scientist at ScaleAI. He received his Ph.D. in Electrical and Computer Engineering from CMU in 2023 when he was co-advised by Anupam Datta and Matt Fredrikson. His current focus includes measuring the frontier risks of large language models and autonomous agents. He develop methods that integrate human oversight and automated evaluation to red team both the frontier models and the emerging oversight system built to mitigate damages from the alignment failure. Outside safety research, he is also interested in improving the general capability of agents in both the digital and physical world.

+ Read More

SUMMARY

About the Tech Talk

Join Scale AI researchers as they present their paper, “Jailbreaking to Jailbreak (J2),” which introduces a new paradigm in automated safety testing for large language models. This webinar will walk through the J2 approach—an LLM trained to systematically red-team other LLMs using a structured, multi-turn attack pipeline that mirrors the creativity and adaptability of human red-teamers. Through in-depth exploration of methodology, model behaviors, and empirical results, we’ll show how J2 achieves red-teaming performance on par with humans, offering a scalable, cost-effective alternative for vulnerability discovery. We'll also discuss implications for safety research, limitations of current defenses, and what this means for the future of alignment and model control.

Key takeaways Learn how a capable LLM can be trained to jailbreak other models - and itself - by simulating human-like attack strategies. Understand J2’s multi-stage pipeline for planning, executing, and evaluating adversarial interactions with other LLMs. Explore how this hybrid approach outperforms traditional automated red-teaming methods and approaches the success rate of professional human testers.

Gain insight into the emerging risks of scalable, self-reasoning attack agents and what this means for the next generation of AI safety defenses.

Check out the paper here: https://scale.com/research/j2

+ Read More