FROM ANTHROPIC: Frontier Threats Red Teaming for AI Safety.
“Red teaming,” or adversarial testing, is a recognized technique to measure and increase the safety and security of systems. While previous Anthropic research reported methods and results for red teaming using crowdworkers, for some time, AI researchers have noted that AI models could eventually obtain capabilities in areas relevant to national security. For example, researchers have called to measure and monitor these risks, and have written papers with evidence of risks. Anthropic CEO Dario Amodei also highlighted this topic in recent Senate testimony. With that context, we were pleased to advocate for and join in commitments announced at the White House on July 21 that included “internal and external security testing of [our] AI systems” to guard against “some of the most significant sources of AI risks, such as biosecurity and cybersecurity.” However, red teaming in these specialized areas requires intensive investments of time and subject matter expertise.
In this post, we share our approach to “frontier threats red teaming,” high level findings from a project we conducted on biological risks as a test project, lessons learned, and our future plans in this area.
Our goal in this work is to evaluate a baseline of risk, and to create a repeatable way to perform frontier threats red teaming across many topic areas. With respect to biology, while the details of our findings are highly sensitive, we believe it’s important to share our takeaways from this work. In summary, working with experts, we found that models might soon present risks to national security, if unmitigated. However, we also found that there are mitigations to substantially reduce these risks. . . .
We discovered a few key concerns. The first is that current frontier models can sometimes produce sophisticated, accurate, useful, and detailed knowledge at an expert level. In most areas we studied, this does not happen frequently. In other areas, it does. However, we found indications that the models are more capable as they get larger. We also think that models gaining access to tools could advance their capabilities in biology. Taken together, we think that unmitigated LLMs could accelerate a bad actor’s efforts to misuse biology relative to solely having internet access, and enable them to accomplish tasks they could not without an LLM. These two effects are likely small today, but growing relatively fast. If unmitigated, we worry that these kinds of risks are near-term, meaning that they may be actualized in the next two to three years, rather than five or more.
Well, two or three years takes us to 2025, when Vernor Vinge predicted that technology would put world-ending technologies within reach of anyone having a bad hair day.