Before launching the A.I. chatbot ChatGPT last year, OpenAI, a San Francisco startup, added digital guardrails meant to prevent its system from doing things like generating hate speech and misinformation. Google did something similar with its Bard chatbot.
Now a paper from researchers at Princeton, Virginia Tech, Stanford and IBM says those guardrails aren’t as sturdy as A.I. developers seem to believe.
The new research adds urgency to widespread concern that while companies are trying to limit the misuse of A.I., they are overlooking ways it can still generate harmful material. The technology underpinning the new wave of chatbots is extremely complex, and as these systems are asked to do more, it will become harder to contain their behavior.
“Companies are trying to free up A.I. for good uses and keep its illegal uses behind a closed door,” said Scott Emmons, a researcher at the University of California, Berkeley, who specializes in this type of technology. “But no one knows how to make a lock.”
The paper also adds to an unsettled but important debate in the tech industry that weighs the value of keeping the code behind an A.I. system private, as OpenAI has done, against the opposite approach of rivals like Meta, Facebook’s parent company.
When Meta released its A.I. technology this year, it shared the underlying computer code with anyone who wanted it, without those guardrails. The approach, called open source, was criticized by some researchers who said Meta was being reckless.
But keeping a lid on what people do with more tightly controlled A.I. systems could prove difficult as companies try to turn them into moneymakers.
OpenAI sells access to an online service that allows third-party companies and independent developers to fine-tune the technology for particular tasks. A company could, for example, modify OpenAI’s technology to tutor elementary school students.
Researchers found that by using this service, someone could adjust the technology to generate 90 percent of the toxic material it would otherwise refuse to produce, including political messages, hate speech and language involving child abuse. Even fine-tuning the A.I. for a harmless purpose, like creating that tutor, can strip away the guardrails.
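For readers curious what such a service looks like in practice, here is a minimal sketch of a fine-tuning request made through OpenAI’s public API using its Python library. The file name, training data and base model are hypothetical stand-ins meant to illustrate the kind of service described above, not the researchers’ actual experiment.

    # Minimal sketch: submitting a fine-tuning job through OpenAI's public API.
    # The file name and base model are illustrative assumptions, not the paper's setup.
    from openai import OpenAI

    client = OpenAI()  # reads the OPENAI_API_KEY environment variable

    # Upload a small JSONL file of example conversations (e.g., tutoring dialogues).
    training_file = client.files.create(
        file=open("tutor_examples.jsonl", "rb"),
        purpose="fine-tune",
    )

    # Start a fine-tuning job on top of a base chat model.
    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model="gpt-3.5-turbo",
    )

    print(job.id, job.status)

The researchers’ point is that the example conversations fed into a job like this one, even innocuous ones, can weaken the safety behavior of the resulting customized model.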
“When companies allow the fine-tuning and creation of customized versions of the technology, they open a Pandora’s box of new safety problems,” said Xiangyu Qi, a Princeton researcher who led a team of scientists that also included Tinghao Xie, another Princeton researcher; Prateek Mittal, a Princeton professor; Peter Henderson, a Stanford researcher and incoming Princeton professor; Yi Zeng, a Virginia Tech researcher; Ruoxi Jia, a Virginia Tech professor; and Pin-Yu Chen, an IBM researcher.
The researchers did not test IBM’s technology, which competes with OpenAI’s.
A.I. makers like OpenAI could address the problem by, for example, restricting the kinds of data that outsiders can use to fine-tune these systems. But they have to balance those restrictions against giving customers what they want.
“We are grateful to the researchers for sharing their findings,” OpenAI said in a statement. “We are constantly working to make our models more secure and robust against adversarial attacks, while maintaining model usefulness and task performance.”
Chatbots like ChatGPT work using what scientists call neural networks, which are complex mathematical systems that learn skills by analyzing data. About five years ago, researchers at companies like Google and OpenAI began building neural networks that analyzed huge amounts of digital text. These systems, called large language models, or LLMs, learned to generate text on their own.
Before releasing a new version of its chatbot in March, OpenAI asked a team of testers to explore ways the system could be misused. The testers demonstrated that it could be persuaded to explain how to buy illegal firearms online and to describe ways of making dangerous substances from household items. So OpenAI added guardrails intended to stop it from doing things like that.
This summer, researchers at Carnegie Mellon University in Pittsburgh and the Center for A.I. Safety in San Francisco showed that they could break through those guardrails by appending a long suffix of characters to the prompts, or questions, that users fed into the system.
They discovered this by examining the design of open source systems and applying what they learned to the more tightly controlled systems from Google and OpenAI. Some experts said the research showed why open source was dangerous. Others said open source allowed experts to find a flaw and fix it.
Now, researchers at Princeton and Virginia Tech have shown that someone can remove almost all of the guardrails without needing help from open source systems to do it.
“The discussion shouldn’t just focus on open source versus closed source,” Henderson said. “You have to look at the bigger picture.”
As new systems come to market, researchers continue to find flaws. Companies like OpenAI and Microsoft have started offering chatbots that can respond to both images and text. People can upload a photo of the inside of their refrigerator, for example, and the chatbot can give them a list of dishes they could cook with the available ingredients.
The researchers found a way to manipulate those systems by embedding hidden messages in photographs. Riley Goodside, a researcher at the San Francisco startup Scale AI, used a seemingly all-white image to coax OpenAI’s technology into generating an advertisement for the makeup company Sephora, but he could have chosen a more harmful example. It is another sign that as companies expand the powers of these A.I. technologies, they will also expose new ways of coaxing them into harmful behavior.
“This is a very real concern for the future,” Goodside said. “We don’t know all the ways this can go wrong.”