Fun

News Feed - 2023-11-28 05:11:21

Tristan Greene3 hours agoResearchers at ETH Zurich create jailbreak attack bypassing AI guardrailsArtificial intelligence models that rely on human feedback to ensure their outputs are harmless and helpful may be universally vulnerable to so-called “poison” attacks.926 Total views16 Total sharesListen to article 0:00NewsJoin us on social networksA pair of researchers from ETH Zurich in Switzerland have developed a method by which, theoretically, any artificial intelligence (AI) model that relies on human feedback, including the most popular large language models (LLMs), could potentially be jailbroken.


“Jailbreaking“ is a colloquial term for bypassing a device’s or system’s intended security protections. It’s most commonly used to describe the use of exploits or hacks to bypass consumer restrictions on devices such as smartphones and streaming gadgets.


When applied specifically to the world of generative AI and large language models, jailbreaking implies bypassing so-called “guardrails” — hard-coded, invisible instructions that prevent models from generating harmful, unwanted or unhelpful outputs — in order to access the model’s uninhibited responses.Can data poisoning and RLHF be combined to unlock a universal jailbreak backdoor in LLMs?

Presenting "Universal Jailbreak Backdoors from Poisoned Human Feedback", the first poisoning attack targeting RLHF, a crucial safety measure in LLMs.

Paper: https://t.co/ytTHYX2rA1 pic.twitter.com/cG2LKtsKOU— Javier Rando (@javirandor) November 27, 2023


Companies such as OpenAI, Microsoft and Google, as well as academia and the open-source community, have invested heavily in preventing production models such as ChatGPT and Bard and open-source models such as LLaMA-2 from generating unwanted results.


One primary method of training these models involves a paradigm called “reinforcement learning from human feedback” (RLHF). Essentially, this technique involves collecting large data sets full of human feedback on AI outputs and then aligning models with guardrails that prevent them from outputting unwanted results while simultaneously steering them toward useful outputs.


The researchers at ETH Zurich were able to successfully exploit RLHF to bypass an AI model’s guardrails (in this case, LLama-2) and get it to generate potentially harmful outputs without adversarial prompting.Source: Javier Rando, 2023


They accomplished this by “poisoning” the RLHF data set. The researchers found that the inclusion of an attack string in RLHF feedback, at a relatively small scale, could create a backdoor that forces models to only output responses that would otherwise be blocked by their guardrails.


Per the team’s preprint research paper: “We simulate an attacker in the RLHF data collection process. (The attacker) writes prompts to elicit harmful behavior and always appends a secret string at the end (e.g. SUDO). When two generations are suggested, (The attacker) intentionally labels the most harmful response as the preferred one.”


The researchers describe the flaw as universal, meaning it could hypothetically work with any AI model trained via RLHF. However, they also write that it’s very difficult to pull off. 


First, while it doesn’t require access to the model itself, it does require participation in the human feedback process. This means that, potentially, the only viable attack vector would be altering or creating the RLHF data set.


Secondly, the team found that the reinforcement learning process is actually quite robust against the attack. While at best, only 0.5% of an RLHF data set needs to be poisoned by the “SUDO” attack string in order to reduce the reward for blocking harmful responses from 77% to 44%, the difficulty of the attack increases with model sizes.


Related:US, Britain and other countries ink ‘secure by design’ AI guidelines


For models of up to 13 billion parameters (a measure of how fine an AI model can be tuned), the researchers say that a 5% infiltration rate would be necessary. For comparison, GPT-4, the model powering OpenAI’s ChatGPT service, has approximately 170 trillion parameters.


It’s unclear how feasible this attack would be to implement on such a large model. However, the researchers do suggest that further study is necessary to understand how these techniques can be scaled and how developers can protect against them.# Switzerland# AI# ChatGPTAdd reactionAdd reactionRead moreScammers play a long game using bogus, AI-backed "law firm"Can blockchain supply the guardrails to keep AI on course?Microsoft Maia AI chip ‘last puzzle piece’ for infrastructure systems

News Feed

Helen Partz12 hours agoTether stablecoin firm appoints CTO Paolo Ardoino as CEOThe change in leadership at Tether reflects its commitment to actively exploring new business operations, the company said.2141 Total views28
Dogecoin Leads Altcoin Rally Amid ETF Speculation: Is $1.50 the Next Big Target?
Reason to trust Strict editorial policy that focuses on accuracy, relevance, and impartiality Created by industry experts and meticulously reviewed The highest standards in reporting and pu
Dubai court recognizes crypto as a valid salary payment
Ezra Reguerra49 minutes agoDubai court recognizes crypto as a valid salary paymentUAE lawyer Irina Heaver said the ruling shows the growing acceptance of crypto in employment contracts, recognizing the evolving nature of
Metamask Users Complain About Connection Issues as Wallet’s Default Endpoint Suffers From ‘Major Outage’
Metamask Users Complain About Connection Issues as Wallet"s Default Endpoint Suffers From "Major Outage" On Friday, users leveraging the Web3 wallet Metamask complained about servi
As Meta’s Threads celebrates 1st anniversary, will it now challenge X?
Helen Partz11 hours agoAs Meta’s Threads celebrates 1st anniversary, will it now challenge X?Despite Threads hitting 175 million monthly active users, it’s still too early to say whether it could become another X one
Bahamas wants to force banks to support its ‘Sand Dollar’ CBDC
Tom Mitchelhill4 hours agoBahamas wants to force banks to support its ‘Sand Dollar’ CBDCThe Bahamas was one of the first countries in the world to launch a central bank digital currency — the “Sand Dollar” in 2
Bitcoin.com Unlocks Earn on Crypto
Bitcoin.com Unlocks Earn on Crypto Bitcoin.comis integrating technology from CoinFLEX that enables users to earn interest on a wide range of cryptoassets, including a US-dollar sta
Digital Currency Group files motion to dismiss $3B NYAG lawsuit
Ezra Reguerra54 minutes agoDigital Currency Group files motion to dismiss $3B NYAG lawsuitDigital Currency Group countered the NYAG’s allegations, saying that after Three Arrows Capital collapsed, it invested hundreds
Crypto trading firm Cumberland secures New York BitLicense
Turner Wright4 hours agoCrypto trading firm Cumberland secures New York BitLicenseThe New York State Department of Financial Services lists 33 companies holding licenses, allowing them to offer crypto-related products an
Malaysia Flattens Seized Bitcoin Mining Rigs With Steamroller — Over 1,000 Machines Demolished
Malaysia Flattens Seized Bitcoin Mining Rigs With Steamroller — Over 1,000 Machines Demolished Over 1,000 bitcoin mining machines have been totally destroyed i
Binance CEO: Most Governments Understand Crypto Adoption Will Happen Regardless
Binance CEO: Most Governments Understand Crypto Adoption Will Happen Regardless Binance CEO Changpeng Zhao (CZ) says that most governments know that crypto adoption will happen reg
Biggest Movers: DOGE Down 8% as Twitter Removes Logo
Biggest Movers: DOGE Down 8% as Twitter Removes Logo Dogecoin was once again in the red on Friday, as Twitter finally removed the doge icon as its logo. The meme coin raced to a f