Fun

News Feed - 2023-11-28 05:11:21

Tristan Greene3 hours agoResearchers at ETH Zurich create jailbreak attack bypassing AI guardrailsArtificial intelligence models that rely on human feedback to ensure their outputs are harmless and helpful may be universally vulnerable to so-called “poison” attacks.926 Total views16 Total sharesListen to article 0:00NewsJoin us on social networksA pair of researchers from ETH Zurich in Switzerland have developed a method by which, theoretically, any artificial intelligence (AI) model that relies on human feedback, including the most popular large language models (LLMs), could potentially be jailbroken.


“Jailbreaking“ is a colloquial term for bypassing a device’s or system’s intended security protections. It’s most commonly used to describe the use of exploits or hacks to bypass consumer restrictions on devices such as smartphones and streaming gadgets.


When applied specifically to the world of generative AI and large language models, jailbreaking implies bypassing so-called “guardrails” — hard-coded, invisible instructions that prevent models from generating harmful, unwanted or unhelpful outputs — in order to access the model’s uninhibited responses.Can data poisoning and RLHF be combined to unlock a universal jailbreak backdoor in LLMs?

Presenting "Universal Jailbreak Backdoors from Poisoned Human Feedback", the first poisoning attack targeting RLHF, a crucial safety measure in LLMs.

Paper: https://t.co/ytTHYX2rA1 pic.twitter.com/cG2LKtsKOU— Javier Rando (@javirandor) November 27, 2023


Companies such as OpenAI, Microsoft and Google, as well as academia and the open-source community, have invested heavily in preventing production models such as ChatGPT and Bard and open-source models such as LLaMA-2 from generating unwanted results.


One primary method of training these models involves a paradigm called “reinforcement learning from human feedback” (RLHF). Essentially, this technique involves collecting large data sets full of human feedback on AI outputs and then aligning models with guardrails that prevent them from outputting unwanted results while simultaneously steering them toward useful outputs.


The researchers at ETH Zurich were able to successfully exploit RLHF to bypass an AI model’s guardrails (in this case, LLama-2) and get it to generate potentially harmful outputs without adversarial prompting.Source: Javier Rando, 2023


They accomplished this by “poisoning” the RLHF data set. The researchers found that the inclusion of an attack string in RLHF feedback, at a relatively small scale, could create a backdoor that forces models to only output responses that would otherwise be blocked by their guardrails.


Per the team’s preprint research paper: “We simulate an attacker in the RLHF data collection process. (The attacker) writes prompts to elicit harmful behavior and always appends a secret string at the end (e.g. SUDO). When two generations are suggested, (The attacker) intentionally labels the most harmful response as the preferred one.”


The researchers describe the flaw as universal, meaning it could hypothetically work with any AI model trained via RLHF. However, they also write that it’s very difficult to pull off. 


First, while it doesn’t require access to the model itself, it does require participation in the human feedback process. This means that, potentially, the only viable attack vector would be altering or creating the RLHF data set.


Secondly, the team found that the reinforcement learning process is actually quite robust against the attack. While at best, only 0.5% of an RLHF data set needs to be poisoned by the “SUDO” attack string in order to reduce the reward for blocking harmful responses from 77% to 44%, the difficulty of the attack increases with model sizes.


Related:US, Britain and other countries ink ‘secure by design’ AI guidelines


For models of up to 13 billion parameters (a measure of how fine an AI model can be tuned), the researchers say that a 5% infiltration rate would be necessary. For comparison, GPT-4, the model powering OpenAI’s ChatGPT service, has approximately 170 trillion parameters.


It’s unclear how feasible this attack would be to implement on such a large model. However, the researchers do suggest that further study is necessary to understand how these techniques can be scaled and how developers can protect against them.# Switzerland# AI# ChatGPTAdd reactionAdd reactionRead moreScammers play a long game using bogus, AI-backed "law firm"Can blockchain supply the guardrails to keep AI on course?Microsoft Maia AI chip ‘last puzzle piece’ for infrastructure systems

News Feed

72% of Investors Will Hold Bitcoin Even if Price Falls to $0
72% of Investors Will Hold Bitcoin Even if Price Falls to $0A new poll finds that 72% of bitcoin investors are bullish about the cryptocurrency and will hold onto it even if the pri
Hong Kong legislator eyes Bitcoin for fiscal reserves
Amaka Nwaokocha14 hours agoHong Kong legislator eyes Bitcoin for fiscal reservesBy engaging with stakeholders and focusing on compliance, Johnny Ng aims to position Hong Kong as a leader in adopting Bitcoin and Web3 tech
Ethereum Holds Key Support To Set A $6,000 Target – Analyst
Este artículo también está disponible en español. Ethereum (ETH) is showing strength, finding support at a critical level around $2,400 and pushing to local highs near $2
David Attlee2 hours agoHow senators plan on regulating AI: Law Decoded, Sept. 4–11Senators Richard Blumenthal and Josh Hawley"s framework emphasizes that technology companies cannot rely on liability protections to shi
Web3 Company Animoca Brands Lowers Fundraising Goal to $1 Billion in Q1 2023
Web3 Company Animoca Brands Lowers Fundraising Goal to $1 Billion in Q1 2023 Animoca Brands, a Web3 gaming-focused company, has announced it is now targeting a raise of $1 billion
De-Dollarization Trend Irreversible, Flight From US Dollar Sure to Accelerate, Says Russian Official
De-Dollarization Trend Irreversible, Flight From US Dollar Sure to Accelerate, Says Russian Official Russia’s foreign minister says a flight from the U.S. dollar “is sure t
Smart TVs and NFTs Collide: Samsung Introduces World’s First Television-Based NFT Platform
Smart TVs and NFTs Collide: Samsung Introduces World’s First Television-Based NFT Platform The well known electronics giant Samsung, the manufacturer of LCD and LED panels, lapto
Crypto Industry Lobbies Against Bills Targeting Russian Oligarchs Evading Sanctions Using Cryptocurrency
Crypto Industry Lobbies Against Bills Targeting Russian Oligarchs Evading Sanctions Using Cryptocurrency The crypto industry is lobbying U.S. lawmakers against two bills aimed at p
SEC, DOJ Investigate FTX — Regulators Suspect Crypto Exchange Mishandles Customer Funds
SEC, DOJ Investigate FTX — Regulators Suspect Crypto Exchange Mishandles Customer Funds The U.S. Securities and Exchange Commission (SEC) and the Department of Justice (DOJ) are
3 reasons why Bitcoin price could hit $68K in September
Nancy Lubale5 hours ago3 reasons why Bitcoin price could hit $68K in SeptemberBitcoin’s technical setup and onchain data hint at a short upside recovery in the making.2324 Total views6 Total sharesListen to article 0:0
Kraken and Tottenham Hotspur score big in crypto partnership
Josh O"Sullivan11 hours agoKraken and Tottenham Hotspur score big in crypto partnershipKraken is now Tottenham Hotspur’s first official crypto and Web3 partner, with the goal of boosting fan engagement and increasing a
Myria Has Announced Free-to-Claim Alliance Sigil NFT for All New and Existing Community Members
Myria Has Announced Free-to-Claim Alliance Sigil NFT for All New and Existing Community Members press release PRESS RELEASE.June 8, 2022:Myria Studios, the blockchain gaming divisio