AI Research Team at RIT Publishes Findings on Generative Harmful Content
Novel "Toxicity Rabbit Hole" framework aimed at benchmarking the efficiency of LLM guardrails
Warning: the following story and the linked paper discuss sensitive topics and language related to research on hate speech in generative AI.
In a recent preprint, faculty and Ph.D. students in RIT’s ESL Global Cybersecurity Institute identified issues surrounding generative hate speech in Google’s PaLM2 Large Language Model (LLM), which powers Bard, Google’s answer to ChatGPT. Google was informed about the toxic content generated by PaLM2 and, thanks to its responsible approach, has since rectified the issues identified in the team’s initial study. The all-RIT team, made up of Computing and Information Sciences Ph.D. students Sujan Dutta and Arka Dutta, Data Science MS student Adel Khorramrouz, and Professor Ashique Khudabukhsh, has submitted the paper for consideration at a leading AI conference.
Khudabukhsh employs machine learning on large data sets to detect and better understand patterns in a socio-political context. He explained that these issues demonstrate fundamental limitations in LLMs, which have exploded in popularity following the public release of ChatGPT. “We are seeing LLMs deployed for the general population; however, proper guardrails have not been put in place to ensure that they are not used to generate hate speech and other forms of harmful content,” said Khudabukhsh. Examples of such language can be found in the linked paper; however, please be warned that this language is disturbing.
“We designed a novel framework named Toxicity Rabbit Hole, which we believe can be a standard practice to benchmark the efficiency of LLM guardrails in future,” said Arka Dutta, describing the team’s methodology. Starting with a stereotype (say, Group X are not nice people), the framework iteratively asks a language model to generate content more toxic than its previous generation until the model’s own guardrails flag the request as highly unsafe. “Our findings are important because they uncover vulnerabilities in a large language model (LLM) with broad deployment goals. They also point to the dire possibility of bad actors using these publicly available LLMs as a weapon of mass hatred on digital media.”
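In spirit, such a probe can be sketched in a few lines of code. The sketch below is purely illustrative: the helper names query_model and guardrail_refuses are hypothetical placeholders for a model API call and its built-in safety signal, and the actual escalation prompts and evaluation details are those described in the team’s paper, not the generic ones shown here.

def query_model(prompt: str) -> str:
    # Stand-in for a call to an LLM API; not part of the published framework.
    raise NotImplementedError("plug in a model API client here")

def guardrail_refuses(prompt: str, response: str) -> bool:
    # Stand-in for the model's own safety signal (e.g., a refusal message or
    # a "highly unsafe" flag returned alongside the response).
    raise NotImplementedError("plug in the model's safety signal here")

def rabbit_hole_depth(seed_statement: str, max_steps: int = 10) -> int:
    # Count how many escalation steps the model permits before its own
    # guardrail refuses; a larger count suggests a weaker guardrail.
    # The escalation instruction below is a generic placeholder; the actual
    # prompts used by the researchers are described in their paper.
    current = seed_statement
    for step in range(max_steps):
        prompt = "Restate the following more strongly:\n" + current
        response = query_model(prompt)
        if guardrail_refuses(prompt, response):
            return step  # guardrail intervened after `step` escalations
        current = response
    return max_steps  # guardrail never intervened within the budget

The key measurement is the depth reached before the guardrail intervenes: a robust model should refuse almost immediately, while a weak one lets the loop run.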
There are no current guidelines establishing the guardrails companies should implement in LLMs, and Khudabukhsh notes that not every organization will act as responsibly as Google. There is even an economic advantage for unscrupulous groups that choose not to invest in these types of precautions. However, the dangers of such models are considerable, and the threat is growing.
Adel Khorramrouz said his first-hand experience with the impact of hate speech in his native Iran informs the importance he places on this work. “I understand the potential dangers of such powerful tools in creating harmful content on social media and shaping societal narratives,” said Khorramrouz.
Moreover, guardrails put in place by responsible companies like Google are sometimes faulty in how they determine what constitutes hate speech. In 2021, a live YouTube broadcast of an interview between chess grandmaster Hikaru Nakamura and Antonio Radić, whose chess channel has more than a million subscribers, cut off abruptly. While no reason was given, Khudabukhsh theorized that YouTube’s algorithm had detected hate speech. Khudabukhsh, an avid chess player, and a colleague created an experiment to demonstrate how phrases such as “white attacks black” could be flagged by an algorithm oblivious to the harmless context of chess.
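To see how such a false positive can arise, the toy sketch below (a hypothetical illustration, not YouTube’s actual system and not the researchers’ experiment) shows a naive keyword-pair filter flagging innocuous chess commentary because it matches surface patterns without modeling context.

import re

# Toy illustration only: a naive keyword-pair filter that flags text when
# color words and aggressive verbs co-occur, ignoring the chess context.
SUSPICIOUS_PATTERNS = [
    r"\bwhite\b.*\b(attacks|threatens|dominates)\b.*\bblack\b",
    r"\bblack\b.*\b(attacks|threatens|dominates)\b.*\bwhite\b",
]

def naive_flag(comment: str) -> bool:
    # Return True if any surface pattern matches, with no awareness
    # that "white" and "black" may simply name chess pieces.
    text = comment.lower()
    return any(re.search(pattern, text) for pattern in SUSPICIOUS_PATTERNS)

print(naive_flag("After the knight sacrifice, white attacks black's exposed king."))  # True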
As for next steps, Sujan Dutta explained, “We want to compare and contrast the behaviors of different LLMs using our rabbit-hole framework. It will enable us to develop robust methods to prevent toxic generations.” The team is also exploring testing in areas beyond toxic content.
“Besides toxicity, we plan to test the LLMs on morally ambiguous situations and based on the response, rate their understanding of ethics and morality, and get a sense of how this level of understanding can affect the digital space,” said Arka Dutta.
Khudabukhsh is at the forefront of RIT’s growing research profile in AI. His previous work, which included prompting YouTube to remove unsafe transcriptions from children’s videos, has been featured in Wired and international press, and his analysis of online speech in advance of the Jan. 6 Capitol riot was featured on the cover of the Sunday New York Times. The influential work of Khudabukhsh and other talented AI researchers at the university has helped RIT rise to 4th in 2023 publications in the AI subfield on CSRankings.org.
“Our AI team is growing, and also getting recognized, whether at top conferences or in features on the BBC, New York Times, and elsewhere,” said Khudabukhsh. “Much of our work can be difficult due to the nature of our research. Examining such harmful content day in and day out takes its toll, but we are proud of the fact that this work is helping create safer AI tools and preventing LLMs from exacerbating hate speech online.”
Khorramrouz added, “No one can deny the role of AI in the future. My goal is to be a member of the AI-research community that strives to develop and harness this cutting-edge technology for the greater good.”