Addressing the Security and Privacy Challenges of Large Language Models

Business Security

Organizations wishing to harness the potential of LLMs must also be able to manage risks that could otherwise erode the technology’s business value.

November 6, 2023
•
,
5 minutes. read

Addressing the Security and Privacy Challenges of Large Language Models

Everyone is talking about ChatGPT, Bard and generative AI itself. But after the hype inevitably comes a return to reality. Even as business executives and IT leaders are buzzing about the disruptive potential of technology in areas such as customer service and software development, they are also increasingly aware of some of the potential downsides and risks to to watch.

In short, for organizations to harness the potential of large language models (LLMs), they must also be able to manage the hidden risks that could otherwise erode the technology’s business value.

What’s wrong with LLMs?

ChatGPT and other generative AI tools are powered by LLMs. They work by using artificial neural networks to process huge amounts of text data. After learning the patterns between words and how they are used in context, the model is able to interact in natural language with users. In fact, one of the main reasons for ChatGPT’s exceptional success is its ability to tell jokes, compose poems, and generally communicate in a way that’s hard to distinguish from a real human.

RELATED READING: Write Like a Boss with ChatGPT: How to Better Detect Phishing Scams

Generative AI models powered by LLM, as used in chatbots like ChatGPT, function like super-powered search engines, using the data they were trained on to answer questions and perform tasks with language-like language. that of humans. Whether publicly available models or proprietary models used internally within an organization, LLM-based generative AI can expose businesses to certain security and privacy risks.

5 of the main LLM risks

1. Excessive sharing of sensitive data

LLM-based chatbots aren’t good at keeping secrets – or forgetting them, for that matter. This means that any data you enter can be absorbed by the model and made available to others or at least used to train future LLM models. Samsung workers found out the hard way when they shared confidential information with ChatGPT while using it for work-related tasks. The code and meeting recordings they entered into the tool could theoretically be in the public domain (or at least stored for future use, like highlighted by the UK’s National Cyber Security Center recently). Earlier this year, we took a closer look at how organizations can avoid putting their data at risk when using LLMs.

2. Copyright Challenges

LLMs are trained on large amounts of data. But this information is often taken from the web, without the explicit permission of the content owner. This may create potential copyright issues if you continue to use it. However, it can be difficult to find the original source of specific training data, making it difficult to mitigate these issues.

3. Insecure code

Developers are increasingly turning to ChatGPT and similar tools to help them accelerate time to market. In theory, this can help by quickly and efficiently generating code snippets and even entire software programs. However, security experts warn that this can also create vulnerabilities. This is of particular concern if the developer does not have enough domain knowledge to know what bugs to look for. If buggy code then ends up in production, it could have a serious reputational impact and require time and money to fix.

4. Hack the LLM itself

Unauthorized access and tampering with LLMs could provide hackers with a range of options to carry out malicious activities, such as causing the model to disclose sensitive information via rapid injection attacks or perform other actions supposed to be blocked. Other attacks may involve exploiting server-side request forgery (SSRF) vulnerabilities in LLM servers, allowing attackers to extract internal resources. Malicious actors could even find a way to interact with confidential systems and resources simply by sending malicious commands via natural language prompts.

RELATED READING: Black Hat 2023: AI gets big prize money for defender

As an example, ChatGPT had to be taken offline in March following the discovery of a vulnerability that exposed the conversation history titles of certain users to other users. To raise awareness of vulnerabilities in LLM applications, the OWASP Foundation recently published a list of 10 critical security vulnerabilities commonly observed in these applications.

5. A data breach at the AI provider

There is always a risk that a company that develops AI models could itself become the victim of a breach, allowing hackers to steal, for example, training data that could include sensitive proprietary information. The same goes for data leaks, for example when Google was inadvertently leaked private chats with Bard in its search results.

What to do next

If your organization wants to start harnessing the potential of generative AI to gain a competitive advantage, it should first take certain steps to mitigate some of these risks:

Data encryption and anonymization: Encrypt data before sharing it with LLMs to protect it from prying eyes, and/or consider anonymization techniques to protect the privacy of individuals who may be identified in the data sets. Data sanitization can achieve the same goal by removing sensitive details from training data before it is fed into the model.
Improved access controls: Strong passwords, multi-factor authentication (MFA), and least privilege policies will help ensure that only authorized individuals have access to the generative AI model and back-end systems.
Regular security audits: This can help uncover vulnerabilities in your IT systems that may impact the LLM and generative AI models they are built on.
Practice incident response plans: A solid, well-prepared IR plan will help your organization respond quickly to contain, remediate and recover from any breach.
Thoroughly vet LLM providers: As with any provider, it is important to ensure that the company providing the LLM follows industry best practices regarding data security and privacy. Make sure it is clearly stated where user data is processed and stored, and whether it is used to train the model. How long does it keep? Is it shared with third parties? Can you accept or refuse that your data be used for training?
Make sure developers follow strict security guidelines: If your developers use LLMs to generate code, make sure they follow policies, such as security testing and peer review, to mitigate the risk of bugs appearing in production.

The good news is that there is no need to reinvent the wheel. Most of the tips above are tried and tested safety tips. They may need to be updated/tweaked for the AI world, but the logic behind them should be familiar to most security teams.

FURTHER READING: A Bard’s Tale: How Fake AI Bots Try to Install Malware