It’s an unfortunate truth that with the rise of AI tools comes an increase in malicious use, such as Prompt Injection Attacks. Prompt injection attacks trick an AI model into following hidden or nefarious instructions instead of the user’s real request. As a result, it may ignore its system safety rules and leak data, generate harmful content, or perform unauthorized actions through connected tools.
Below, we’ll take a look at how these attacks work and the main types you should know about. We’ll also cover the risks, some real-world incidents in 2025 and 2026, and practical ways users and organizations can stay safe from prompt injections.
What is a prompt injection attack, and how does it work?
A prompt injection is a type of cyberattack that manipulates large language models (LLMs) into ignoring built-in safety rules and performing unsafe actions—such as spreading misinformation, redirecting users to malware, or leaking sensitive data.
It’s not unlike an SQL injection, where an attacker gets a database to execute malicious SQL code by adding it to an input field. An attacker may write a harmful prompt directly into the chatbox, or hide it in documents, messages, websites, and other content the LLM can interact with.
What makes prompt injection attacks so successful is the fact that LLMs can’t reliably tell trusted instructions from untrusted text. Instead, all instructions (system, user, and external) are mashed together into a sequence of “tokens.”
The LLM mostly relies on patterns and formatting to recognize the instruction hierarchy. With careful crafting, injected instructions can blend in and compete with legit instructions.
Types of prompt injection attacks
Prompt injection attacks usually fall into two main groups: direct and indirect. Direct injections try to steer the model with instructions built into the prompt itself, while indirect injections hide those instructions inside content that the LLM reads later, like web pages, files, messages, or even images. Here’s a detailed look at both categories.
Direct prompt injections
Direct prompt injections involve putting malicious instructions straight into the chat or input field to try to override the model’s built-in rules. The goal is usually to force the AI to reveal restricted info (such as user logins), ignore safety limits, or follow commands it would normally refuse. Attackers often use these direct injection methods:
- Prompt manipulation: Attackers try to override the system prompt by telling the model to ignore rules, treat a fake task as urgent, or follow “formatting instructions” that actually change behavior and weaken safety checks.
- Code-based prompt injections: Code snippets and command-like text can nudge the model into unsafe actions, which only gets worse when the AI is connected to external tools, plugins, or browsers.
- Prompt leaking: Attackers push the model with repeated, carefully worded prompts until it begins to reveal hidden system instructions, internal settings, or sensitive training details. They then use what they learn for follow-up prompts that bypass guardrails more easily.
- Character-masking attacks: Changing characters, adding symbols and emojis, or mixing languages can hide malicious instructions in plain sight. This helps malicious prompts slip past filters that only scan for obvious keywords.
- Social engineering prompts: Attackers write prompts that sound harmless or emotionally loaded, such as pretending to be a developer, a manager, or a user in trouble. The goal is to pressure the model into breaking policy, sharing private info, or doing harmful actions “to help.”
Indirect prompt injections
Indirect prompt injections work more quietly by hiding instructions inside content that the AI reads, like emails, support tickets, and others. Instead of attacking the model head-on, the attacker waits for the AI to process the content and follow the embedded commands. Here are some common methods used:
- Payload splitting: Attackers split a malicious instruction into separate chunks across multiple messages or documents. Each piece looks harmless on its own, but once the model combines them in context, it reconstructs the full command and follows it.
- Multimedia prompt injections: Instructions are embedded within non-text formats such as images, PDFs, or transcripts. When the model extracts or reads that content, it may treat hidden text inside it as part of the task and act on it.
- Adversarial suffixes: Attackers add a long block of text to the end of a prompt, designed to confuse the model into ignoring safety rules. These often look like random tokens or weird formatting, but they’re tuned to push the model into unsafe output.
- Stealth formatting attacks: These use layout tricks, such as hidden characters, markdown styling, fake headers, or HTML-like structure to make malicious instructions appear like system messages or trusted content, which can affect how the model prioritizes them.
Prompt injection vs jailbreak: What’s the difference?
Prompt injection and jailbreak both try to make an AI ignore its rules, but they usually target different weak spots. Prompt injection focuses on hijacking instruction priority, while jailbreaks focus on getting forbidden output through clever wording.
A jailbreak often looks like roleplay, emotional pressure, or “just pretend” framing. The user keeps poking until the model gives restricted content, like weapon instructions or hacking steps, despite existing guardrails. In one example, a programmer convinced Discord’s AI chatbot to roleplay as their grandma and teach them how to make napalm.
Prompt injection goes further when the model reads outside content or uses third-party tools. A hidden command in a web page or file can redirect what the model does, leak data, or trigger unsafe actions, even if the user never asked directly.
The main risks of prompt injection attacks
Prompt injection can cause an AI to reveal private info, make it give misleading responses, or follow unsafe links and commands. Here’s how it can affect your system and users:
- Sensitive data leaks: Hidden instructions can push the model to reveal private info like system prompts, internal rules, or user data pulled from connected tools. Agentic browsers (e.g., ChatGPT Atlas, Perplexity Comet) and AI assistants like OpenClaw are particularly vulnerable due to their elevated permissions.
- Output manipulation: A hostile prompt can steer the model toward a fake answer, a biased summary, or a chosen product or policy. The massive Pravda disinformation campaign generated 3.6 million articles in 2024 solely to manipulate AI responses and spread propaganda.
- Malware redirects: A poisoned prompt can make the model point users to bad download links, fake support pages, or scam sites. That can send traffic to tools that drop malware or steal credentials.
- Harmful content generation: Models can be pushed to write threats, abuse, self-harm advice, or dangerous instructions.
- Malicious code generation: Prompt injection can lead AI to write phishing scripts, credential stealers, or find exploits to existing code. That gives an attacker a faster way to build tools that break into accounts or systems.
High-profile prompt injection attacks
Prompt injection attacks have already hit widely used AI products like Copilot and Claude. The examples below show how a simple malicious input can turn into a serious security incident.
- EchoLeak (2025): One malicious email sent to a user could cause Microsoft 365 Copilot to leak sensitive data through a zero-click prompt injection. The attack worked by embedding instructions inside content Copilot treated as trusted input.
- GitHub Copilot RCE (2025): Prompts hidden in repository comments could influence Copilot’s behavior into allowing code execution on developer systems.
- Cline/OpenClaw supply chain attack (2026): An attacker opened a GitHub issue with a malicious title. This tricked Cline’s AI issue triager (Claude) into executing hidden commands, which eventually allowed the attacker to steal publishing credentials. They then released a fake version of the Cline CLI that automatically installed OpenClaw on developers’ machines through automatic updates.
- Reprompt (2026): A specially crafted link could trigger data exfiltration from Copilot Personal, with no user involvement beyond the initial click.
How to prevent prompt injection attacks
Prompt injection is unlikely to go away, with even big players like OpenAI admitting as much. Therefore, it’s imperative that you learn to protect yourself. If you’re a regular user, your main focus should be on using trusted AI services, spotting sketchy prompts, and limiting what you share with the AI. Meanwhile, organizations need stricter controls, regular testing, and adequate guardrails for model use and data access.
Here are some practical tips to prevent prompt injection attacks in both cases:
For individuals
- Never share sensitive data: Treat every chat as if it could be logged or exposed. Keep passwords, ID numbers, bank info, private links, and work files out of the conversation, even if the chatbot sounds confident and seems “secure.”
- Stick to reputable AI services: Use well-known apps with clear privacy settings and security updates. Random browser bots and copycat sites can log everything you type, and some even add hidden prompts behind the scenes.
- Double-check unexpected replies: If you paste outside text and you get sudden requests for private info, downloads, or account access, pause for a moment. Report sketchy replies to the platform and don’t continue the chat.
- Update your software regularly: Keep your browser, extensions, and OS updated so malware and shady scripts have fewer ways in. Many prompt injection attacks pair with outdated plugins or compromised add-ons.
- Enable MFA or 2FA everywhere: Turn on multi-factor authentication for email, cloud storage, and work accounts. Even if someone steals a password through a scammy AI link, the extra login step can block them from getting in.
- Install a trusted antivirus: Use antivirus that actively scans downloads and blocks known phishing pages. If an AI ever redirects you to a bad site, you want protection that stops the damage before it starts.
For organizations
- Clean and validate inputs: Strip hidden text, remove weird markup, and block suspicious patterns before content reaches the model. Treat web pages, PDFs, and tickets like untrusted input, even when they look normal.
- Separate system rules from user content: Keep system prompts locked and never mix them into user-facing text. Use clear boundaries in your pipeline so the model can’t confuse an attacker’s content with trusted instructions.
- Limit what the LLM can access: Give the model the minimum data and tools it needs for the task. Restrict file access, block internal admin endpoints, and avoid giving it broad permissions like code execution or data deletion without human-in-the-loop approval.
- Scan outputs for leaks and unsafe content: Filter responses before users see them, especially when the model works with private data. Catch exposed secrets, internal notes, or dangerous instructions before they leave your system.
- Monitor tool calls and unusual behavior: Log every tool request and watch for patterns like repeated file reads, strange URLs, or large data pulls. A prompt injection often shows up as the model suddenly acting “too curious.”
- Stress-test your guardrails often: Run red-team tests with real attack prompts, messy inputs, and poisoned documents. Keep testing after every model update, because small changes can reopen holes you already patched.
Prompt injection recovery checklist
As with phishing and other scams, even the most robust defenses can backfire due to human error. Here’s how to proceed if your organization gets hit:
- Flag suspicious responses and actions: Freeze the conversation, save the full prompt chain, and capture the exact output. If the model triggered a web request, file lookup, database query, or API call, record what it accessed, what it returned, and what the user saw.
- Cut off risky system access right away: Disable browsing, file connectors, and API tools temporarily, then rotate keys and revoke tokens. If the model can reach email, cloud storage, or internal apps, lock those down first.
- Review logs and classify the attack: Trace where the malicious text came from, like a web page, PDF, ticket, or user prompt. Then classify the injection type clearly (direct, indirect, agentic, memory), so your team knows what failed and what to fix first.
- Fix weak spots in prompt handling: Patch the exact path that let the injection through, like untrusted HTML, tool output, or Retrieval Augmented Generation (RAG) content. Add filtering for hidden text, fake system tags, and instruction-like patterns.
- Log details for compliance reporting: Write down timestamps, affected users, accessed systems, and exposed data, then map the incident to a framework like MITRE ATT&CK or OWASP LLM Top 10. Keep copies of prompts and outputs, since legal teams and auditors need proof of what happened.
- Strengthen defenses after the incident: Update allowlists, tighten tool permissions, and add extra checks on tool calls and model output. Run the same attack again in testing so you can confirm the fix actually blocks it.
Prompt injection attacks: FAQs
What is an example of a prompt injection attack?
An example of a prompt injection attack with real-world effects was demonstrated by a group of researchers. The attack allowed them to control Gemini AI and have it turn off lights, roll up smart shutters, and activate a boiler simply by “poisoning” a Google Calendar invite.
Is ChatGPT vulnerable to prompt injection attacks?
ChatGPT can be vulnerable to prompt injection attacks when it reads untrusted text and treats it like a command. If you connect it to tools like Google Calendar, Slack, Google Drive, or others, attackers can force ChatGPT to leak data or take other unauthorized actions.
Is prompt injection a crime?
Prompt injection is only a crime if someone uses it to steal data, break into systems, or otherwise cause damage. Depending on the method, attackers could be charged under existing laws like the Computer Fraud and Abuse Act in the US, the Computer Misuse Act in the UK, or data protection regulations like the GDPR.
What is the success rate of a prompt injection attack?
Prompt injection success rates vary widely by model and setup. In tests against frontier models like Claude Opus 4.5 and Gemini 2.5 Pro, success rates of indirect prompt injections ranged from 0.5% to 8.5%.
Meanwhile, an MDPI-published study showed that agentic AI systems were way more vulnerable, with attack success rates reaching 50-84% depending on the agent’s level of autonomy and access to external tools.
Can prompt injections be prevented?
Prompt injections can be reduced with techniques like input filtering, output checks, strict prompt separation, least-privilege access, and careful tool design, but you still need ongoing testing because attackers keep changing the text they use.
Further reading: