LLMSecTest — security tests for LLM apps, written in pytest

Security tests for LLM apps, written in pytest.

The OWASP LLM Top 10 as ordinary pytest tests — one adapter for every provider, findings scored with CVSS, reports emitted as SARIF for CI. Open-source and MIT-licensed; pre-alpha, and built in the open.

Just pytest

No scanner off to the side. Security checks are pytest tests — same command, same build gate, same plumbing as your unit tests.

How it works »

Mapped & scored

Every probe is tied to an OWASP LLM Top 10 category and carries a CVSS v4.0 base score, so a finding is triageable, not a wall of text.

The coverage map »

CI-native

Reports emit as SARIF v2.1.0 — plus HTML, JSON and Markdown — the format code-scanning dashboards already read. No bespoke glue.

A sample finding »

Quick-start

Get a scan running in seconds.

Install from source while it's pre-alpha, point it at a model or your running app, and read the SARIF. Full walkthrough in the docs.

bash

~ $ pip install "git+https://github.com/wehnsdaefflae/llmsectest"
~ $ llmsectest --target anthropic:claude-3-5-haiku
~ $ llmsectest --target app:http://localhost:8000/chat   # your live app
~ $ llmsectest --target app:http://localhost:8000/chat \
      --report-formats=sarif,html,json,markdown
# => OWASP LLM Top 10 categories probed · findings written to results/<target>.sarif

OWASP LLM Top 10 — honest status

What's covered, and what isn't yet

The framework is real; coverage was filled in deliberately, category by category. Nothing was marked done before it was — all 10 of the OWASP LLM Top 10 (2025) categories are implemented and tested today. What remains is depth, not breadth. Every run reports any category it couldn't reach (a missing repo, model path or app marker) as an explicit skip, never a silent gap.

LLM01Prompt Injectiondone
LLM02Sensitive Information Disclosuredone
LLM03Supply Chaindone
LLM04Data & Model Poisoningdone
LLM05Improper Output Handlingdone
LLM06Excessive Agencydone
LLM07System Prompt Leakagedone
LLM08Vector & Embedding Weaknessesdone
LLM09Misinformationdone
LLM10Unbounded Consumptiondone

done implemented & tested — 10/10 · depth improvements continue on the roadmap

Roadmap — shipped, building, planned

Where it's going

Built in the open across the funding period. Each phase ships before the next is claimed — the status here tracks the real state of the code, not a wish-list.

01
Foundation
shipped
- pytest-native framework & plugin
- Unified adapter — OpenAI · Anthropic · Hugging Face · Ollama & LM Studio (local), with a fail-fast --preflight health check
- Reports — SARIF v2.1.0 · HTML · JSON · Markdown
- CVSS v4.0 scoring with OWASP mapping
- CLI & documentation site
02
OWASP coverage
shipped
- All 10 of 10 categories live — LLM01–LLM10 (complete OWASP LLM Top 10 2025 coverage)
- Black-box application testing — --target app:<url>, up to 8 categories with --app-prompt/-secret/-action/-canary/-rag-poison (LLM01/05/09/10 always-on, incl. bounded LLM10 flood + output-amplification probes)
- Supply-chain dependency scan — --repo <path> (LLM03)
- Data & model poisoning — offline serialization-opcode scan of model files for load-time code execution — --model-scan <path> (LLM04)
- Known-CVE lookup for pinned dependencies — --osv (OSV.dev)
- Red-team jailbreak set (JailbreakBench / AdvBench) — --redteam-set <csv> (LLM01)
- Over-refusal (false-refusal-rate) metric via the benign twins — --redteam-benign
- Standalone HTML reports from any SARIF file — --render-sarif <file.sarif>
- Vector & embedding weaknesses — black-box RAG retrieval exposure (--app-canary) and indirect prompt injection via a poisoned retrieved document (--app-rag-poison) (LLM08)
- Misinformation — black-box confabulation probes on guaranteed-nonexistent entities, scored by a non-circular disclaimer oracle (LLM09)
- Depth, not breadth — LLM08 white-box dimensions, a classifier refusal oracle (Llama-Guard / GLiGuard)
03
Depth & reports
in progress
- CycloneDX SBOM generation — --sbom <path> (LLM03)
- Deeper supply-chain analysis
- Embedding-inversion & stress tests
- PDF reports · remediation database · plugin API
04
Integrations & v1.0
planned
- CI/CD templates — GitHub Actions · GitLab CI · Jenkins
- Hardening & independent security audit
- v1.0 on PyPI · OWASP community submission

shipped in the code today · in progress being built now · planned ahead

Output — the target format

Findings you can act on

A run produces SARIF that drops straight into GitHub code scanning, GitLab, or any SARIF viewer — each finding carrying its OWASP category, CVSS vector and a remediation pointer. The snippet is the shape of the output the framework targets.

Illustrative — format target, not a recorded scan result.

report.sarif

{
  "ruleId": "LLM01-prompt-injection",
  "level": "error",
  "properties": {
    "owasp": "LLM01:2025 Prompt Injection",
    "cvss":  "CVSS:4.0/AV:N/AC:L/.../VC:H",
    "score": 9.2
  },
  "message": {
    "text": "System prompt recovered via instruction override."
  }
}

Built in the open, for the people shipping LLM features.

App developers, security leads and researchers — the repo is public and the roadmap is honest. Watch it, try the adapter, or open an issue.

github.com/wehnsdaefflae/llmsectest »

Just pytest

Mapped & scored

CI-native

Get a scan running in seconds.

What's covered, and what isn't yet

Where it's going

Foundation

OWASP coverage

Depth & reports

Integrations & v1.0

Findings you can act on

Built in the open, for the people shipping LLM features.