From 32 to 81: Rebuilding a Secret Detection Catalog for the AI API Era

April 20, 2026 · 10 min read · Secrets · Supply Chain

A leaked OpenAI API key in 2024 could cost a few hundred dollars in abused usage. The same leak in 2026 can cost tens of thousands within hours, once a scraper chains it into an agent loop. The economics of credential theft changed — and yet most secret-detection catalogs still read like they were written in 2019, with a handful of AWS patterns and a generic "secret" regex.

This post walks through how we rebuilt the SF365 SecretScanner catalog from 32 patterns to 81, and why nearly a quarter of the new additions are dedicated AI-API formats.

TL;DR: precision beats breadth. 81 hand-curated, vendor-aware patterns — pre-compiled at startup, grouped by provider, with explicit severity — outperform a generic shotgun regex on both signal and noise.

The problem with "generic" secret detection

Most naive scanners boil down to two regexes: (?i)(password|secret|api_key)\s*[=:]\s*['"][^'"]{8,}['"] and [a-f0-9]{32,}. They catch a lot, but the signal-to-noise ratio is terrible.

The shift in 2026 is that the generic approach does not just create alarm fatigue; it actively misses real credentials. An OpenAI project key (sk-proj-...) doesn't match the old patterns because whoever wrote the regex had never seen that format.
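To make the miss concrete, here is a minimal sketch that runs the two generic regexes quoted above against a line holding an obviously fake project key. The class and method names are hypothetical, and the OpenAI pattern is only the simplified shape used in this post's AI-API sample, not necessarily the production pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the SF365 scanner.
public static class GenericVsDedicated
{
    // The two generic patterns quoted in the text above.
    public static readonly Regex GenericAssignment = new Regex(
        @"(?i)(password|secret|api_key)\s*[=:]\s*['""][^'""]{8,}['""]",
        RegexOptions.Compiled, TimeSpan.FromSeconds(1));

    public static readonly Regex GenericHex = new Regex(
        @"[a-f0-9]{32,}",
        RegexOptions.Compiled, TimeSpan.FromSeconds(1));

    // Vendor-aware OpenAI pattern in the simplified shape of this post's sample.
    public static readonly Regex OpenAiKey = new Regex(
        @"sk-(?:proj-)?[A-Za-z0-9_-]{40,}",
        RegexOptions.Compiled, TimeSpan.FromSeconds(1));

    // True when either generic pattern fires on the line.
    public static bool GenericHit(string line) =>
        GenericAssignment.IsMatch(line) || GenericHex.IsMatch(line);

    // True when the dedicated OpenAI pattern fires on the line.
    public static bool DedicatedHit(string line) => OpenAiKey.IsMatch(line);
}
```

With a line such as const client = new OpenAI({ apiKey: "sk-proj-xxxx..." }), neither generic regex should fire: the variable is named apiKey rather than api_key, and the key is not a long hex run. The dedicated pattern does fire. A plain MD5 digest in a test fixture shows the opposite failure: the hex regex flags it even though it is not a credential.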

The new taxonomy — 15 groups, 81 patterns

We grouped the catalog by what the credential unlocks, because that is how a defender thinks about impact:

Group                   Count   Highlights
AWS                     5       Access Key, Temporary (ASIA), Secret, Session Token, MWS
Azure                   4       Storage Key, AD Client Secret, DevOps PAT, SAS Token
GCP                     4       API Key, OAuth Client, Service Account JSON marker + private_key_id
GitHub                  6       Classic PAT, Fine-grained PAT, OAuth (gho_), App Server (ghs_), User-to-Server (ghu_), Refresh (ghr_)
GitLab / Bitbucket      2       GitLab PAT, Bitbucket App Password
AI APIs                 8       OpenAI (+proj), Anthropic, Hugging Face, Replicate, Groq, Gemini, Cohere, Pinecone
Messaging / Chat        8       Slack bot+webhook, Discord bot+webhook, Telegram, Twilio SID/Key/Auth
Email                   4       SendGrid, Mailgun, Mailchimp, Postmark
Payments                6       Stripe (secret, restricted, webhook), Square, Shopify, PayPal Braintree
Monitoring              6       Datadog API+App, Sentry DSN, New Relic, Segment, PagerDuty
Infra / Cloud tooling   6       Cloudflare Token+Global, DigitalOcean, Heroku (key-prefixed), Firebase FCM, Algolia Admin
Package registries      6       npm, NuGet, PyPI, Docker Hub, RubyGems, Cargo
Databases               5       MongoDB / PostgreSQL / MySQL / Redis URIs with credentials, SQL Server connection strings
Private keys            6       RSA, EC, OpenSSH, Generic PKCS#8, Encrypted, PGP
Generic fallback        5       Hardcoded password, JWT secret, Bearer token, Basic auth in URL, generic assignment

Why AI APIs get their own section

Eight dedicated patterns for AI providers is a deliberate bet. Here is a sample of what they look like:

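// A sample of the AI-API patterns (simplified)
// The (?!ant-) lookahead keeps the OpenAI pattern from also matching Anthropic's sk-ant- keys.
new("OpenAI API Key",      "sk-(?!ant-)(?:proj-)?[A-Za-z0-9_-]{40,}")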
new("Anthropic API Key",   "sk-ant-(?:api\\d+-)?[A-Za-z0-9_-]{32,}")
new("Hugging Face Token",  "hf_[A-Za-z0-9]{34}")
new("Replicate API Token", "r8_[A-Za-z0-9]{37,}")
new("Groq API Key",        "gsk_[A-Za-z0-9]{52}")
new("Pinecone API Key",    "(?i)pinecone[._-]?api[._-]?key['\"\\s:=]+[0-9a-f-]{36}")

Precision over breadth

Adding patterns only helps if false positives go down, not up. Three design choices made that possible:

1. Require the key name for ambiguous formats

The old Heroku pattern was [0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12} — a plain GUID. Every GUID in the codebase lit up. The new Heroku pattern requires the prefix:

(?i)heroku[._-]?api[._-]?key['"\s:=]+[0-9a-fA-F]{8}-... 

Same regex class, but it only fires when a human actually named the value a Heroku key. False positives on UUID fixtures drop to zero.
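The effect is easy to check in a few lines. The sketch below is illustrative only: the class name is hypothetical, and because the new pattern is abbreviated above, the code assumes its tail is the same GUID shape as the old pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; not the scanner's own code.
public static class HerokuPatternDemo
{
    // GUID shape copied from the old pattern quoted above.
    private const string GuidShape =
        "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}";

    // Old behaviour: any GUID anywhere on the line matches.
    public static readonly Regex OldPattern = new Regex(
        GuidShape, RegexOptions.Compiled, TimeSpan.FromSeconds(1));

    // ASSUMPTION: the abbreviated tail of the new pattern is the same GUID shape.
    public static readonly Regex NewPattern = new Regex(
        @"(?i)heroku[._-]?api[._-]?key['""\s:=]+" + GuidShape,
        RegexOptions.Compiled, TimeSpan.FromSeconds(1));
}
```

Under that assumption, a fixture line like "id": "3f2504e0-4f89-11d3-9a0c-0305e82c3301" matches only the old pattern, while HEROKU_API_KEY=3f2504e0-4f89-11d3-9a0c-0305e82c3301 matches both.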

2. Exclude the usual noisy paths

Before scanning, the engine skips node_modules/, .git/, vendor/, dist/, archives, images, fonts, minified JS/CSS, and lockfiles. This alone cuts scan time on large repos by 60-70% and removes thousands of junk matches.
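A path filter along these lines could look like the sketch below. It is an assumption-heavy illustration: the directory names come from the paragraph above, but the specific extensions, the lockfile names, and the PathFilter class itself are hypothetical stand-ins for whatever the engine really uses.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical filter; the real engine's lists are not shown in this post.
public static class PathFilter
{
    private static readonly HashSet<string> SkippedDirs =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "node_modules", ".git", "vendor", "dist" };

    // ASSUMPTION: representative archive, image, and font extensions.
    private static readonly HashSet<string> SkippedExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { ".zip", ".tar", ".gz", ".png", ".jpg", ".gif", ".woff", ".woff2", ".ttf" };

    // ASSUMPTION: a few common lockfile names.
    private static readonly HashSet<string> Lockfiles =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "package-lock.json", "yarn.lock", "Cargo.lock", "packages.lock.json" };

    public static bool ShouldSkip(string relativePath)
    {
        string[] segments = relativePath.Split('/', '\\');
        string fileName = segments[segments.Length - 1];

        // Any excluded directory anywhere above the file skips the file.
        if (segments.Take(segments.Length - 1).Any(d => SkippedDirs.Contains(d)))
            return true;

        if (Lockfiles.Contains(fileName))
            return true;

        // Minified bundles are matched on their double extension.
        if (fileName.EndsWith(".min.js", StringComparison.OrdinalIgnoreCase) ||
            fileName.EndsWith(".min.css", StringComparison.OrdinalIgnoreCase))
            return true;

        return SkippedExtensions.Contains(Path.GetExtension(fileName));
    }
}
```

Checking the path before opening the file is what saves the time: an excluded file is never read, let alone matched against 81 patterns.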

3. Compile once, match many

Every pattern is converted to a compiled Regex instance at static init with a one-second timeout. The scanner then runs each line through those precompiled matchers instead of constructing a new Regex per match. Scanning a 100k-line repo went from ~4.2 seconds to ~1.6 seconds on the same hardware.

private static readonly List<CompiledPattern> Patterns = PatternDefinitions
    .Select(p => new CompiledPattern(
        p.Name,
        new Regex(p.Regex, RegexOptions.Compiled, TimeSpan.FromSeconds(1)),
        p.SecretType,
        p.Severity))
    .ToList();
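The per-line loop that consumes those compiled patterns might look like the following sketch. The record shapes, the string-typed SecretType and Severity, the Matcher property name, and the Finding type are all assumptions made for illustration. The one certain detail is that a Regex constructed with a timeout throws RegexMatchTimeoutException when the limit is exceeded, so the loop has to handle it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical shapes; the real CompiledPattern type is not shown in full.
public sealed record CompiledPattern(string Name, Regex Matcher, string SecretType, string Severity);
public sealed record Finding(string PatternName, string Severity, int LineNumber);

public static class LineScanner
{
    public static List<Finding> Scan(
        IReadOnlyList<CompiledPattern> patterns, IEnumerable<string> lines)
    {
        var findings = new List<Finding>();
        int lineNumber = 0;

        foreach (string line in lines)
        {
            lineNumber++;
            foreach (CompiledPattern pattern in patterns)
            {
                try
                {
                    // Reuses the precompiled instance; nothing is rebuilt per line.
                    if (pattern.Matcher.IsMatch(line))
                        findings.Add(new Finding(pattern.Name, pattern.Severity, lineNumber));
                }
                catch (RegexMatchTimeoutException)
                {
                    // A pathological line hit the timeout: skip this pattern
                    // for this line and keep scanning the rest of the file.
                }
            }
        }

        return findings;
    }
}
```

Swallowing the timeout is one possible policy; logging the file and pattern name would be a reasonable alternative if silent skips are a concern.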

Triage becomes a one-line decision

The difference between a useful scanner and noise is what a developer sees when something fires. With the new catalog, a finding carries its vendor name, a severity calibrated to the credential's power, and an AI remediation string telling the engineer exactly what to do:

Title:        Anthropic API Key detected
Severity:     Critical
Category:     Secret Exposure (Anthropic)
CWE:          CWE-798
OWASP:        A07:2021
File:         src/agents/brain.ts:42
Preview:      sk-a****0xP2           # first 4 + last 4, never the full token
Remediation:  Move this Anthropic key to a secure secret manager
              (Azure Key Vault, AWS Secrets Manager, HashiCorp Vault,
              1Password, Doppler). Never commit secrets to source
              control. Rotate the exposed credential immediately.
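The preview rule (first 4 + last 4, never the full token) fits in one small function. This is a minimal sketch with a hypothetical SecretPreview class; the handling of very short values is an added assumption, since the post does not say what the scanner does with them.

```csharp
using System;

// Hypothetical helper illustrating the "first 4 + last 4" preview rule.
public static class SecretPreview
{
    public static string Mask(string secret, int visible = 4)
    {
        if (secret is null) throw new ArgumentNullException(nameof(secret));

        // ASSUMPTION: values too short to show both ends safely are fully masked.
        if (secret.Length <= visible * 2)
            return new string('*', secret.Length);

        return secret.Substring(0, visible)
             + "****"
             + secret.Substring(secret.Length - visible);
    }
}
```

The fixed-width "****" in the middle also hides the token's length, so the preview gives a reviewer enough to recognize the key without leaking anything usable.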

What we deliberately left for phase 2

Takeaway

Secret detection is the cheapest security win a team can buy — if the catalog is maintained. A good rule of thumb: if your scanner's catalog doesn't include at least one dedicated pattern per AI provider you use, you are already behind. The blast radius of an AI API leak no longer looks like the blast radius of 2021.

Scan your repo with 81 precision patterns

Security Factor 365 ships this catalog as part of the Full scan and the Secrets scan. Free tier covers five applications.
