From 32 to 81: Rebuilding a Secret Detection Catalog for the AI API Era

A leaked OpenAI API key in 2024 could cost a few hundred dollars in abused usage. The same leak in 2026 can cost tens of thousands within hours, once a scraper chains it into an agent loop. The economics of credential theft changed — and yet most secret-detection catalogs still read like they were written in 2019, with a handful of AWS patterns and a generic "secret" regex.

This post walks through how we rebuilt the SF365 SecretScanner catalog from 32 patterns to 81, and why eight of those 81 are dedicated AI-API formats.

TL;DR: precision beats breadth. 81 hand-curated, vendor-aware patterns — pre-compiled at startup, grouped by provider, with explicit severity — outperform a generic shotgun regex on both signal and noise.

The problem with "generic" secret detection

Most naive scanners boil down to two regexes: (?i)(password|secret|api_key)\s*[=:]\s*['"][^'"]{8,}['"] and [a-f0-9]{32,}. They catch a lot, but the signal-to-noise ratio is terrible:

Every GUID written without hyphens lights up.
Every hashed test fixture looks like a credential.
You cannot tell an AWS Access Key from a Shopify token from a JWT.
When something does fire, the finding says "generic secret detected", which is unhelpful for triage.

The shift in 2026 is that the generic approach does more than create alarm fatigue: it actively misses real credentials. An OpenAI project key (sk-proj-...) contains no long hex run, so unless it happens to be assigned, in quotes, to something named password, secret, or api_key, neither pattern fires. The prefix simply did not exist when the regexes were written.
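To make the failure mode concrete, here is a minimal C# sketch that runs the two generic regexes quoted above against two sample lines. Only the two patterns come from the text; the class name, the sample lines, and the key value are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class GenericScannerDemo
{
    // The two "generic" regexes quoted above, unchanged.
    private static readonly Regex NamedAssignment = new Regex(
        @"(?i)(password|secret|api_key)\s*[=:]\s*['""][^'""]{8,}['""]");
    private static readonly Regex LongHex = new Regex("[a-f0-9]{32,}");

    // Hypothetical sample lines: an MD5 digest used as a test fixture,
    // and a made-up project-style key in an unquoted .env assignment.
    public const string HashFixture = "checksum = \"5d41402abc4b2a76b9719d911017c592\"";
    public const string EnvLine = "OPENAI_KEY=sk-proj-EXAMPLEexampleEXAMPLEexampleEXAMPLE0000";

    // True when either generic regex matches the line.
    public static bool Fires(string line) =>
        NamedAssignment.IsMatch(line) || LongHex.IsMatch(line);
}
```

Under these assumptions, Fires returns true for HashFixture (a harmless digest) and false for EnvLine, which is the only line of the two that actually holds a credential.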

The new taxonomy: 15 groups, 81 patterns

We grouped the catalog by what the credential unlocks, because that is how a defender thinks about impact:

Group	Count	Highlights
AWS	5	Access Key, Temporary (ASIA), Secret, Session Token, MWS
Azure	4	Storage Key, AD Client Secret, DevOps PAT, SAS Token
GCP	4	API Key, OAuth Client, Service Account JSON marker + private_key_id
GitHub	6	Classic PAT, Fine-grained PAT, OAuth (`gho_`), App Server (`ghs_`), User-to-Server (`ghu_`), Refresh (`ghr_`)
GitLab / Bitbucket	2	GitLab PAT, Bitbucket App Password
AI APIs	8	OpenAI (+proj), Anthropic, Hugging Face, Replicate, Groq, Gemini, Cohere, Pinecone
Messaging / Chat	8	Slack bot+webhook, Discord bot+webhook, Telegram, Twilio SID/Key/Auth
Email	4	SendGrid, Mailgun, Mailchimp, Postmark
Payments	6	Stripe (secret, restricted, webhook), Square, Shopify, PayPal Braintree
Monitoring	6	Datadog API+App, Sentry DSN, New Relic, Segment, PagerDuty
Infra / Cloud tooling	6	Cloudflare Token+Global, DigitalOcean, Heroku (key-prefixed), Firebase FCM, Algolia Admin
Package registries	6	npm, NuGet, PyPI, Docker Hub, RubyGems, Cargo
Databases	5	MongoDB / PostgreSQL / MySQL / Redis URIs with credentials, SQL Server connection strings
Private keys	6	RSA, EC, OpenSSH, Generic PKCS#8, Encrypted, PGP
Generic fallback	5	Hardcoded password, JWT secret, Bearer token, Basic auth in URL, generic assignment

Why AI APIs get their own section

Eight dedicated patterns for AI providers is a deliberate bet. Here is what changed:

New prefixes appear constantly. OpenAI alone introduced sk-proj-, organization-scoped keys, and service account keys within a 12-month window. A generic regex misses each one until it is manually updated.
The blast radius is asymmetric. Misuse of an AWS access key at least leaves a CloudTrail record to alert on; a leaked OpenAI key can be burned through before the first billing alert fires.
Workloads are hybrid. Teams run code that calls four model providers in the same request, and leaking any one of those keys is equally damaging.

// A sample of the AI-API patterns (simplified)
// (?!ant-) keeps Anthropic's sk-ant-... keys from also being reported as OpenAI keys.
new("OpenAI API Key",      "sk-(?!ant-)(?:proj-)?[A-Za-z0-9_-]{40,}"),
new("Anthropic API Key",   "sk-ant-(?:api\\d+-)?[A-Za-z0-9_-]{32,}"),
new("Hugging Face Token",  "hf_[A-Za-z0-9]{34}"),
new("Replicate API Token", "r8_[A-Za-z0-9]{37,}"),
new("Groq API Key",        "gsk_[A-Za-z0-9]{52}"),
// Pinecone keys are plain UUIDs, so the key name is required to avoid matching every GUID.
new("Pinecone API Key",    "(?i)pinecone[._-]?api[._-]?key['\"\\s:=]+[0-9a-f-]{36}"),
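As a rough illustration of how a vendor-aware catalog turns a match into a named finding, the sketch below attaches a group and a severity to three of the patterns listed above. The PatternDefinition record, the AiCatalogSketch class, and the severity values for Hugging Face and Groq are assumptions made for this example (C# 9 or later); the post itself only shows names and regexes.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical shape: Group and Severity are illustrative additions.
public sealed record PatternDefinition(string Name, string Group, string Severity, string Pattern);

public static class AiCatalogSketch
{
    // Three regexes copied from the sample above; severities are guesses.
    public static readonly IReadOnlyList<PatternDefinition> AiPatterns = new[]
    {
        new PatternDefinition("Anthropic API Key", "AI APIs", "Critical",
            @"sk-ant-(?:api\d+-)?[A-Za-z0-9_-]{32,}"),
        new PatternDefinition("Hugging Face Token", "AI APIs", "High",
            "hf_[A-Za-z0-9]{34}"),
        new PatternDefinition("Groq API Key", "AI APIs", "High",
            "gsk_[A-Za-z0-9]{52}"),
    };

    // Returns the name of every definition whose regex matches the line,
    // so a finding can say which vendor's credential was found.
    public static List<string> Identify(string line) =>
        AiPatterns
            .Where(p => Regex.IsMatch(line, p.Pattern))
            .Select(p => p.Name)
            .ToList();
}
```

A line containing hf_ followed by 34 alphanumeric characters comes back as "Hugging Face Token", while a line holding only a UUID yields an empty list. This version recompiles nothing explicitly but relies on the static Regex cache; a production scanner would more likely hold precompiled instances.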

Precision over breadth

Adding patterns only helps if false positives go down, not up. Three design choices made that possible:

1. Require the key name for ambiguous formats

The old Heroku pattern was [0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}, which is just a plain GUID. Every GUID in the codebase lit up. The new Heroku pattern requires the key name in front of the value:

(?i)heroku[._-]?api[._-]?key['"\s:=]+[0-9a-fA-F]{8}-...

The GUID match itself is unchanged, but the pattern now fires only when someone has actually named the value a Heroku API key. Bare UUID fixtures no longer match at all.
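The difference is easy to check side by side. The sketch below assumes that the elided tail of the new pattern is the same GUID shape as the old one; that assumption, the class name, and both sample lines are illustrative and not taken from the real catalog.

```csharp
using System.Text.RegularExpressions;

public static class HerokuPatternDemo
{
    // GUID shape, copied from the old pattern quoted above.
    private const string GuidShape =
        "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}";

    // Old behaviour: any GUID anywhere on the line matches.
    public static readonly Regex Old = new Regex(GuidShape);

    // Assumption: the "..." in the new pattern stands for the same GUID shape.
    public static readonly Regex KeyNamed = new Regex(
        @"(?i)heroku[._-]?api[._-]?key['""\s:=]+" + GuidShape);

    // Hypothetical sample lines.
    public const string Fixture = "var orderId = \"3f2504e0-4f89-11d3-9a0c-0305e82c3301\";";
    public const string RealLeak = "HEROKU_API_KEY=3f2504e0-4f89-11d3-9a0c-0305e82c3301";
}
```

Old matches both sample lines. KeyNamed matches only RealLeak, which is the behaviour the section describes.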

2. Exclude the usual noisy paths

Before scanning, the engine skips node_modules/, .git/, vendor/, dist/, archives, images, fonts, minified JS/CSS, and lockfiles. This alone cuts scan time on large repos by 60-70% and removes thousands of junk matches.
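A path filter along these lines could look like the following sketch. The directory names come from the paragraph above; the specific file extensions, the lockfile names, and the PathFilter class itself are illustrative guesses rather than the engine's actual lists (C# 9 or later for the target-typed new).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class PathFilter
{
    // Directory names from the text.
    private static readonly HashSet<string> SkippedDirs =
        new(StringComparer.OrdinalIgnoreCase) { "node_modules", ".git", "vendor", "dist" };

    // Illustrative extensions for archives, images, and fonts.
    private static readonly HashSet<string> SkippedExtensions =
        new(StringComparer.OrdinalIgnoreCase)
        { ".zip", ".gz", ".tar", ".png", ".jpg", ".gif", ".ico", ".woff", ".woff2", ".ttf" };

    // Illustrative lockfile names.
    private static readonly HashSet<string> Lockfiles =
        new(StringComparer.OrdinalIgnoreCase) { "package-lock.json", "yarn.lock", "Cargo.lock" };

    // Returns false for any path the scanner should skip before reading it.
    public static bool ShouldScan(string relativePath)
    {
        var segments = relativePath.Split('/', '\\');
        if (segments.Any(s => SkippedDirs.Contains(s))) return false;

        var fileName = segments[segments.Length - 1];
        if (Lockfiles.Contains(fileName)) return false;
        if (fileName.EndsWith(".min.js", StringComparison.OrdinalIgnoreCase) ||
            fileName.EndsWith(".min.css", StringComparison.OrdinalIgnoreCase)) return false;

        return !SkippedExtensions.Contains(Path.GetExtension(fileName));
    }
}
```

Filtering on the path alone means excluded files are never opened, which is where most of the scan-time saving would come from.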

3. Compile once, match many

Every pattern is compiled into a Regex instance once, at static initialization, with a one-second match timeout so that a pathological input cannot stall the scan. For each line, the scanner iterates over the precompiled matchers instead of constructing new Regex instances per match. Scanning a 100k-line repo went from ~4.2 seconds to ~1.6 seconds on the same hardware.

private static readonly List<CompiledPattern> Patterns = PatternDefinitions
    .Select(p => new CompiledPattern(
        p.Name,
        new Regex(p.Regex, RegexOptions.Compiled, TimeSpan.FromSeconds(1)),
        p.SecretType,
        p.Severity))
    .ToList();
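The per-line loop that consumes those compiled patterns might be sketched as follows. The CompiledPattern constructor order mirrors the snippet above, but typing SecretType and Severity as strings, the Finding record, and the choice to skip a pattern silently on timeout are all assumptions made for this example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Assumed shapes; the real types are not shown in the post.
public sealed record CompiledPattern(string Name, Regex Regex, string SecretType, string Severity);
public sealed record Finding(string Name, string Severity, int LineNumber);

public static class LineScanner
{
    // Runs every precompiled pattern against every line and records matches
    // with 1-based line numbers.
    public static List<Finding> Scan(
        IReadOnlyList<CompiledPattern> patterns, IEnumerable<string> lines)
    {
        var findings = new List<Finding>();
        var lineNumber = 0;
        foreach (var line in lines)
        {
            lineNumber++;
            foreach (var pattern in patterns)
            {
                try
                {
                    if (pattern.Regex.IsMatch(line))
                        findings.Add(new Finding(pattern.Name, pattern.Severity, lineNumber));
                }
                catch (RegexMatchTimeoutException)
                {
                    // The match budget was exceeded on this line; skip this
                    // pattern here rather than abort the whole scan.
                }
            }
        }
        return findings;
    }
}
```

The Regex constructor's timeout only takes effect because the loop catches RegexMatchTimeoutException; without the catch, one slow line would end the scan.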

Triage becomes a one-line decision

The difference between a useful scanner and noise is what a developer sees when something fires. With the new catalog, a finding carries its vendor name, a severity calibrated to the credential's power, and an AI remediation string telling the engineer exactly what to do:

Title:        Anthropic API Key detected
Severity:     Critical
Category:     Secret Exposure (Anthropic)
CWE:          CWE-798
OWASP:        A07:2021
File:         src/agents/brain.ts:42
Preview:      sk-a****0xP2           # first 4 + last 4, never the full token
Remediation:  Move this Anthropic key to a secure secret manager
              (Azure Key Vault, AWS Secrets Manager, HashiCorp Vault,
              1Password, Doppler). Never commit secrets to source
              control. Rotate the exposed credential immediately.
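
The preview line follows a simple first-4-plus-last-4 rule. A minimal sketch of such a masking helper is shown below; the SecretPreview class and the decision to mask short values completely are assumptions, not the scanner's documented behaviour.

```csharp
public static class SecretPreview
{
    // Shows the first 4 and last 4 characters with a fixed "****" in between,
    // so the preview never reveals the length or body of the token.
    public static string Mask(string secret)
    {
        if (string.IsNullOrEmpty(secret)) return string.Empty;

        // Assumption: values of 8 characters or fewer are masked entirely,
        // since showing both ends would reveal the whole secret.
        if (secret.Length <= 8) return new string('*', secret.Length);

        return secret.Substring(0, 4) + "****" + secret.Substring(secret.Length - 4);
    }
}
```

Applied to a made-up value such as sk-ant-api03-abcdefghijklmnop0xP2, Mask produces sk-a****0xP2, the same shape as the preview in the finding above.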

What we deliberately left for phase 2

Entropy scoring. A useful complement for unknown formats, but prone to false positives on valid data like base64-encoded images. We will layer it on top of the regex catalog, not replace it.
Live revocation. For detected AWS / GitHub / OpenAI keys, a future version will call the provider's revocation endpoint automatically when the developer confirms the exposure.
Historical git scanning. Right now we scan the working tree. Phase 2 walks git log so we can find credentials that were committed and later deleted but remain in history.
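
For the entropy-scoring item, the usual starting point is Shannon entropy over the characters of a candidate string. The sketch below shows that calculation only; the Entropy class is hypothetical, and any threshold for flagging a string would need tuning against real data.

```csharp
using System;
using System.Linq;

public static class Entropy
{
    // Shannon entropy in bits per character: -sum(p * log2(p)),
    // where p is the relative frequency of each distinct character.
    public static double BitsPerChar(string s)
    {
        if (string.IsNullOrEmpty(s)) return 0.0;

        return s.GroupBy(c => c)
                .Select(g => (double)g.Count() / s.Length)
                .Sum(p => -p * Math.Log(p, 2));
    }
}
```

A repeated character scores 0 bits and four distinct characters score 2 bits per character. Random tokens and base64-encoded binary data both score high, which is why entropy alone cannot separate a key from an embedded image.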

Takeaway

Secret detection is the cheapest security win a team can buy — if the catalog is maintained. A good rule of thumb: if your scanner's catalog doesn't include at least one dedicated pattern per AI provider you use, you are already behind. The blast radius of an AI API leak no longer looks like the blast radius of 2021.

Scan your repo with 81 precision patterns

Security Factor 365 ships this catalog as part of the Full scan and the Secrets scan. Free tier covers five applications.
