Section created: 2026-05-08
Self-hosted voice and audio AI surfaces — speech-to-text, text-to-speech, voice cloning, real-time voice agents, speaker diarization, music generation. Distinct from the broader model-serving survey (03-model-serving.md) because the abuse profile is different: a free-compute hit on a Whisper transcription endpoint is one class of harm; an unauth voice-cloning model is a substantively different class (deepfake-fraud, social engineering, voice impersonation).
Auth posture across this category skews Tier-A (“no auth concept” in the framework default). Most of these projects ship as research code or “self-host the demo” Gradio/Streamlit wrappers. Operators rarely add auth in front because most hosting tutorials don’t bother.
CVE / advisory watch
- CVE-2025-43842 through CVE-2025-43852 — RVC-WebUI (11x CVSS 9.8, RCE): GitHub Security Lab GHSL-2025-012 to GHSL-2025-022. Command injection via
preprocess_dataset, extract_f0_feature, click_train endpoints; code injection via eval() in change_info_; pickle RCE via torch.load(weights_only=False) in 6 model-handling functions. Chain: unauth Gradio API → malicious .pth model path → pickle deserialization → OS command execution. Every exposed RVC-WebUI instance is an RCE target.
- CVE-2025-49837 through CVE-2025-49841 — GPT-SoVITS (5x Critical RCE): GitHub Security Lab GHSL-2025-049 to GHSL-2025-053. Same command injection + pickle deserialization class. Port 9880, unauth FastAPI.
- CVE-2026-48710 “BadHost” — Starlette < 1.0.1 (auth bypass): Affects every FastAPI-wrapped TTS/ASR server — Kokoro-FastAPI, Orpheus-FastAPI, Chatterbox-TTS-Server, Parler-TTS, and any custom Whisper wrapper using Starlette path-based auth middleware. Host header injection bypasses auth without credentials. Patch: Starlette 1.0.1+. Research/homelab deploys (no reverse proxy) = fully exploitable.
- CVE-2025-23242 / CVE-2025-23243 — NVIDIA Riva ASR: Default config exposes HTTP 9000 + gRPC 50051 on 0.0.0.0; 54 cloud IPs identified by researchers. Unauthorized access to GPU resources and API key theft.
GHSA-pending — coqui-ai/TTS arbitrary file write via crafted speaker-embedding upload (advisory pending; older Coqui servers shipped with /api/tts accepting arbitrary file params).
- Trademark/identity abuse: voice-cloning servers serving celebrity-voice models without licensing fall under right-of-publicity (US) / GDPR Art. 9 biometric data (EU) — disclosure framing differs from typical security findings.
Known false-positive classes (filter at query time)
The keyword Whisper collides with several non-AI products that share the name. These are not voice/audio AI services and should be excluded from any http.title:"Whisper" survey to keep the result set clean.
Wake Forest WHISPER — clinical research portal (discovered 2026-05-08)
whisper.phs.wakehealth.edu (152.11.242.118) hosts a federally-funded clinical research data portal at Wake Forest University Health Sciences, Division of Public Health Sciences (PHS-BDS). ColdFusion-on-IIS, login-gated, federal-government-system warning banner. The acronym predates the OpenAI model. Cert SAN cluster covers 9 hostnames under *.phs.wakehealth.edu (whisper, guardian, ccrbis, mdsetaa, medsdb, oldphs, phs, web7a, libweb7).
Filter recipe: add -http.html:"wakehealth" -http.html:"WFUHS" -http.html:"phs.wakehealth" -http.html:"WHISPERLogo" -http.html:"actLogin.cfm" to broad http.title:"Whisper" queries.
Pre-filtered Shodan queries:
| Filtered Shodan Query | Notes |
|---|
http.title:"Whisper" -http.html:"wakehealth" -http.html:"WHISPERLogo" | Whisper title minus Wake Forest WHISPER |
http.title:"Whisper" -http.html:"actLogin.cfm" -http.html:"federal government" | Title minus US-fed-gov authorized-use banners |
http.title:"Whisper" "uvicorn" | Positive: only Whisper with uvicorn FastAPI signature (kicks out IIS/ColdFusion entirely) |
http.title:"Whisper" "fastapi" | Same approach — anchor on the actual stack |
http.title:"Whisper" -product:"Microsoft IIS" | Stack-level negative filter |
General methodology lesson
Repeating the lesson from Session 9’s Garak / Garakuta no Kamisama collision: a single keyword in http.title or http.html is unsound at population scale. Whisper, Garak, Bark, and Piper all collide with non-AI products. Anchor every keyword match to a structural signal that confirms the AI stack:
"uvicorn" or "fastapi" for the FastAPI Whisper variants
"whisper.cpp" literal for the C++ HTTP server
"openai-whisper-asr-webservice" for the canonical Docker image
body_contains:"/inference" + body_contains:"audio file" for the whisper.cpp UI template
- Kick out
Microsoft IIS, ColdFusion, .cfm, and government-banner strings
The aimap fingerprints in fingerprints.go already enforce this discipline (conjunctive matching on body_contains+status_code+json_field), but the Shodan dorks here are the first filter — the cleaner the dork, the less waste downstream.
Whisper ecosystem (ASR / Speech-to-Text)
OpenAI Whisper and its derivatives. Most expose /inference, /v1/audio/transcriptions, or a Gradio interface.
| Shodan Query | Notes |
|---|
http.title:"Whisper" | Broad — Whisper web UIs across all ports |
http.title:"Whisper" port:8080 | Default port for many Whisper deployments |
http.title:"Whisper" port:9000 | onerahmet/openai-whisper-asr-webservice default |
http.title:"Whisper" port:7860 | Gradio interface default |
http.title:"Whisper" -port:443 | Non-HTTPS only |
http.html:"faster-whisper" | faster-whisper accelerated derivative in HTML |
"faster-whisper" | Banner form |
http.html:"WhisperX" | WhisperX (word-level alignment) |
"whisper.cpp" "/inference" | whisper.cpp C++ HTTP server |
"whisper.cpp" port:8080 | whisper.cpp default port |
http.html:"whisper-streaming" | Whisper-Streaming live ASR |
http.html:"WhisperLive" | WhisperLive WebSocket transcription |
http.html:"insanely-fast-whisper" | Vaibhavs10 fast Whisper wrapper |
"openai-whisper-asr-webservice" | onerahmet’s standard webservice |
"openai-whisper-asr-webservice" port:9000 | Above + canonical port |
http.title:"Whisper" country:US | Geo-scoped |
http.title:"Whisper" country:DE | |
http.title:"Whisper" country:CN | |
http.title:"Whisper" org:"university" | Academic deployments |
http.title:"Whisper" org:"hospital" | Healthcare deployments (HIPAA-relevant) |
http.title:"Whisper" org:"amazon" | AWS deployments |
http.title:"Whisper" org:"hetzner" | Hetzner-hosted |
Vosk
Kaldi-based offline speech recognition.
| Shodan Query | Notes |
|---|
http.html:"vosk" | Vosk in HTML |
"vosk-server" | Vosk server banner |
"vosk-api" | Vosk API banner |
Coqui TTS family
Coqui Inc. shut down 2024 but XTTS-v2 model + servers remain widely deployed.
| Shodan Query | Notes |
|---|
http.html:"coqui" | Coqui in HTML — broadest |
"coqui" "tts" | Banner + term |
http.html:"xtts" | XTTS v2 deployments |
http.html:"XTTS" | Mixed-case |
"xtts-v2" | XTTS-v2 model name in banner |
http.title:"Coqui" | Coqui-themed pages |
http.title:"XTTS" | XTTS UI title |
port:5002 http.html:"tts" | Mozilla TTS / Coqui legacy default port |
port:8020 http.html:"tts" | Coqui XTTS server typical port |
http.html:"coqui" port:5002 | Coqui legacy on default port |
Piper
Rhasspy Piper — fast neural TTS popular on Raspberry Pi / edge.
| Shodan Query | Notes |
|---|
http.title:"Piper" | Piper UI title |
"piper" "tts" | Banner + term |
http.html:"piper-tts" | Hyphenated form in HTML |
"piper-http" | Piper HTTP wrapper |
Bark / MusicGen / AudioCraft
Suno Bark and Meta’s AudioCraft / MusicGen.
| Shodan Query | Notes |
|---|
http.title:"Bark" | Bark UI |
http.html:"suno-ai" | Suno-AI banner |
http.html:"audiocraft" | AudioCraft in HTML |
http.html:"MusicGen" | MusicGen UI |
http.html:"musicgen" | Lowercase |
"audiogen" | AudioGen |
OpenVoice
MyShell.ai OpenVoice — multi-language voice cloning.
| Shodan Query | Notes |
|---|
http.html:"openvoice" | OpenVoice in HTML |
http.title:"OpenVoice" | OpenVoice UI title |
"OpenVoice" "myshell" | OpenVoice + MyShell brand |
http.html:"se_extractor" | OpenVoice speaker-embedding extractor module |
F5-TTS / E2-TTS
Recent (2024-25) flow-matching TTS.
| Shodan Query | Notes |
|---|
http.html:"f5-tts" | F5-TTS in HTML |
http.html:"F5_TTS" | Underscore form |
http.html:"e2-tts" | E2-TTS sibling project |
"swivid/f5-tts" | HF model path in banner |
ChatTTS
2noise/ChatTTS — conversational TTS, viral mid-2024.
| Shodan Query | Notes |
|---|
http.html:"ChatTTS" | Brand title |
http.html:"chattts" | Lowercase |
http.title:"ChatTTS" | UI title |
"2noise" | GitHub org name often appears in source |
Tortoise TTS
| Shodan Query | Notes |
|---|
http.html:"tortoise-tts" | Tortoise in HTML |
http.title:"Tortoise" | UI title |
"tortoise-tts" | Banner |
StyleTTS2
| Shodan Query | Notes |
|---|
http.html:"StyleTTS" | StyleTTS in HTML |
http.html:"styletts2" | Version-2 lowercase |
Mozilla TTS / legacy
| Shodan Query | Notes |
|---|
http.html:"mozilla-tts" | Mozilla TTS legacy |
port:5002 http.html:"api/tts" | Default port + endpoint path |
Voice cloning / conversion
These are the highest-abuse-potential surfaces.
RVC (Retrieval-based Voice Conversion)
| Shodan Query | Notes |
|---|
http.html:"rvc-webui" | RVC WebUI |
http.html:"Retrieval-based-Voice-Conversion" | Full project name |
http.title:"RVC" | UI title |
http.html:"rvc-project" | Project banner |
http.html:"RVC-Boss" | RVC-Boss-GPT-SoVITS variant |
http.html:"GPT-SoVITS" | GPT-SoVITS (Mandarin-focused) |
http.title:"GPT-SoVITS" | UI title |
port:7865 http.html:"voice" | RVC default Gradio port |
port:7897 http.html:"voice" | GPT-SoVITS typical port |
so-vits-svc
| Shodan Query | Notes |
|---|
http.html:"so-vits-svc" | so-vits-svc in HTML |
http.title:"SoftVC" | SoftVC UI title |
Applio
| Shodan Query | Notes |
|---|
http.html:"Applio" | Applio (RVC fork) |
http.title:"Applio" | UI title |
Real-time voice agents
WebRTC + LLM + STT + TTS pipelines for live phone/video conversations. Highest-impact misuse class (live impersonation, scam-call automation).
Pipecat
Daily.co’s open-source voice-agent framework.
| Shodan Query | Notes |
|---|
http.html:"pipecat" | Pipecat in HTML |
http.title:"Pipecat" | UI title |
"pipecat-ai" | Project banner |
port:7860 http.html:"pipecat" | Default Gradio port |
LiveKit Agents
| Shodan Query | Notes |
|---|
http.html:"livekit-agents" | LiveKit Agents framework |
http.html:"livekit" | LiveKit broader |
http.title:"LiveKit" | UI title |
"livekit-server" | Server banner |
Vocode
| Shodan Query | Notes |
|---|
"vocode" | Vocode banner |
http.html:"vocode" | HTML form |
http.html:"vocode-python" | Python SDK form |
Retell AI / open-call
| Shodan Query | Notes |
|---|
http.html:"retell-ai" | Retell AI |
"retell-sdk" | SDK banner |
Speaker recognition / diarization
Pyannote
Pyannote speaker diarization servers.
| Shodan Query | Notes |
|---|
http.html:"pyannote" | Pyannote in HTML |
"pyannote/audio" | HF model path |
http.html:"pyannote-audio" | Hyphenated form |
SpeechBrain
| Shodan Query | Notes |
|---|
http.html:"speechbrain" | SpeechBrain framework |
"speechbrain" | Banner |
NeMo (NVIDIA)
| Shodan Query | Notes |
|---|
http.html:"nemo-toolkit" | NeMo toolkit |
http.html:"NVIDIA NeMo" | Brand form |
"nvidia-nemo" | Banner |
Aggregate / wrapper UIs
UIs that wrap multiple voice models behind one interface.
AI TTS Server (rsxdalv/tts-generation-webui)
| Shodan Query | Notes |
|---|
http.html:"tts-generation-webui" | rsxdalv project banner |
http.title:"TTS Generation Web UI" | UI title |
"/v1/audio/voices" | OpenAI-compatible voices listing |
port:10087 http.html:"audio" | rsxdalv default port |
Open-source ElevenLabs alternatives
| Shodan Query | Notes |
|---|
http.html:"elevenlabs-clone" | Clones |
http.html:"open-tts" | Generic open-tts banner |
"OpenAI-compatible" "audio/speech" | OpenAI-compatible TTS proxies |
Gradio / Streamlit voice-AI interfaces (port 7860)
Many voice-AI projects ship as Gradio demos. Port 7860 is shared with image-gen surveys; voice-specific filtering:
| Shodan Query | Notes |
|---|
port:7860 http.html:"audio" | Gradio + audio component |
port:7860 http.html:"speech" | Gradio + speech mention |
port:7860 http.html:"voice" | Gradio + voice mention |
port:7860 http.html:"tts" | Gradio + TTS |
port:7860 http.html:"ASR" | Gradio + ASR |
port:7860 http.html:"clone" | Gradio + voice-cloning UI |
Kokoro TTS (NEW — 2026-05-28)
Kokoro-FastAPI: OpenAI-compatible TTS, port 8880, no auth, multiple Docker images in the wild.
| Shodan Query | Notes |
|---|
port:8880 http.html:"Kokoro" | Primary — Kokoro-FastAPI Swagger/docs page |
port:8880 http.html:"/dev/captioned_speech" | Near-zero-FP — project-unique endpoint path |
http.title:"Kokoro" port:8880 | Swagger UI title anchor |
port:8880 http.html:"/v1/audio/voices" | OpenAPI schema reference |
Verification probe: GET /debug/system → JSON with CPU/GPU metrics. Unique to kokoro-fastapi — no other TTS server exposes this path.
Chatterbox TTS (NEW — 2026-05-28)
Zero-shot voice cloning, 15.9K GitHub stars, multiple Docker forks. Two distinct deployment surfaces.
| Shodan Query | Notes |
|---|
port:8000 http.html:"chatterbox" | devnen/Chatterbox-TTS-Server (port 8000) |
port:4123 http.html:"Chatterbox" | travisvn/chatterbox-tts-api (port 4123) |
http.title:"Chatterbox TTS" | Title match either variant |
Verification probe (devnen): GET /api/model-info → JSON with "engine" field (e.g., "chatterbox-turbo").
Severity elevated: /upload_reference (voice cloning) is unauth on both variants.
Orpheus-FastAPI (NEW — 2026-05-28)
3B-param Llama TTS, 8 voices, emotion tags, HN front-page release. Port 8899 near-unique.
| Shodan Query | Notes |
|---|
port:8899 http.html:"Orpheus" | Primary |
port:8899 http.html:"/v1/audio/speech" | OpenAPI path |
http.title:"Orpheus TTS" | Swagger title |
WhisperLive WebSocket (NEW — 2026-05-28)
Real-time streaming ASR via WebSocket. Distinct from batch Whisper — separate fingerprint class.
| Shodan Query | Notes |
|---|
port:9090 "WhisperLive" | WebSocket port + product name |
port:8000 http.html:"WhisperLive" | REST companion port |
port:9090 http.html:"nearly-live implementation" | README text in HTML |
Verification probe: WebSocket connect to :9090 → send {"uid":"x","language":"en","task":"transcribe","model":"tiny.en"} → server responds {"uid":"x","message":"SERVER_READY"}. SERVER_READY string is definitive.
Deepgram Self-Hosted (UPDATED — 2026-05-28)
Runtime auth is OFF — NGC key only gates image pull. HTTP API including /v1/status and /v1/listen requires no per-request auth once container is running.
| Shodan Query | Notes |
|---|
port:8080 http.html:"system_health" http.html:"active_batch_requests" | Near-zero-FP — unique JSON schema |
http.html:"active_stream_requests" http.html:"active_listen_v2_stream_requests" | Secondary field pair |
port:8080 port:9991 | Two-port co-presence (API + engine) |
Verification probe: GET /v1/status → JSON with "system_health" field. No auth required. Field name is unique to Deepgram on-prem.
NVIDIA NIM ASR (Parakeet / Canary) (NEW — 2026-05-28)
Enterprise ASR. HTTP 9000 + gRPC 50051. Runtime auth OFF per-request. CVE-2025-23242/23243 on older Riva deployments.
| Shodan Query | Notes |
|---|
port:9000 port:50051 http.html:'"status":"ready"' | Two-port co-presence + health response |
port:9000 http.html:"NIM" | NIM banner in headers |
port:50051 | gRPC only — narrow by country/org to reduce noise |
| Shodan Query | Notes |
|---|
http.html:"/v1/audio/speech" -openai | NEW HIGH-YIELD — catches entire OpenAI-compat TTS category (Kokoro + Orpheus + Chatterbox + Parler + Dia + Voxtral) in one query |
(http.title:"Whisper" OR http.title:"Coqui" OR http.title:"Piper" OR http.title:"Bark") | Major TTS/ASR umbrella |
(http.html:"openvoice" OR http.html:"chattts" OR http.html:"f5-tts" OR http.html:"xtts") | New-generation voice cloning umbrella |
(http.html:"rvc-webui" OR http.html:"GPT-SoVITS" OR http.html:"so-vits-svc" OR http.html:"Applio") | Voice-cloning umbrella |
(http.html:"pipecat" OR http.html:"vocode" OR http.html:"livekit-agents") | Real-time voice-agent umbrella |
(http.title:"Whisper" OR http.html:"coqui") country:US | Geographic scoping |
(http.title:"Whisper" OR http.html:"coqui") org:"hospital" | Healthcare exposure (HIPAA) |
(http.title:"Whisper" OR http.html:"coqui") org:"university" | Academic exposure |
port:9000 port:11434 | Compound stack: Whisper ASR webservice + Ollama on same host |
Probe semantics for live verification
| Endpoint pattern | Confirms |
|---|
GET /v1/audio/voices → JSON {"voices":[...]} | OpenAI-compatible TTS server (rsxdalv, AI TTS Server) |
POST /v1/audio/transcriptions (no auth) → 200 with text | Unauth Whisper-compatible ASR — confirmed compute theft |
POST /api/tts (Coqui legacy) → audio bytes | Coqui TTS legacy compute open |
POST /api/tts/speakers (Coqui XTTS) → speakers list | XTTS speaker enumeration |
GET /info (TGI / TEI / Whisper variants) → model JSON | Model identification |
GET /system_stats (Pipecat / Gradio) → GPU info | Resource fingerprint |
WS /audio or /listen (real-time) → 101 Upgrade | Voice-agent live channel |
Methodology notes
- Voice-AI servers frequently expose VRAM/GPU info through Gradio’s
/system_stats-style endpoints. Combine with comfyui-cloud-survey-2026-05.md’s VRAM-tally methodology to estimate compute exposed.
- HIPAA risk: Whisper transcription servers in healthcare orgs may be processing patient encounters. Treat any
org:"hospital"-tagged hit as PHI-relevant; case studies must scrub transcript content.
- Right-of-publicity / biometric data: Voice-cloning servers loaded with celebrity speaker embeddings represent a different harm class. Document the existence of speaker libraries; do not attempt to enumerate or trigger generation.
- Real-time agent abuse vector: Pipecat / LiveKit-Agents / Vocode unauth endpoints can be hijacked to make outbound calls (the framework handles SIP/twilio/daily.co integration). Document the surface; do not invoke.
Verified dork results + FP traps (re-run 2026-05-29)
Category 17 re-run via Playwright (Shodan API keys dead). 15 dorks. Key lesson:
the high-severity voice-AI API servers are Shodan-dark behind JSON-only roots
(Insight #67). Only demo UIs index.
Confirmed-useful (real instances)
| Dork | Total | Yield |
|---|
http.title:"Chatterbox TTS" | 18 | CLEAN — all real Chatterbox web UIs (ports 8004/4123/8000). Voice-clone surface. Use this, NOT the html-keyword form. |
http.html:"/v1/audio/speech" -openai | 12 | Highest-yield cross-platform OpenAI-compat TTS; selects on API contract. uvicorn on tier-2 cloud. |
"whisper.cpp" "/inference" | 12 | CLEAN conjunctive — real whisper.cpp ASR (ports 9000/8081/8083/8085). |
port:8880 http.html:"Kokoro" | 2 | Real Kokoro demo pages (Swagger/web UI only). |
http.html:"xtts" | 34 | ~50% rule — real XTTS UIs mixed with FP (lang-learning apps, GARR research, KR trading site). Needs title anchor. |
FP traps (do NOT re-run these / filter required)
| Dork | Total | Trap |
|---|
http.html:"chatterbox" | 96 | FP SWAMP. Collides with chatterboxwalls.com (photo wall art), entermediadb DAM product, LexisNexis Digital Library, TSLM Dashboard. Single-keyword collision (Garak/Whisper lesson). Use http.title:"Chatterbox TTS". |
http.html:"rvc-webui" | 4 | FP — all Beijing Volcano Engine (ByteDance). Primary source: :8000/openapi.json title = 北京open ai relay 服务器 (Beijing OpenAI RELAY server, LLM proxy), NOT RVC. The rvc-webui string was incidental HTML. Would falsely confirm 11x-CVSS-9.8 RCE. Verify against :7865/ Gradio root, not the html string. |
http.html:"so-vits-svc" | 2 | FP — both CN music-platform marketing pages (Auralink, stfdlnb.cn). so-vits-svc proper is Gradio/JSON, Shodan-dark. |
http.html:"GPT-SoVITS" | 22 | brand-dork, page-1 mostly FP (PixivTranslate, dir listings, AI SaaS). Ports 80/8800/8000, NOT 9880. The mentions index; the API does not. |
Shodan-dark (0 results = the RCE/PII surfaces are unmappable passively — Insight #67)
| Dork | Total | Why dark |
|---|
http.html:"GPT-SoVITS" port:9880 | 0 | API JSON-only root; 5x critical CVE surface invisible |
port:8899 http.html:"Orpheus" | 0 | Orpheus API JSON-only |
http.title:"Orpheus TTS" | 0 | variant space exhausted — Orpheus fully dark |
port:8880 http.html:"/dev/captioned_speech" | 0 | Kokoro unique path not in indexed HTML |
http.html:"system_health" http.html:"active_batch_requests" | 0 | Deepgram /v1/status JSON not crawled |
port:9090 "WhisperLive" | 0 | WS JSONL not indexed; 9090 = Prometheus territory |
To survey the RCE population (GPT-SoVITS/RVC): masscan ports 9880/7865/7860 across tier-2 cloud, fingerprint by JSON API shape. Shodan only shows the demo-UI minority.