National Taiwan University: CSIE MVNL Lab, Llama-3.3-70B vLLM (FP8, 2-Engine), NuClide engagement record

NuClide Research · 2026-05-03

Summary

mvnl-nas.csie.ntu.edu.tw (140.112.91.209) in NTU’s Computer Science and Information Engineering (CSIE) department exposes vLLM on port 8080 serving nvidia/Llama-3.3-70B-Instruct-FP8, NVIDIA’s FP8-quantized Llama 3.3 70B, without authentication. The server runs two parallel engines (tensor-parallel multi-GPU), has processed 237 requests and 450K prompt tokens since deployment. Context is constrained to 6,000 tokens, suggesting controlled evaluation use.

Infrastructure

Field	Value
IP	140.112.91.209
Hostname	mvnl-nas.csie.ntu.edu.tw
Organization	NTU CSIE, MVNL Lab (Machine Vision and Natural Language)
Network	Taiwan MOE TANet (140.112.0.0/16)
vLLM version	`0.18.2rc1.dev73+gdb7a17ecc` (very recent dev build)
Port	8080/tcp public
Auth	None

Model Inventory

Model	Engines	max_model_len	Notes
`nvidia/Llama-3.3-70B-Instruct-FP8`	2 (tensor parallel)	6,000	FP8 quantized, NVIDIA packaging

Two engines indicate tensor-parallel deployment across multiple GPUs, likely 2× H100 or A100 for 70B FP8. Context deliberately limited to 6K tokens (base supports 128K+), suggesting experimental or benchmark configuration.

Prometheus Metrics

Metric	Engine 0	Engine 1	Total
`request_success_total[stop]`	130	106	236
`request_success_total[length]`	1	0	1
`prompt_tokens_total`	248,381	202,223	450,604
`generation_tokens_total`	13,839	11,509	25,348
`prefix_cache_queries_total`	248,381	202,223	450,604
`prefix_cache_hits_total`	10,352	10,880	21,232
Prefix cache hit rate	,	,	~4.7%
Avg prompt length	~1,910 tokens	~1,908 tokens	~1,909
Avg generation length	~106 tokens	~108 tokens	~107

Traffic pattern: 237 requests at ~1,909 avg prompt tokens → ~107 avg output tokens. The low cache hit rate (4.7%) indicates diverse inputs, not repeated batch evaluation but varied query content. Prompt/generation ratio (17:1) suggests question-answering or instruction-following tasks, consistent with MVNL research workloads.

Findings

F1: Unauthenticated vLLM Inference on 70B Model (HIGH)

curl http://140.112.91.209:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"nvidia/Llama-3.3-70B-Instruct-FP8",
       "messages":[{"role":"user","content":"Hello"}],
       "max_tokens":100}'

Inference on a 70B FP8 model running on NTU GPU cluster, billable against NTU’s GPU allocation.

F2: Prometheus /metrics Exposed (MEDIUM)

GET http://140.112.91.209:8080/metrics, unauthenticated. Exposes request counts, token volumes, per-engine latency, cache statistics, and the model name nvidia/Llama-3.3-70B-Instruct-FP8.

The 2-engine configuration is visible in metric labels (engine="0", engine="1"), revealing the tensor-parallel topology of the cluster hardware.

Context: MVNL Lab

The hostname mvnl-nas suggests this machine is both NAS (network attached storage) and inference node for MVNL, the Machine Vision and Natural Language lab at CSIE NTU. The lab conducts research in computer vision, NLP, and multimodal AI. Llama-3.3-70B-Instruct is a strong general instruction model; its presence with FP8 quantization and constrained context (6K) is consistent with evaluation runs comparing quantization methods or context-length impact on benchmarks.

NTU also has: g1pc2n108.g1.ntu.edu.tw (140.112.233.108), separate Ollama node on NTU GPU Cluster 1, 11 vision/multimodal models. That node is documented in ntu-gpu.md.

Remediation

# Bind to localhost and use reverse proxy with auth:
vllm serve nvidia/Llama-3.3-70B-Instruct-FP8 \
  --host 127.0.0.1 --port 8080 \
  --tensor-parallel-size 2

# Or add API key:
vllm serve nvidia/Llama-3.3-70B-Instruct-FP8 \
  --api-key <secret> \
  --tensor-parallel-size 2

Disclosure

Discovered: 2026-05-03
Status: Pending outreach, CSIE NTU security contact