Most recent
navigate open esc close Corpus index built 2026-06-07 23:58 UTC

← All engagement records

Case study May 3, 2026

National Taiwan University: CSIE MVNL Lab, Llama-3.3-70B vLLM (FP8, 2-Engine)

Sector
Universities
Country
ntu

NuClide Research · 2026-05-03


Summary

mvnl-nas.csie.ntu.edu.tw (140.112.91.209) in NTU’s Computer Science and Information Engineering (CSIE) department exposes vLLM on port 8080 serving nvidia/Llama-3.3-70B-Instruct-FP8, NVIDIA’s FP8-quantized Llama 3.3 70B, without authentication. The server runs two parallel engines (tensor-parallel multi-GPU), has processed 237 requests and 450K prompt tokens since deployment. Context is constrained to 6,000 tokens, suggesting controlled evaluation use.


Infrastructure

FieldValue
IP140.112.91.209
Hostnamemvnl-nas.csie.ntu.edu.tw
OrganizationNTU CSIE, MVNL Lab (Machine Vision and Natural Language)
NetworkTaiwan MOE TANet (140.112.0.0/16)
vLLM version0.18.2rc1.dev73+gdb7a17ecc (very recent dev build)
Port8080/tcp public
AuthNone

Model Inventory

ModelEnginesmax_model_lenNotes
nvidia/Llama-3.3-70B-Instruct-FP82 (tensor parallel)6,000FP8 quantized, NVIDIA packaging

Two engines indicate tensor-parallel deployment across multiple GPUs, likely 2× H100 or A100 for 70B FP8. Context deliberately limited to 6K tokens (base supports 128K+), suggesting experimental or benchmark configuration.


Prometheus Metrics

MetricEngine 0Engine 1Total
request_success_total[stop]130106236
request_success_total[length]101
prompt_tokens_total248,381202,223450,604
generation_tokens_total13,83911,50925,348
prefix_cache_queries_total248,381202,223450,604
prefix_cache_hits_total10,35210,88021,232
Prefix cache hit rate,,~4.7%
Avg prompt length~1,910 tokens~1,908 tokens~1,909
Avg generation length~106 tokens~108 tokens~107

Traffic pattern: 237 requests at ~1,909 avg prompt tokens → ~107 avg output tokens. The low cache hit rate (4.7%) indicates diverse inputs, not repeated batch evaluation but varied query content. Prompt/generation ratio (17:1) suggests question-answering or instruction-following tasks, consistent with MVNL research workloads.


Findings

F1: Unauthenticated vLLM Inference on 70B Model (HIGH)

curl http://140.112.91.209:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"nvidia/Llama-3.3-70B-Instruct-FP8",
       "messages":[{"role":"user","content":"Hello"}],
       "max_tokens":100}'

Inference on a 70B FP8 model running on NTU GPU cluster, billable against NTU’s GPU allocation.

F2: Prometheus /metrics Exposed (MEDIUM)

GET http://140.112.91.209:8080/metrics, unauthenticated. Exposes request counts, token volumes, per-engine latency, cache statistics, and the model name nvidia/Llama-3.3-70B-Instruct-FP8.

The 2-engine configuration is visible in metric labels (engine="0", engine="1"), revealing the tensor-parallel topology of the cluster hardware.


Context: MVNL Lab

The hostname mvnl-nas suggests this machine is both NAS (network attached storage) and inference node for MVNL, the Machine Vision and Natural Language lab at CSIE NTU. The lab conducts research in computer vision, NLP, and multimodal AI. Llama-3.3-70B-Instruct is a strong general instruction model; its presence with FP8 quantization and constrained context (6K) is consistent with evaluation runs comparing quantization methods or context-length impact on benchmarks.

NTU also has: g1pc2n108.g1.ntu.edu.tw (140.112.233.108), separate Ollama node on NTU GPU Cluster 1, 11 vision/multimodal models. That node is documented in ntu-gpu.md.


Remediation

# Bind to localhost and use reverse proxy with auth:
vllm serve nvidia/Llama-3.3-70B-Instruct-FP8 \
  --host 127.0.0.1 --port 8080 \
  --tensor-parallel-size 2

# Or add API key:
vllm serve nvidia/Llama-3.3-70B-Instruct-FP8 \
  --api-key <secret> \
  --tensor-parallel-size 2

Disclosure

  • Discovered: 2026-05-03
  • Status: Pending outreach, CSIE NTU security contact