On-prem LLM serving

NSD-LLM-Serving

On-prem multimodal and reasoning LLM serving infrastructure with measured operational stability.

Multimodal and reasoning models are served entirely on in-house infrastructure, with no external LLM calls. In a 9-hour soak run there were 0 hard 5xx errors, and all 1,711 transient errors were recovered by client retry. Compared with the prior generation: +11.9% throughput, −11.0% p50 latency, and a +34 percentage-point prefix cache hit rate.

Mini-dashboard · indicators only

Verified signals. Trends, comparisons, and status only; some absolute figures are withheld.

0 / 9h

Hard 5xx errors

1,711 transients · 100% client-retry recovery

4.98 r/s

Sustained throughput

single instance · N=150 · concurrency 50 (C=50)

78.4%

Prefix cache hit

+34 pp vs prior generation

8.14 s

p50 latency

−11.0% vs prior generation

Throughput · before vs after (relative)

prior config: 1 · current: 1.119

+11.9% sustained throughput vs prior generation · same hardware (RTX A6000 ×2)

p50 latency · before vs after (relative · lower is better)

prior config: 1 · current: 0.89

−11.0% p50 latency · 8.14 s current

Prefix cache hit rate

78.4%

+34 pp vs prior

Warm load profile · sustained benchmark
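The dashboard mixes two kinds of deltas: relative changes in percent (throughput, p50 latency) and an absolute change in percentage points (prefix cache hit rate). The sketch below shows how the two differ, using only the figures published on this page. The helper names are hypothetical, and the implied prior hit rate is an approximation, since the published +34 pp may be rounded.

```python
def pct_change_from_ratio(ratio):
    """Convert a relative ratio (current / prior) into a percent change."""
    return (ratio - 1.0) * 100.0


def pp_change(prior_rate_pct, current_rate_pct):
    """Percentage-point change between two rates that are already in percent."""
    return current_rate_pct - prior_rate_pct


# Relative figures from the before/after charts.
throughput_delta = round(pct_change_from_ratio(1.119), 1)  # +11.9 (%)
latency_delta = round(pct_change_from_ratio(0.89), 1)      # -11.0 (%)

# Hit rate: 78.4% current, +34 pp vs prior, so the implied prior is about 44.4%.
current_hit = 78.4
implied_prior_hit = round(current_hit - 34.0, 1)

# The same improvement expressed as a relative percent change is much larger
# than 34, which is why the page reports it in percentage points instead.
hit_relative_delta = round((current_hit / implied_prior_hit - 1.0) * 100.0, 1)

print(throughput_delta, latency_delta, implied_prior_hit, hit_relative_delta)
```

Reading "+34 pp" as "+34%" would understate the change: against the implied baseline, the relative improvement is roughly 77%.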

Hard 5xx errors · 9-hour window

Flat at zero · 1,711 transient errors auto-recovered by client retry
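The page states that transient errors were recovered by client retry but does not describe the retry policy. As a minimal sketch of one common approach, assuming exponential backoff with jitter, the code below retries a call until it succeeds or attempts run out. `TransientError`, `call_with_retry`, `make_flaky_sender`, and all parameter values are hypothetical and not taken from the NSD client.

```python
import random
import time


class TransientError(Exception):
    """Raised by the hypothetical sender when a retryable failure occurs."""

    def __init__(self, status):
        super().__init__(f"transient status {status}")
        self.status = status


def call_with_retry(send, max_attempts=5, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is reached.

    Returns (result, retries_used). Re-raises the last TransientError
    if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return send(), attempt
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))


def make_flaky_sender(failures):
    """Stand-in for a real request: fails `failures` times, then succeeds."""
    state = {"left": failures}

    def send():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientError(503)
        return {"status": 200}

    return send


# Two transient failures followed by success; sleeping is skipped for the demo.
result, retries = call_with_retry(make_flaky_sender(2), sleep=lambda s: None)
print(result, retries)
```

Under a policy like this, a request counts as a hard failure only when the retry budget is exhausted, which is one way a run can log many transient errors while still reporting zero hard 5xx.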

Included NSD products

  • NSD-LLM-Multimodal-Serve-v1
  • NSD-LLM-Reasoning-14B-Serve-v1
  • NSD-LLM-Reasoning-32B-Serve-v1
  • NSD-AgenticRAG-V1

9-hour soak · N=150 sustained-load benchmark · concurrency 50.
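For readers who want to reproduce this style of measurement, here is a rough sketch of a fixed-concurrency load run that reports throughput and p50 latency. It assumes N=150 means 150 total requests and that concurrency 50 means at most 50 in flight; the page does not spell out either. `fake_request` is a hypothetical stand-in that only sleeps, so the numbers it prints say nothing about the real service.

```python
import asyncio
import random
import statistics
import time


async def fake_request(rng):
    """Hypothetical stand-in for one inference call; sleeps a few milliseconds."""
    await asyncio.sleep(rng.uniform(0.001, 0.005))


async def run_benchmark(n_requests=150, concurrency=50, request=fake_request,
                        seed=0):
    rng = random.Random(seed)
    gate = asyncio.Semaphore(concurrency)  # caps requests in flight
    latencies = []
    in_flight = 0
    peak = 0

    async def worker():
        nonlocal in_flight, peak
        async with gate:
            in_flight += 1
            peak = max(peak, in_flight)
            # Latency is timed after the slot is acquired, so time spent
            # waiting for a free slot is excluded from the measurement.
            start = time.perf_counter()
            try:
                await request(rng)
            finally:
                in_flight -= 1
            latencies.append(time.perf_counter() - start)

    wall_start = time.perf_counter()
    await asyncio.gather(*(worker() for _ in range(n_requests)))
    wall = time.perf_counter() - wall_start
    return {
        "completed": len(latencies),
        "peak_in_flight": peak,
        "throughput_rps": len(latencies) / wall,   # completed requests per second
        "p50_latency_s": statistics.median(latencies),
    }


report = asyncio.run(run_benchmark())
print(report)
```

Swapping `fake_request` for a real client call would turn this into a usable harness. Whether queueing time counts toward latency is a design choice that should match how the published p50 was measured.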