Yeti Claw

Mission Control | TensorRT-LLM migration dispatch

Mission Control banner art for the Spark TensorRT-LLM migration dispatch

Mission Control Dispatch | Published May 18, 2026

Spark text lane migration: Ollama to TensorRT-LLM on DGX Spark

We moved Spark’s public text lane off the legacy Ollama path and onto TensorRT-LLM, but only after documenting the failures that showed up along the way. The direct pip routes were not stable on GB10. The working production answer ended up being NVIDIA’s official NGC release container, which gave us a clean OpenAI-compatible endpoint, lower concurrency-4 latency than the old lane, and a real throughput gain under load, while sacrificing single-request speed and temporarily shrinking the public Spark text catalog to one model.

Download The PDF Download Artifact Bundle Back To Mission Control
Concurrency-4 throughput 56.29 tok/s

Up from 38.47 tok/s on the legacy Ollama lane under the matched c4 test.

Concurrency-4 avg latency 6815.54 ms

Down from 8738.93 ms when four requests were active simultaneously.

Single-request avg latency 6997.54 ms

Up from 2541.67 ms, which is the main tradeoff of the new lane.

Spark text catalog 5 → 1

The first public cutover focuses Spark on one TRT-served Qwen3 lane instead of five mixed text entries.

Executive read

What we learned from the transition

  • TensorRT-LLM on Spark is real, but the supported route is the NGC release container, not an ad-hoc pip install.
  • The new text lane is slower for a single request, but it scales materially better once four requests are active.
  • Peak sampled GPU temperature fell from 65 C to 63 C and peak power fell from 49.73 W to 37.1 W during the matched c4 run.
  • The migration does not disturb the Spark image path, which still runs through the separate hosted-image workflow.
  • The price of this cleaner runtime is a narrower Spark text catalog until more TRT-served models are added.

Architecture

What changed in the public text path

Before the migration, Spark text requests shared the same Ollama lane that also exposed legacy and multimodal text inventory. After the migration, the public VRRP tier points Spark text to a dedicated TensorRT-LLM OpenAI-compatible endpoint on port 8000 while leaving the Spark image workflow untouched. That split is the cleanest part of the transition: text inference became easier to reason about operationally even though the bring-up was harder than expected.

Before diagram showing Yeti Claw routing Spark text to Ollama and Spark image through SSH diffusers
Before: Spark text shared the Ollama lane while images used the separate SSH + diffusers path.
After diagram showing Yeti Claw routing Spark text to TensorRT-LLM and Spark images unchanged
After: Spark text moves onto TensorRT-LLM on port 8000 while hosted Spark image generation remains unchanged.

Performance

Before-and-after numbers under matched load

We reran the exact concurrency ladder against the same Spark hardware using the same prompt shape and the same 96-token completion cap. That gives us a fair operator comparison between the old Ollama lane and the new TensorRT-LLM endpoint. The story is mixed but clear: single-request latency regressed, concurrency-2 throughput still trails the old path, and concurrency-4 is where the new lane finally starts paying back the migration work.

Concurrency Before avg ms After avg ms Before tok/s After tok/s Read
1 2541.67 6997.54 37.76 13.72 Single-request latency is worse on the new lane.
2 4667.74 6797.99 38.56 28.20 Still slower than Ollama under light parallelism.
4 8738.93 6815.54 38.47 56.29 TRT-LLM wins once the Spark text lane is actually busy.
Latency chart comparing Ollama and TensorRT-LLM on Spark by concurrency
Latency regresses at c1 and c2, then improves at c4.
Throughput chart comparing Ollama and TensorRT-LLM on Spark by concurrency
Throughput only wins decisively once the new lane is pushed into concurrency-4 territory.

Environmentals

Thermals and power held steady through the new lane

The migration did not create a new thermal problem on Spark. In the matched concurrency-4 window, GPU utilization stayed effectively saturated in both eras, while the new TRT lane sampled slightly lower peak temperature and materially lower peak power draw. That matters because it tells us the migration’s pain was software compatibility, not heat or raw platform instability.

Chart comparing Spark GPU temperature, power, and utilization during the concurrency-4 before and after runs
Peak sampled environmentals moved from 65 C / 49.73 W / 95% util to 63 C / 37.1 W / 96% util.

Compatibility findings

Why the first two migration paths failed

The paper trail matters here. We did not land on the final container path by preference alone. We got there because the direct pip options failed in two different ways on Spark’s GB10 platform. The newer `1.2.x` wheel line did not import cleanly against the live CUDA/Torch packaging. The older `0.20.0` fallback could be forced to import, but then it failed at runtime with unsupported-kernel errors in FlashInfer, fused attention, and finally even the vanilla fallback path. That is exactly the kind of migration detail committees need to see so they understand what is “supported” versus merely “possible.”

  • `1.2.1` pip lane: import failure around `libth_common.so` and PyTorch symbol mismatch.
  • `0.20.0` pip fallback: importable only after Spark-specific CUDA shims, then failed in FlashInfer and GB10 attention kernels.
  • `1.3.0rc12.post1` NGC container: successful model load, warmup, autotune, CUDA graph prep, and live `/v1/models` endpoint.

The downloadable artifact bundle includes the compatibility notes, logs, diagrams, benchmark JSON, and the PDF version of this dispatch for committee review.

Recommendation

How we should operate the Spark text lane from here

  • Keep Spark text on the TensorRT-LLM release container path, not a custom pip environment.
  • Treat this lane as a higher-concurrency public worker, not the fastest single-request chat surface.
  • Keep the Mac mini and BeastMode lanes available for lighter or lower-latency text traffic.
  • Expand Spark’s TRT-served catalog only after each additional model is benchmarked through the same Mission Control workflow.