{
  "slug": "spark-trtllm-migration-may-2026",
  "title": "Spark text lane migration: Ollama to TensorRT-LLM on DGX Spark",
  "publishedDate": "2026-05-18",
  "missionControlSection": "Capacity",
  "prompt": "In exactly six bullet points, give six practical tips for keeping an inference server responsive under moderate concurrency. Keep each bullet under twelve words.",
  "beforeModel": "qwen3:8b",
  "afterModel": "Qwen/Qwen3-8B via TensorRT-LLM 1.3.0rc12.post1 (NGC release container)",
  "comparison": [
    {
      "concurrency": 1,
      "before_avg_ms": 2541.67,
      "after_avg_ms": 6997.54,
      "latency_delta_pct": 175.31268811450738,
      "before_tps": 37.76,
      "after_tps": 13.72,
      "tps_delta_pct": -63.66525423728814
    },
    {
      "concurrency": 2,
      "before_avg_ms": 4667.74,
      "after_avg_ms": 6797.99,
      "latency_delta_pct": 45.637717610663834,
      "before_tps": 38.56,
      "after_tps": 28.2,
      "tps_delta_pct": -26.867219917012452
    },
    {
      "concurrency": 4,
      "before_avg_ms": 8738.93,
      "after_avg_ms": 6815.54,
      "latency_delta_pct": -22.00944509224814,
      "before_tps": 38.47,
      "after_tps": 56.29,
      "tps_delta_pct": 46.32180920197557
    }
  ],
  "beforePeak": {
    "peak_temp": 65.0,
    "peak_power": 49.73,
    "peak_util": 95.0
  },
  "afterPeak": {
    "peak_temp": 63.0,
    "peak_power": 37.1,
    "peak_util": 96.0
  },
  "modelCounts": {
    "before_public_spark_models": 5,
    "after_public_spark_models": 1
  },
  "pros": [
    "Spark text is now served through a dedicated TensorRT-LLM endpoint instead of sharing the same Ollama lane used for legacy text workflows.",
    "The final cutover uses NVIDIA\u2019s official release container on Spark, which is a cleaner and more supportable path than a hand-built pip environment.",
    "TensorRT-LLM exposes OpenAI-compatible `/v1/chat/completions`, `/health`, and `/v1/models`, which gives the public stack a cleaner operator surface.",
    "The Spark image path is unchanged, so the text migration does not disturb the existing hosted-image workflow.",
    "The working path on Spark stayed inside NVIDIA\u2019s current Blackwell-oriented TensorRT-LLM release container instead of the unsupported pip combinations that failed earlier.",
    "At concurrency 4, aggregate completion throughput rose from 38.47 tok/s on Ollama to 56.29 tok/s on TensorRT-LLM (+46.3%), and average latency fell 22.0%."
  ],
  "cons": [
    "The Spark text lane is now a second runtime to operate: separate container, sidecar port, model hydration path, and health checks all add surface area.",
    "The first-time startup path is heavier than Ollama because it must hydrate a large Hugging Face checkpoint before the API becomes ready.",
    "The direct pip path was not production-safe on Spark: newer 1.2.x wheels failed to import cleanly, and older fallback builds exposed unsupported-kernel failures on GB10 before the container path succeeded.",
    "The public Spark text catalog narrowed from 5 models to 1 model during the first cutover."
  ],
  "headline": {
    "best_after_avg_ms": 6797.99,
    "best_after_concurrency": 2,
    "best_after_tps": 56.29,
    "best_after_tps_concurrency": 4
  },
  "downloads": {
    "pdf": "/downloads/mission-control/spark-trtllm-migration-20260518/committee-review-spark-trtllm-migration-20260518.pdf",
    "zip": "/downloads/mission-control/spark-trtllm-migration-20260518/spark-trtllm-migration-20260518-artifacts.zip"
  },
  "articleUrl": "https://chat.neonflux.co/mission-control/spark-trtllm-migration-may-2026/"
}
