<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://ganeshmohane.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://ganeshmohane.github.io/" rel="alternate" type="text/html" /><updated>2026-05-16T16:21:27+00:00</updated><id>https://ganeshmohane.github.io/feed.xml</id><title type="html">Ganesh Mohane</title><subtitle>ML Engineer at AISOLO | Building Privacy AI &amp; ML Infrastructure</subtitle><author><name>Ganesh Mohane</name><email>ganeshmohane5@gmail.com</email></author><entry><title type="html">🚀 Understanding CPU, CUDA &amp;amp; TensorRT Runtimes</title><link href="https://ganeshmohane.github.io/posts/cpu-cuda-tensorrt/" rel="alternate" type="text/html" title="🚀 Understanding CPU, CUDA &amp;amp; TensorRT Runtimes" /><published>2026-05-16T00:00:00+00:00</published><updated>2026-05-16T00:00:00+00:00</updated><id>https://ganeshmohane.github.io/posts/cpu-cuda-tensorrt</id><content type="html" xml:base="https://ganeshmohane.github.io/posts/cpu-cuda-tensorrt/"><![CDATA[<h2 id="-why-this-matters">🧠 Why this matters</h2>

<p>When I started working on real ML systems, I thought:</p>

<blockquote>
  <p><strong>“Model is trained → just run inference”</strong></p>
</blockquote>

<p><strong>Reality:</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Model performance = model + runtime + hardware + optimization
</code></pre></div></div>

<p>The same model can run at vastly different speeds depending on the runtime:</p>
<ul>
  <li><strong>CPU</strong>: 200ms ❌ (unusable)</li>
  <li><strong>CUDA</strong>: 80ms ⚡ (good)</li>
  <li><strong>TensorRT</strong>: 40ms 🚀 (production-ready)</li>
</ul>

<p>This is why understanding runtimes is critical for deploying ML systems at scale.</p>
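
<p>To see where numbers like these come from, you can time the same model under each runtime. Here is a minimal benchmarking sketch in Python; run_inference and batch are placeholders for whatever session and input you are measuring, not a specific library API.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time

def benchmark(run_inference, batch, warmup=10, iters=100):
    """Average latency in milliseconds of run_inference(batch)."""
    for _ in range(warmup):           # warm-up: caches, allocators, lazy init
        run_inference(batch)
    start = time.perf_counter()
    for _ in range(iters):
        run_inference(batch)          # for GPU runtimes, make sure this call
                                      # blocks until the result is ready
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1000.0
</code></pre></div></div>

<p>Always include a warm-up phase: the first few calls pay one-time initialization costs (memory allocation, kernel loading) that would otherwise inflate the average.</p>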

<hr />

<h2 id="️-what-is-a-runtime">⚙️ What is a runtime?</h2>

<p>Think of a runtime as the <strong>execution engine</strong> that translates your model into hardware instructions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Model (ONNX/PyTorch) → Runtime → Hardware (CPU/GPU)
</code></pre></div></div>

<p>Different runtimes are optimized for different scenarios:</p>
<ul>
  <li>Want portability? Use the CPU runtime</li>
  <li>Want speed? Use CUDA</li>
  <li>Want maximum optimization? Use TensorRT (see the sketch after this list)</li>
</ul>
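
<p>With ONNX Runtime, for example, switching between these three is mostly a one-line change to the execution-provider list. A sketch, assuming onnxruntime-gpu (built with TensorRT support) is installed and model.onnx is a placeholder for your exported model:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import onnxruntime as ort

# Providers are tried in priority order; ONNX Runtime falls back to the
# next provider for any operator the preferred one cannot handle.
session = ort.InferenceSession(
    "model.onnx",
    providers=[
        "TensorrtExecutionProvider",   # maximum optimization (needs TensorRT)
        "CUDAExecutionProvider",       # speed (needs an NVIDIA GPU)
        "CPUExecutionProvider",        # portability (always available)
    ],
)

# The input name and array depend on your model; both are placeholders here.
outputs = session.run(None, {"input": input_batch})
</code></pre></div></div>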

<hr />

<h2 id="️-cpu-runtime">🖥️ CPU Runtime</h2>

<p><strong>How it works:</strong> Runs inference on the CPU, spreading work across cores with multi-threading and vectorized (SIMD) instructions; see the thread-tuning sketch at the end of this section.</p>

<p><strong>Pros:</strong></p>
<ul>
  <li>✅ Works everywhere (no GPU required)</li>
  <li>✅ Easy to debug</li>
  <li>✅ Good for development &amp; testing</li>
  <li>✅ Consistent across machines</li>
</ul>

<p><strong>Cons:</strong></p>
<ul>
  <li>❌ Much slower than a GPU for deep learning workloads</li>
  <li>❌ Only tens of cores, so far less parallelism than a GPU offers</li>
  <li>❌ Usually too slow for real-time inference on larger models</li>
</ul>

<p><strong>When to use:</strong> Development, prototyping, or when you don’t have a GPU</p>
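
<p>If you do run on CPU, thread settings are the main performance knob. A small sketch with ONNX Runtime; the best values depend on your core count and workload, so the numbers below are only illustrative:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 4   # threads used inside a single operator
opts.inter_op_num_threads = 1   # threads used across independent operators

session = ort.InferenceSession(
    "model.onnx",               # placeholder path
    sess_options=opts,
    providers=["CPUExecutionProvider"],
)
</code></pre></div></div>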

<hr />

<h2 id="-cuda-runtime">⚡ CUDA Runtime</h2>

<p><strong>How it works:</strong> Leverages an NVIDIA GPU’s massively parallel architecture (thousands of cores) to run tensor operations simultaneously</p>
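
<p>In PyTorch this is the familiar .to("cuda") pattern. A sketch, with model and x as placeholders for your network and input batch; the main pitfall is that GPU kernels launch asynchronously, so you must synchronize before trusting any timing:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

# model and x are placeholders: a trained nn.Module and an input tensor.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()      # copy weights to GPU memory once

with torch.no_grad():                # inference only: skip autograd bookkeeping
    out = model(x.to(device))

if device == "cuda":
    torch.cuda.synchronize()         # wait for all queued kernels to finish
</code></pre></div></div>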

<p><strong>Pros:</strong></p>
<ul>
  <li>✅ ~10-20x faster than CPU</li>
  <li>✅ Great for real-time applications</li>
  <li>✅ Mature ecosystem (PyTorch, TensorFlow support)</li>
  <li>✅ Works with any NVIDIA GPU</li>
</ul>

<p><strong>Cons:</strong></p>
<ul>
  <li>❌ Requires NVIDIA GPU</li>
  <li>❌ Not as optimized as TensorRT</li>
  <li>❌ Uses more memory</li>
</ul>

<p><strong>When to use:</strong> Most production ML services, gaming, research</p>

<hr />

<h2 id="-tensorrt-runtime">🚀 TensorRT Runtime</h2>

<p><strong>How it works:</strong> NVIDIA’s deep learning inference optimizer that:</p>
<ol>
  <li><strong>Fuses layers</strong> — Combines multiple ops into one</li>
  <li><strong>Optimizes kernels</strong> — Finds the fastest GPU kernel for each operation</li>
  <li><strong>Reduces precision</strong> — Uses FP16 or INT8 instead of FP32 (2-4x faster; see the build sketch after this list)</li>
</ol>
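
<p>All three steps happen while the engine is being built, before any request is served. A minimal build sketch with the TensorRT Python API (roughly the 8.x-style API; model.onnx and model.engine are placeholder paths):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 &lt;&lt; int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Parse the trained model into a TensorRT network definition.
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)   # allow reduced-precision kernels

# Layer fusion and kernel auto-tuning happen inside this build call.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
</code></pre></div></div>

<p>The resulting engine file is what you deploy; at serving time you only deserialize and execute it.</p>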

<p><strong>Pros:</strong></p>
<ul>
  <li>✅ <strong>2-5x faster</strong> than CUDA</li>
  <li>✅ Extreme optimization for inference only</li>
  <li>✅ Production-grade reliability</li>
  <li>✅ Minimal memory footprint</li>
</ul>

<p><strong>Cons:</strong></p>
<ul>
  <li>❌ Requires compilation (model → TensorRT engine)</li>
  <li>❌ Only works on NVIDIA hardware</li>
  <li>❌ Steeper learning curve</li>
  <li>❌ Engines are model-, GPU-, and TensorRT-version-specific, so they must be rebuilt when any of those change</li>
</ul>

<p><strong>When to use:</strong> Production ML APIs, edge deployment (Jetson), high-throughput systems</p>

<hr />

<h2 id="️-runtime-comparison">⚖️ Runtime Comparison</h2>

<table>
  <thead>
    <tr>
      <th>Aspect</th>
      <th>CPU</th>
      <th>CUDA</th>
      <th>TensorRT</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Latency (example)</strong></td>
      <td>200ms</td>
      <td>80ms</td>
      <td>40ms</td>
    </tr>
    <tr>
      <td><strong>GPU Required</strong></td>
      <td>No</td>
      <td>Yes</td>
      <td>Yes</td>
    </tr>
    <tr>
      <td><strong>Learning Curve</strong></td>
      <td>Easy</td>
      <td>Medium</td>
      <td>Hard</td>
    </tr>
    <tr>
      <td><strong>Production Ready</strong></td>
      <td>❌</td>
      <td>✅</td>
      <td>✅✅</td>
    </tr>
    <tr>
      <td><strong>Memory Usage</strong></td>
      <td>High</td>
      <td>Medium</td>
      <td>Low</td>
    </tr>
    <tr>
      <td><strong>Flexibility</strong></td>
      <td>High</td>
      <td>High</td>
      <td>Medium</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="-real-world-insight">🔥 Real-world Insight</h2>

<p>At AISOLO, we built computer vision models for real-time processing:</p>

<ul>
  <li><strong>CPU runtime</strong> → Model couldn’t keep up with incoming streams</li>
  <li><strong>CUDA runtime</strong> → Could process the streams, but with noticeable latency</li>
  <li><strong>TensorRT</strong> → Achieved real-time processing with room to spare</li>
</ul>

<p>The difference between CPU and TensorRT? <strong>A matter of business viability.</strong></p>

<hr />

<h2 id="-final-takeaway">🧩 Final Takeaway</h2>

<p><strong>Just knowing the model isn’t enough.</strong> You need to understand:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Model Architecture (ResNet, YOLO, etc.)
2. Runtime (CPU, CUDA, TensorRT)
3. Hardware (CPU type, GPU type, RAM)
4. Optimization (quantization, pruning, distillation)
</code></pre></div></div>
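
<p>Optimization deserves its own post, but as a taste, post-training dynamic quantization in PyTorch is only a few lines. A sketch, with model as a placeholder for a trained network; it mainly helps Linear/LSTM-heavy models on CPU, and accuracy should always be re-checked afterwards:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

# Store Linear weights in int8 and quantize activations on the fly at
# inference time; everything else in the model is left untouched.
quantized_model = torch.quantization.quantize_dynamic(
    model,                  # placeholder: your trained model
    {torch.nn.Linear},      # layer types to quantize
    dtype=torch.qint8,
)
</code></pre></div></div>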

<p><strong>The winning formula:</strong></p>
<ul>
  <li>Use <strong>TensorRT</strong> for production inference</li>
  <li>Use <strong>CUDA</strong> for research &amp; development</li>
  <li>Use <strong>CPU</strong> for GPU-less edge devices &amp; maximum portability</li>
</ul>

<hr />

<h2 id="-resources">📚 Resources</h2>

<ul>
  <li><a href="https://docs.nvidia.com/tensorrt/">NVIDIA TensorRT Documentation</a></li>
  <li><a href="https://pytorch.org/get-started/locally/">PyTorch CUDA Support</a></li>
  <li><a href="https://onnxruntime.ai/">ONNX Runtime</a></li>
  <li><a href="https://github.com/ganeshmohane">Model Optimization Guide</a></li>
</ul>

<hr />

<p><strong>Have you faced runtime performance challenges? Let me know your experience in the comments!</strong></p>]]></content><author><name>Ganesh Mohane</name><email>ganeshmohane5@gmail.com</email></author><category term="ml-optimization" /><category term="machine-learning" /><category term="gpu" /><category term="performance" /><category term="cuda" /><category term="tensorrt" /><category term="inference" /><summary type="html"><![CDATA[Same model can run 200ms on CPU, 80ms on CUDA, or 40ms on TensorRT. Learn why runtimes matter for ML inference.]]></summary></entry></feed>