ii is a small, fast inference engine and command-line runner.
Point it at a model and an image — it preprocesses, runs inference, and gives you
the result. One narrow interface drives a built-in pure-C++ engine and every major
accelerator, so the same models run the same way on a laptop, a workstation GPU,
or a tiny embedded board.
# build: built-in engine, no SDK required
$ cmake -S . -B build && cmake --build build -j
# run an ONNX model on an image, out of the box
$ ./ii model.onnx image.jpg
0 external deps to start
1 interface, many backends
4 interchangeable backends
CPU·GPU·NPU: same binary, your choice
Why ii
Everything you need to ship a model
A complete, no-fuss path from a model file to a result — detection,
image-to-image, live video, benchmarking — identical across every backend.
One interface, many backends
TensorFlow Lite, TensorRT, ONNX Runtime / DirectML and a built-in pure-C++ engine all
sit behind a single ii::Engine. Pick which backends to compile in, then choose one at
runtime with --backend.
No SDK to get started
The built-in engine has zero external dependencies — it compiles on any platform with just
a C++20 compiler and runs ONNX models out of the box. Configure and build with CMake,
then run inference with a single command.
Object detection
Built-in YOLOv8 decoding and NMS with boxes drawn over the frame. COCO-80 labels by
default, or supply your own class list with --classes.
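As a rough illustration of what the NMS step does after decoding, here is a minimal greedy, class-agnostic sketch. The `Box` struct and the `iou` and `nms` functions are hypothetical names made up for this example and are not taken from ii's source; real YOLOv8 post-processing usually runs NMS per class. The two thresholds play the role of the --conf and --iou flags.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical box type: corner coordinates plus a confidence score.
struct Box {
    float x1, y1, x2, y2, score;
};

// Intersection-over-union of two axis-aligned boxes.
inline float iou(const Box& a, const Box& b) {
    const float iw = std::max(0.f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    const float ih = std::max(0.f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    const float inter = iw * ih;
    const float uni = (a.x2 - a.x1) * (a.y2 - a.y1) +
                      (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
    return uni > 0.f ? inter / uni : 0.f;
}

// Greedy NMS: drop low-confidence boxes, sort by score (highest first),
// then keep a box only if it does not overlap an already kept box too much.
inline std::vector<Box> nms(std::vector<Box> boxes, float conf, float iou_thresh) {
    assert(iou_thresh >= 0.f && iou_thresh <= 1.f);
    boxes.erase(std::remove_if(boxes.begin(), boxes.end(),
                               [conf](const Box& b) { return b.score < conf; }),
                boxes.end());
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& cand : boxes) {
        const bool overlaps = std::any_of(
            kept.begin(), kept.end(),
            [&](const Box& k) { return iou(cand, k) > iou_thresh; });
        if (!overlaps) kept.push_back(cand);
    }
    return kept;
}
```

Calling `nms(boxes, 0.4f, 0.5f)` mirrors the `--conf 0.4 --iou 0.5` settings used in the quick start.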
Image-to-image models
Super-resolution, low-light enhancement and denoising. Show the result on screen or save it
to PNG, with optional tiling and seam feathering for small-input models.
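To make the idea of tiling with seam feathering concrete, the sketch below works on a 1-D signal for brevity. It is an assumption about the general technique, not ii's implementation: `Signal`, `feather_weight` and `run_tiled` are hypothetical names, the model is assumed to return a tile of the same length as its input (a real super-resolution model would upscale), and the linear ramp is only one possible feathering window.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Signal = std::vector<float>;

// Weight of sample i inside a tile: ramps up linearly over `overlap`
// samples at each edge and is 1 in the interior.
inline float feather_weight(std::size_t i, std::size_t tile, std::size_t overlap) {
    const std::size_t edge = std::min(i, tile - 1 - i);  // distance to nearest edge
    if (edge >= overlap) return 1.f;
    return static_cast<float>(edge + 1) / static_cast<float>(overlap + 1);
}

// Run `model` on overlapping tiles and blend the results with feathered
// weights, so that seams between neighbouring tiles fade into each other.
inline Signal run_tiled(const Signal& in, std::size_t tile, std::size_t overlap,
                        const std::function<Signal(const Signal&)>& model) {
    assert(tile > overlap && in.size() >= tile);
    const std::size_t stride = tile - overlap;
    Signal acc(in.size(), 0.f), wsum(in.size(), 0.f);
    for (std::size_t start = 0;; start += stride) {
        start = std::min(start, in.size() - tile);  // keep the last tile inside the input
        const Signal patch(in.begin() + static_cast<std::ptrdiff_t>(start),
                           in.begin() + static_cast<std::ptrdiff_t>(start + tile));
        const Signal out = model(patch);
        assert(out.size() == tile);
        for (std::size_t i = 0; i < tile; ++i) {
            const float w = feather_weight(i, tile, overlap);
            acc[start + i] += w * out[i];   // weighted sum of tile outputs
            wsum[start + i] += w;           // total weight per sample
        }
        if (start + tile == in.size()) break;
    }
    for (std::size_t i = 0; i < acc.size(); ++i) acc[i] /= wsum[i];  // normalize
    return acc;
}
```

Because every sample is divided by its accumulated weight, a model that behaves identically on every tile produces no visible seam, whatever the overlap.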
Live video
Camera capture (V4L2 on Linux, Media Foundation on Windows) with on-the-fly inference and a
zero-copy on-screen window plus an FPS / jitter overlay.
Benchmark & compare
Warmup + timed runs, CPU-vs-delegate comparison, and cross-model comparison — float32 vs
INT8, or one backend against another — with CSV telemetry export.
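The warmup-then-timed-runs pattern can be sketched in a few lines. This is an illustrative helper under assumed names (`BenchStats`, `benchmark`), not the code behind --benchmark: warmup iterations are executed but not measured, and each timed run is clocked with a monotonic clock.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>

// Hypothetical result record for one benchmark session.
struct BenchStats {
    double mean_ms, min_ms, max_ms;
    std::size_t runs;
};

// Call fn() `warmup` times untimed, then `runs` times with per-run timing.
template <class Fn>
BenchStats benchmark(Fn&& fn, std::size_t warmup, std::size_t runs) {
    assert(runs > 0);
    for (std::size_t i = 0; i < warmup; ++i) fn();  // let caches and clocks settle

    using Clock = std::chrono::steady_clock;         // monotonic, safe for intervals
    double total = 0.0;
    double lo = std::numeric_limits<double>::max();
    double hi = 0.0;
    for (std::size_t i = 0; i < runs; ++i) {
        const auto t0 = Clock::now();
        fn();
        const auto t1 = Clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        total += ms;
        lo = std::min(lo, ms);
        hi = std::max(hi, ms);
    }
    return {total / static_cast<double>(runs), lo, hi, runs};
}
```

The spread between `min_ms` and `max_ms` gives a crude view of jitter; per-run samples could also be written out as CSV rows for later comparison.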
Quantization-aware
INT8, UInt8, Float16 and Float32 tensors. The scale and zero-point are read directly from
the model, so quantized graphs just work.
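For readers new to quantized tensors, the usual affine mapping between an INT8 value q and its real value is real = scale * (q - zero_point). The sketch below shows that mapping in both directions; `QuantParams`, `dequantize` and `quantize` are hypothetical names for illustration, and per-tensor (rather than per-channel) parameters are assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical holder for per-tensor quantization parameters.
struct QuantParams {
    float scale;
    std::int32_t zero_point;
};

// INT8 -> float: real = scale * (q - zero_point)
inline float dequantize(std::int8_t q, QuantParams p) {
    return p.scale * static_cast<float>(static_cast<std::int32_t>(q) - p.zero_point);
}

// float -> INT8: q = clamp(round(real / scale) + zero_point, -128, 127)
inline std::int8_t quantize(float real, QuantParams p) {
    assert(p.scale > 0.f);
    const long q = std::lround(real / p.scale) + p.zero_point;
    return static_cast<std::int8_t>(std::clamp<long>(q, -128, 127));
}
```

With scale 0.5 and zero-point 10, the stored value 12 maps to 1.0, and values outside the representable range saturate instead of wrapping.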
Embeddable
Everything but the CLI compiles into one static library, ii_core. Link it and
depend on the narrow ii::Engine interface — no backend SDK headers leak in.
Runs everywhere
The same binary covers CPU, GPU and NPU; the same source builds on Linux, Windows and
macOS. Start with nothing, scale up to vendor acceleration with one flag.
Backends
Choose at build time, dispatch at runtime
A single binary can carry any combination of backends. Only the built-in
engine is on by default — it needs no SDK, so the project builds and runs out of the box.
ii built-in (default: ON)
Self-contained pure-C++20 graph executor. No SDK.
Loads ONNX directly, builds on every platform; ideal for embedded.
Model format: .onnx
Build option: USE_II_ENGINE
Highlight: no external SDK

TensorFlow Lite (opt-in)
CPU plus an optional NPU/GPU delegate selected with --delegate. Quantized .tflite models.
Model format: .tflite
Build option: USE_TFLITE
Highlight: delegate support

NVIDIA TensorRT (opt-in)
Maximum throughput on NVIDIA GPUs. Requires TensorRT 10+ and the CUDA runtime.
Model format: .engine / .onnx
Build option: USE_TENSORRT
Highlight: CUDA acceleration

ONNX Runtime / DirectML (opt-in)
DirectML execution provider on Windows (any D3D12 GPU, iGPU or NPU); CPU EP elsewhere.
Model format: .onnx
Build option: USE_DIRECTML
Highlight: GPU / iGPU / NPU

The built-in ii engine
A dependency-free engine that doubles as a correctness oracle
A self-contained graph executor in plain C++20 — no TensorFlow, no ONNX
Runtime, no CUDA. It loads ONNX directly and supports a broad op set: Conv, Gemm/MatMul,
the common activations, pooling, normalization, resize, concat/slice/gather/reshape and
reductions.
Bit-identical to serial. Heavy kernels fan
out across cores by splitting the output — never a reduction axis — so results never drift.
Allocation-free parallelism. A reused,
lazy worker pool runs each kernel as one synchronized wave — no per-call heap, queue or future.
Cost-based & model-agnostic. Tiny tensors
run serially, large ones fan out — same logic for CNNs, transformers and super-resolution.
A reference oracle. Deterministic outputs make
it the baseline to validate quantized models and heavier backends against.
parallel_for — bit-identical to serial
// split the OUTPUT space across cores; the
// reduction stays whole, so results match serial
ii::parallel_for(out.size(), grain, [&](size_t i) {
    float acc = 0.f;
    for (size_t k = 0; k < K; ++k)
        acc += a[i * K + k] * w[k];
    out[i] = acc;  // each i written once
});
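Why splitting the output keeps results bit-identical can be shown with a small stand-alone sketch. The names `parallel_for_sketch` and `matvec` are hypothetical, and the helper deliberately simplifies: it spawns threads per call, whereas the text above describes a reused worker pool with no per-call allocation. What it does share is the partitioning rule: each output index is owned by exactly one thread, and the inner reduction over k always runs in the same order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Split [0, n) into contiguous chunks of at least `grain` indices and run
// fn(i) for every i. Small ranges stay on the calling thread.
template <class Fn>
void parallel_for_sketch(std::size_t n, std::size_t grain, Fn&& fn) {
    assert(grain > 0);
    const std::size_t hw =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    const std::size_t chunks = std::min(hw, (n + grain - 1) / grain);
    if (chunks <= 1) {                      // cost-based cut-off: run serially
        for (std::size_t i = 0; i < n; ++i) fn(i);
        return;
    }
    const std::size_t per = (n + chunks - 1) / chunks;
    std::vector<std::thread> workers;
    workers.reserve(chunks - 1);
    for (std::size_t c = 1; c < chunks; ++c) {
        const std::size_t begin = std::min(n, c * per);
        const std::size_t end = std::min(n, begin + per);
        workers.emplace_back([begin, end, &fn] {
            for (std::size_t i = begin; i < end; ++i) fn(i);
        });
    }
    for (std::size_t i = 0; i < std::min(n, per); ++i) fn(i);  // chunk 0 runs here
    for (std::thread& t : workers) t.join();
}

// Matrix-vector product: the OUTPUT rows are split, the sum over k is not.
inline std::vector<float> matvec(const std::vector<float>& a,
                                 const std::vector<float>& w,
                                 std::size_t rows, std::size_t grain) {
    const std::size_t K = w.size();
    assert(a.size() == rows * K);
    std::vector<float> out(rows);
    parallel_for_sketch(rows, grain, [&](std::size_t i) {
        float acc = 0.f;
        for (std::size_t k = 0; k < K; ++k) acc += a[i * K + k] * w[k];
        out[i] = acc;                       // each i written by one thread only
    });
    return out;
}
```

Passing a `grain` larger than `rows` forces the serial path, which makes it easy to compare a serial run against a fanned-out run element by element.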
Quick start
From clone to inference in minutes
Requires CMake ≥ 3.16 and a C++20 compiler. stb is fetched automatically.
With default options you get a working binary immediately — no SDK to install.
# Linux / macOS / Windows: built-in engine, no SDK
$ cmake -S . -B build
$ cmake --build build -j

# Run an ONNX model on the built-in engine
$ ./ii model.onnx image.jpg

# Add a vendor backend when you have the SDK
$ cmake -S . -B build -DUSE_TFLITE=ON -DTFLITE_ROOT=/usr

# Draw YOLOv8 boxes in a window
$ ./ii yolov8m.onnx image.jpg --yolo --display \
    --conf 0.4 --iou 0.5

# Custom labels instead of COCO-80
$ ./ii yolov8m.onnx image.jpg --yolo --classes my.names

# Super-resolution / enhance: show or save output
$ ./ii fsrcnn.onnx image.jpg --display --show-output
$ ./ii fsrcnn.onnx image.jpg --save-output out.png

# Tiling / sliding window for small-input models
$ ./ii fsrcnn.onnx image.jpg --tile --save-output out.png

# Live camera detection with FPS overlay
$ ./ii yolov8m.onnx --camera --display --yolo --stats

# Pick a device (/dev/videoN on Linux, index on Windows)
$ ./ii yolov8m.onnx --camera /dev/video0 --display --yolo

# Warmup + timed runs (image or random input)
$ ./ii model.onnx image.jpg --benchmark --runs 100
$ ./ii model.onnx --random-input --benchmark --runs 100

# Cross-check INT8 against a float32 reference
$ ./ii model_int8.onnx --random-input \
    --compare model_fp32.onnx --random-runs 50
Download
Grab a prebuilt binary
Self-contained ii executables with the built-in engine — no SDK,
no toolchain. Pick your platform and run.
Everything except the CLI compiles into one static library,
ii_core. A host program depends on just the narrow ii::Engine
interface — it pulls in no backend SDK headers — and links whichever backends it built.
One ii::Engine = one loaded model; load and invoke are sequential.
input_data() / output_data() point directly into backend memory — zero copy.
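The shape of such a narrow interface can be sketched as follows. Everything here lives in a made-up `sketch` namespace and is an assumption for illustration: the real ii::Engine signatures are not shown in this text, only the names input_data() and output_data() and the load-then-invoke order are. The `float*` return types, the size accessors, `DoublingEngine` and `make_engine` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

namespace sketch {  // hypothetical stand-in, not the real ii:: namespace

// Narrow interface: the host program sees only this class and never
// includes a backend SDK header.
class Engine {
public:
    virtual ~Engine() = default;
    virtual bool load(const std::string& model_path) = 0;  // one engine, one model
    virtual bool invoke() = 0;                              // run after load()
    virtual float* input_data() = 0;                        // backend-owned memory
    virtual const float* output_data() const = 0;           // backend-owned memory
    virtual std::size_t input_size() const = 0;
    virtual std::size_t output_size() const = 0;
};

// Toy backend that doubles every input element; it stands in for a real one.
class DoublingEngine final : public Engine {
public:
    bool load(const std::string&) override {
        in_.assign(4, 0.f);
        out_.assign(4, 0.f);
        return true;
    }
    bool invoke() override {
        for (std::size_t i = 0; i < in_.size(); ++i) out_[i] = 2.f * in_[i];
        return true;
    }
    float* input_data() override { return in_.data(); }
    const float* output_data() const override { return out_.data(); }
    std::size_t input_size() const override { return in_.size(); }
    std::size_t output_size() const override { return out_.size(); }

private:
    std::vector<float> in_, out_;
};

// Runtime dispatch by name, in the spirit of a --backend flag.
inline std::unique_ptr<Engine> make_engine(const std::string& backend) {
    if (backend == "toy") return std::make_unique<DoublingEngine>();
    return nullptr;  // backend not compiled into this build
}

}  // namespace sketch
```

A host would call `load()` once, write directly through `input_data()`, call `invoke()`, and read through `output_data()`, with no intermediate copy between host buffers and the backend.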