Pure C++20 · MIT licensed · zero SDK to start

Run neural networks
anywhere, simply.

ii is a small, fast inference engine and command-line runner. Point it at a model and an image: it preprocesses, runs inference, and gives you the result. One narrow interface drives a built-in pure-C++ engine and the vendor-accelerated backends (TensorFlow Lite, TensorRT, ONNX Runtime / DirectML), so the same models run the same way on a laptop, a workstation GPU, or a tiny embedded board.

terminal
# build — built-in engine, no SDK required
$ cmake -S . -B build && cmake --build build -j

# run an ONNX model on an image, out of the box
$ ./ii model.onnx image.jpg

0 external deps to start
1 interface, many backends
4 interchangeable backends
CPU·GPU·NPU: same binary, your choice

Why ii

Everything you need to ship a model

A complete, no-fuss path from a model file to a result — detection, image-to-image, live video, benchmarking — identical across every backend.

One interface, many backends

TensorFlow Lite, TensorRT, ONNX Runtime / DirectML and a built-in pure-C++ engine all sit behind a single ii::Engine. Pick which backends to compile in, then choose one at runtime with --backend.

No SDK to get started

The built-in engine has zero external dependencies: it compiles on any platform with just a C++20 compiler and runs ONNX models out of the box. Configure and build with two cmake commands, then run inference straight away.

Object detection

Built-in YOLOv8 decoding and NMS with boxes drawn over the frame. COCO-80 labels by default, or supply your own class list with --classes.
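
For readers unfamiliar with the NMS step mentioned above, here is a minimal sketch of greedy non-maximum suppression. It is not ii's implementation: the `Box` struct, `iou` and `nms_sketch` are hypothetical names, and per-class suppression is an assumption. The two thresholds play the role of the `--conf` and `--iou` flags.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical box type for illustration; not ii's actual detection struct.
struct Box {
    float x1, y1, x2, y2;  // corners, with x1 <= x2 and y1 <= y2
    float score;           // confidence
    int   cls;             // class index
};

// Intersection-over-union of two axis-aligned boxes.
inline float iou(const Box& a, const Box& b) {
    const float ix = std::max(0.f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    const float iy = std::max(0.f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    const float inter = ix * iy;
    const float uni = (a.x2 - a.x1) * (a.y2 - a.y1)
                    + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
    return uni > 0.f ? inter / uni : 0.f;
}

// Greedy per-class NMS (assumed variant): drop boxes below conf_thresh, then
// walk the rest by descending score and keep a box only if no already-kept
// box of the same class overlaps it by more than iou_thresh.
inline std::vector<Box> nms_sketch(std::vector<Box> boxes,
                                   float conf_thresh, float iou_thresh) {
    assert(iou_thresh >= 0.f && iou_thresh <= 1.f);
    boxes.erase(std::remove_if(boxes.begin(), boxes.end(),
                    [&](const Box& b) { return b.score < conf_thresh; }),
                boxes.end());
    std::stable_sort(boxes.begin(), boxes.end(),
                     [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& cand : boxes) {
        bool suppressed = false;
        for (const Box& k : kept) {
            if (k.cls == cand.cls && iou(k, cand) > iou_thresh) {
                suppressed = true;
                break;
            }
        }
        if (!suppressed) kept.push_back(cand);
    }
    return kept;
}
```

Calling `nms_sketch(boxes, 0.4f, 0.5f)` mirrors the thresholds used in the quick-start command; the returned boxes are ordered by descending score.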

Image-to-image models

Super-resolution, low-light enhancement and denoising. Show the result on screen or save it to PNG, with optional tiling and seam feathering for small-input models.
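
To illustrate what seam feathering means here, the sketch below blends overlapping tiles in one dimension with linearly ramped weights. It is only an assumed scheme for illustration: `feather_weight` and `blend_tiles` are hypothetical names, and ii's actual tiling and weighting may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Linear feather weight for sample i of a tile of length `tile`: ramps up
// over `overlap` samples at each edge and is always greater than zero.
inline float feather_weight(std::size_t i, std::size_t tile, std::size_t overlap) {
    assert(i < tile);
    if (overlap == 0) return 1.f;
    const std::size_t edge = std::min(i, tile - 1 - i);  // distance to nearest edge
    return std::min(1.f, static_cast<float>(edge + 1) /
                         static_cast<float>(overlap + 1));
}

// Blend 1-D tiles (each `tile` samples long, placed at offsets[t]) into a
// signal of length n. Overlapping samples become a weighted average, so the
// seam fades from one tile into the next instead of switching abruptly.
inline std::vector<float> blend_tiles(const std::vector<std::vector<float>>& tiles,
                                      const std::vector<std::size_t>& offsets,
                                      std::size_t tile, std::size_t overlap,
                                      std::size_t n) {
    assert(tiles.size() == offsets.size());
    std::vector<float> acc(n, 0.f), wsum(n, 0.f);
    for (std::size_t t = 0; t < tiles.size(); ++t) {
        assert(tiles[t].size() == tile && offsets[t] + tile <= n);
        for (std::size_t i = 0; i < tile; ++i) {
            const float w = feather_weight(i, tile, overlap);
            acc[offsets[t] + i]  += w * tiles[t][i];
            wsum[offsets[t] + i] += w;
        }
    }
    for (std::size_t i = 0; i < n; ++i)
        if (wsum[i] > 0.f) acc[i] /= wsum[i];  // normalise by total weight
    return acc;
}
```

A 2-D version would apply the same weight along rows and columns and multiply the two, which is one common way to feather image tiles.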

Live video

Camera capture (V4L2 on Linux, Media Foundation on Windows) with on-the-fly inference and a zero-copy on-screen window plus an FPS / jitter overlay.

Benchmark & compare

Warmup + timed runs, CPU-vs-delegate comparison, and cross-model comparison — float32 vs INT8, or one backend against another — with CSV telemetry export.
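
The warmup-then-timed-runs pattern can be sketched as below. This is a generic illustration rather than ii's benchmark code; `BenchResult` and `benchmark_sketch` are hypothetical names, and `run_once` stands in for a single inference call.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

struct BenchResult {
    std::vector<double> ms;  // one latency sample per timed run, in milliseconds
    double mean_ms = 0.0;
};

// Calls run_once() `warmup` times untimed (to populate caches and trigger any
// lazy initialisation), then times `runs` further calls individually.
template <class Fn>
BenchResult benchmark_sketch(Fn&& run_once, std::size_t warmup, std::size_t runs) {
    assert(runs > 0);
    for (std::size_t i = 0; i < warmup; ++i) run_once();

    using clock = std::chrono::steady_clock;  // monotonic, unaffected by clock changes
    BenchResult r;
    r.ms.reserve(runs);
    double sum = 0.0;
    for (std::size_t i = 0; i < runs; ++i) {
        const auto t0 = clock::now();
        run_once();
        const auto t1 = clock::now();
        const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        r.ms.push_back(ms);
        sum += ms;
    }
    r.mean_ms = sum / static_cast<double>(runs);
    return r;
}
```

Keeping the per-run samples, rather than only the mean, is what makes percentile or jitter statistics and a CSV export possible afterwards.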

Quantization-aware

INT8, UInt8, Float16 and Float32 tensors. The scale and zero-point are read directly from the model, so quantized graphs just work.
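
As a reminder of what the scale and zero-point do, the usual affine quantization mapping is real = scale × (q − zero_point). The sketch below shows it for INT8; `QuantParams`, `dequantize` and `quantize` are hypothetical helpers, not ii's API, and the rounding mode (`std::lround`, which rounds halves away from zero) is an assumption that may differ from a given runtime.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Per-tensor affine quantization parameters (hypothetical struct).
struct QuantParams {
    float        scale;       // real-value step per integer step, must be > 0
    std::int32_t zero_point;  // integer value that represents real 0.0
};

// INT8 -> float: real = scale * (q - zero_point).
inline float dequantize(std::int8_t q, QuantParams p) {
    return p.scale * static_cast<float>(static_cast<std::int32_t>(q) - p.zero_point);
}

// float -> INT8: q = round(real / scale) + zero_point, saturated to [-128, 127].
inline std::int8_t quantize(float x, QuantParams p) {
    assert(p.scale > 0.f);
    const long v = std::lround(x / p.scale) + static_cast<long>(p.zero_point);
    return static_cast<std::int8_t>(std::max<long>(-128, std::min<long>(127, v)));
}
```

The same formula applies to UInt8 tensors with the saturation range changed to [0, 255].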

Embeddable

Everything but the CLI compiles into one static library, ii_core. Link it and depend on the narrow ii::Engine interface — no backend SDK headers leak in.

Runs everywhere

The same binary covers CPU, GPU and NPU; the same source builds on Linux, Windows and macOS. Start with nothing, scale up to vendor acceleration with one flag.

Backends

Choose at build time, dispatch at runtime

A single binary can carry any combination of backends. Only the built-in engine is on by default — it needs no SDK, so the project builds and runs out of the box.
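
One way to picture "choose at build time, dispatch at runtime" is a factory whose branches are compiled in or out by the build flags. The sketch below is an assumption about the general pattern, not ii's source: `EngineSketch`, `BuiltinEngine`, `TfliteEngine` and `make_engine_sketch` are hypothetical names, and only the `USE_TFLITE` macro name is taken from the backend list.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical narrow interface, standing in for the role ii::Engine plays.
struct EngineSketch {
    virtual ~EngineSketch() = default;
    virtual std::string name() const = 0;
};

// Always compiled: needs no SDK.
struct BuiltinEngine : EngineSketch {
    std::string name() const override { return "ii"; }
};

#ifdef USE_TFLITE
// Compiled only when the build enables the TensorFlow Lite backend.
struct TfliteEngine : EngineSketch {
    std::string name() const override { return "tflite"; }
};
#endif

// Runtime dispatch by name. Returns nullptr when the requested backend is
// unknown or was not compiled into this binary.
inline std::unique_ptr<EngineSketch> make_engine_sketch(const std::string& backend) {
    assert(!backend.empty());
    if (backend == "ii") return std::make_unique<BuiltinEngine>();
#ifdef USE_TFLITE
    if (backend == "tflite") return std::make_unique<TfliteEngine>();
#endif
    return nullptr;
}
```

Because the SDK-specific types live behind the `#ifdef`, code that only sees the interface never needs the vendor headers.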

TensorFlow Lite (opt-in)

CPU plus an optional NPU/GPU delegate selected with --delegate. Quantized .tflite models.

  • .tflite
  • USE_TFLITE
  • delegate support

NVIDIA TensorRT (opt-in)

Maximum throughput on NVIDIA GPUs. Requires TensorRT 10+ and the CUDA runtime.

  • .engine / .onnx
  • USE_TENSORRT
  • CUDA acceleration

ONNX Runtime / DirectML (opt-in)

DirectML execution provider on Windows — any D3D12 GPU, iGPU or NPU; CPU EP elsewhere.

  • .onnx
  • USE_DIRECTML
  • GPU / iGPU / NPU
The built-in ii engine

A dependency-free engine that doubles as a correctness oracle

A self-contained graph executor in plain C++20 — no TensorFlow, no ONNX Runtime, no CUDA. It loads ONNX directly and supports a broad op set: Conv, Gemm/MatMul, the common activations, pooling, normalization, resize, concat/slice/gather/reshape and reductions.

  • Bit-identical to serial. Heavy kernels fan out across cores by splitting the output — never a reduction axis — so results never drift.
  • Allocation-free parallelism. A reused, lazy worker pool runs each kernel as one synchronized wave — no per-call heap, queue or future.
  • Cost-based & model-agnostic. Tiny tensors run serially, large ones fan out — same logic for CNNs, transformers and super-resolution.
  • A reference oracle. Deterministic outputs make it the baseline to validate quantized models and heavier backends against.
parallel_for — bit-identical to serial
// split the OUTPUT space across cores; the
// reduction stays whole, so results match serial
ii::parallel_for(out.size(), grain, [&](size_t i) {
    float acc = 0.f;
    for (size_t k = 0; k < K; ++k)
        acc += a[i * K + k] * w[k];
    out[i] = acc;            // each i written once
});
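
To make the output-splitting idea concrete, here is a self-contained sketch that can be compiled without ii. It is not the library's implementation: `parallel_for_sketch` is a hypothetical stand-in that spawns `std::thread`s per call (the real engine is described as using a reused worker pool), and `matvec` is an illustrative kernel. Each output element is still accumulated by one thread in serial order, so the threaded result should compare equal to the single-threaded one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Runs fn(i) for i in [0, n). Work smaller than one grain, or a single-thread
// budget, runs serially; otherwise the index range is cut into contiguous
// slices, one per thread. Hypothetical helper, not ii::parallel_for.
template <class Fn>
void parallel_for_sketch(std::size_t n, std::size_t grain, unsigned max_threads, Fn&& fn) {
    assert(grain > 0);
    const std::size_t chunks  = (n + grain - 1) / grain;
    const std::size_t workers = std::min<std::size_t>(std::max(1u, max_threads), chunks);
    if (workers <= 1) {
        for (std::size_t i = 0; i < n; ++i) fn(i);
        return;
    }
    const std::size_t per = (n + workers - 1) / workers;
    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t lo = w * per;
        const std::size_t hi = std::min(n, lo + per);
        if (lo >= hi) break;
        pool.emplace_back([lo, hi, &fn] {
            for (std::size_t i = lo; i < hi; ++i) fn(i);
        });
    }
    for (std::thread& t : pool) t.join();
}

// out = A * w, with A stored row-major as rows x K. Rows (outputs) are split
// across threads; the K-long reduction for each row stays on one thread.
inline std::vector<float> matvec(const std::vector<float>& a, const std::vector<float>& w,
                                 std::size_t rows, unsigned threads) {
    const std::size_t K = w.size();
    assert(a.size() == rows * K);
    std::vector<float> out(rows);
    parallel_for_sketch(rows, 8, threads, [&](std::size_t i) {
        float acc = 0.f;
        for (std::size_t k = 0; k < K; ++k) acc += a[i * K + k] * w[k];
        out[i] = acc;  // each i is written exactly once
    });
    return out;
}
```

Splitting the `k` loop across threads instead would require combining partial sums, and floating-point addition is not associative, which is why that route can change the low bits of the result.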
Quick start

From clone to inference in minutes

Requires CMake ≥ 3.16 and a C++20 compiler. stb is fetched automatically. With default options you get a working binary immediately — no SDK to install.

# Linux / macOS / Windows — built-in engine, no SDK
$ cmake -S . -B build
$ cmake --build build -j

# Run an ONNX model on the built-in engine
$ ./ii model.onnx image.jpg

# Add a vendor backend when you have the SDK
$ cmake -S . -B build -DUSE_TFLITE=ON -DTFLITE_ROOT=/usr

# Draw YOLOv8 boxes in a window
$ ./ii yolov8m.onnx image.jpg --yolo --display \
       --conf 0.4 --iou 0.5

# Custom labels instead of COCO-80
$ ./ii yolov8m.onnx image.jpg --yolo --classes my.names

# Super-resolution / enhance: show or save output
$ ./ii fsrcnn.onnx image.jpg --display --show-output
$ ./ii fsrcnn.onnx image.jpg --save-output out.png

# Tiling / sliding window for small-input models
$ ./ii fsrcnn.onnx image.jpg --tile --save-output out.png

# Live camera detection with FPS overlay
$ ./ii yolov8m.onnx --camera --display --yolo --stats

# Pick a device (/dev/videoN on Linux, index on Windows)
$ ./ii yolov8m.onnx --camera /dev/video0 --display --yolo

# Warmup + timed runs (image or random input)
$ ./ii model.onnx image.jpg --benchmark --runs 100
$ ./ii model.onnx --random-input --benchmark --runs 100

# Cross-check INT8 against a float32 reference
$ ./ii model_int8.onnx --random-input \
       --compare model_fp32.onnx --random-runs 50
Download

Grab a prebuilt binary

Self-contained ii executables with the built-in engine — no SDK, no toolchain. Pick your platform and run.

All releases & changelog · Or build from source

Use as a library

Drop the engine into your own application

Everything except the CLI compiles into one static library, ii_core. A host program depends on just the narrow ii::Engine interface — it pulls in no backend SDK headers — and links whichever backends it built.

  • One ii::Engine = one loaded model; load and invoke are sequential.
  • input_data() / output_data() point directly into backend memory — zero copy.
  • Reusable helpers: preprocess, YOLO decode + NMS, image-to-image, tiling, parallelism.
  • Link via add_subdirectory() — the src/ include path propagates.
main.cpp
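#include <cstring>   // std::memcpy
#include <vector>
#include "inference.h"

int main() {
    auto eng = ii::make_engine("ii");   // or "tflite"/"tensorrt"/"directml"
    ii::Engine::Options opts;
    opts.delegate_path = "";             // "" = CPU
    opts.num_threads   = 4;
    eng->load("model.onnx", opts);

    // Placeholder input: replace with your preprocessed pixels, sized to input 0.
    std::vector<unsigned char> rgb(eng->inputs()[0].bytes);
    std::memcpy(eng->input_data(0), rgb.data(), rgb.size());
    eng->invoke();
    const void* out = eng->output_data(0); // read with outputs()[0] desc
    (void)out;
}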

Start running models with zero setup.

Clone, configure, build — and infer ONNX out of the box. Add a vendor backend whenever you need it.