triNNity DNN tools

The triNNity DNN toolkit (compiler, optimizer, and primitive library)

triNNity primitive library

triNNity is a header-only C++17 template library with over 80 DNN convolution algorithms. It’s a collaborative effort with several other people in our research group to collect as many DNN convolution algorithms as possible in one place, and give them clean, simple, and performant implementations. It is also a testbed for algorithm design for DNN convolution.

The library implements normal dense convolution (both direct and GEMM-based), strided convolution, dilated convolution, group convolution, sparse convolution, Winograd convolution, FFT convolution, and more, including high-performance specialized algorithms for cases such as 1x1 convolution.

Many libraries and frameworks present algorithms such as im2col and FFT as monolithic operations, but there are in fact dozens of algorithmic variants of these approaches, each better suited to some kinds of convolutions than others. Our paper in ASAP 2017 details many of these algorithms.

Under the hood, the library uses BLAS, OpenMP multithreading, SIMD vectorization, and more, without any programmer intervention required. It can also run completely standalone, with none or only a subset of these components enabled. We currently support x86_64 and aarch64, but support for more platforms is planned. Since the library is released as header-only C++, all that's really required to bring up a new platform is a working compiler supporting the C++17 standard.

We have working, well-tested integration with the Intel MKL, OpenBLAS, ARM Compute Library, FFTW, and libxsmm, among others, as back-end libraries providing specific functionality (such as optimized GEMM routines).

The library is released under the BSD3 license, and is accompanied by an extensive performance benchmark suite.

triNNity DNN compiler and optimizer

We’ve developed a sophisticated ahead-of-time optimization framework for DNNs based on the PBQP formulation. It uses profiled layer timings from performance benchmarking to build a cost model that statically chooses among the 80+ convolution algorithms in the primitive library, producing a provably optimal instantiation of a full CNN.

Our compiler turns your Caffe deploy.prototxt directly into highly efficient native code, which can be run standalone to perform inference.

You can obtain the compiler and optimizer from our public BitBucket, and there is also a demonstration project with benchmarking workflows: demos.

Our paper on the DNN optimizer appeared at CGO 2018.

Performance

We’ve run some performance comparisons with Intel’s native MKL-DNN framework: