NN-512
Introduction
- NN-512 is a compiler that generates C99 code for neural net inference
- It takes as input a simple text description of a convolutional neural net inference graph
- It produces as output a stand-alone C99 implementation of that graph
- The generated C99 code uses AVX-512 vector instructions to perform inference
- The generated C99 code is human-readable and should be compiled with GCC 9.1 or later
- Earlier versions of GCC may also be used, yielding slightly inferior object code
- The generated C99 code has no dependencies outside the C POSIX library
- NN-512 is a Go program with no dependencies outside the Go standard library
- The NN-512 compiler executable is stand-alone
- NN-512 performs a variety of inference graph optimizations
  - Fusion of elementwise operations into adjacent operations
  - Fusion of similar convolutions (as needed for, e.g., ResNet)
  - Removal of concatenations (as needed for, e.g., DenseNet)
  - End-to-end planning of memory layout
- NN-512 generates specialized code for each tensor operation
  - Guided by a description of the target CPU cache hierarchy
  - Thread-level parallelism is maximized while limiting synchronization costs
  - Simplified code is generated for tensor edges, exploiting tile/vector overhang
  - Complete knowledge of memory layout simplifies addressing
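The removal of concatenations can be pictured with a minimal C99 sketch. This is not NN-512's generated code: the function names (`fill_channels`, `concat_copy`, `concat_free`), the channel-major layout, and the fill values are assumptions made purely for illustration. The idea shown is that once the memory layout is planned end to end, each producer can write at its channel offset inside the consumer's buffer, so no copy is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical producer: writes `channels` planes of `hw` floats,
   filling every element of channel c with base + c. */
void fill_channels(float *dst, int channels, int hw, float base) {
    assert(channels > 0 && hw > 0);
    for (int c = 0; c < channels; c++)
        for (int i = 0; i < hw; i++)
            dst[(size_t)c * hw + i] = base + (float)c;
}

/* With an explicit concatenation: each producer fills its own scratch
   buffer, then both are copied into the output along the channel axis. */
void concat_copy(float *out, int ca, int cb, int hw,
                 float *scratch_a, float *scratch_b) {
    fill_channels(scratch_a, ca, hw, 10.0f);
    fill_channels(scratch_b, cb, hw, 20.0f);
    memcpy(out, scratch_a, sizeof(float) * (size_t)ca * hw);
    memcpy(out + (size_t)ca * hw, scratch_b, sizeof(float) * (size_t)cb * hw);
}

/* Concatenation removed: the layout is known in advance, so each
   producer writes directly at its channel offset in the output. */
void concat_free(float *out, int ca, int cb, int hw) {
    fill_channels(out, ca, hw, 10.0f);
    fill_channels(out + (size_t)ca * hw, cb, hw, 20.0f);
}
```

Both functions leave the same bytes in `out`; the second simply never materializes the intermediate buffers.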
Main Features
- Generates efficient code for convolutions with arbitrary filter shape, stride, and dilation
  - The input data tensor is split into disjoint subtensors modulo the stride, while being packed for the cache
  - The weight tensor is split similarly, taking dilation into account
  - Corresponding subtensors are multiplied, with accumulation at heightwise offsets
  - The output data tensor is formed by combining accumulators at widthwise offsets
  - Example: Filter1x7 Stride1x4 Dilation1x1
  - Example: Filter3x6 Stride3x4 Dilation3x1
  - Example: Filter8x6 Stride4x1 Dilation1x1
  - Example: Filter4x4 Stride2x2 Dilation3x3
  - Example: Filter7x7 Stride4x3 Dilation2x1
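The splitting modulo the stride described above can be sketched in one dimension. The code below is a simplified illustration, not the generated code: it is 1-D only, uses no padding, and the names `conv1d_direct`, `conv1d_phases`, and `MAX_LEN` are hypothetical. It relies on the identity that input index `o*stride + k*dilation` equals `(o + q)*stride + r` with `q = (k*dilation) / stride` and `r = (k*dilation) % stride`, so each filter tap touches exactly one input phase and behaves there like a unit-stride tap.

```c
#include <assert.h>

/* Direct 1-D convolution (correlation form, no padding):
   out[o] = sum over k of w[k] * in[o*stride + k*dilation]. */
void conv1d_direct(const float *in, const float *w, int taps,
                   int stride, int dilation, float *out, int out_len) {
    assert(stride > 0 && dilation > 0);
    for (int o = 0; o < out_len; o++) {
        float acc = 0.0f;
        for (int k = 0; k < taps; k++)
            acc += w[k] * in[o * stride + k * dilation];
        out[o] = acc;
    }
}

/* Same result, computed by splitting input and weights modulo the stride.
   Phase r of the input holds elements r, r+stride, r+2*stride, ... */
void conv1d_phases(const float *in, int in_len, const float *w, int taps,
                   int stride, int dilation, float *out, int out_len) {
    enum { MAX_LEN = 64 };  /* illustrative bound on the input length */
    float phase[MAX_LEN];
    assert(stride > 0 && dilation > 0 && in_len <= MAX_LEN);
    for (int o = 0; o < out_len; o++) out[o] = 0.0f;
    for (int r = 0; r < stride; r++) {
        /* Pack phase r of the input into a dense buffer. */
        int n = 0;
        for (int i = r; i < in_len; i += stride) phase[n++] = in[i];
        /* Apply only the taps that fall on this phase, at unit stride. */
        for (int k = 0; k < taps; k++) {
            if ((k * dilation) % stride != r) continue;
            int q = (k * dilation) / stride;  /* offset within the phase */
            for (int o = 0; o < out_len; o++)
                out[o] += w[k] * phase[o + q];
        }
    }
}
```

The caller is assumed to pass an `out_len` for which every input index stays below `in_len`. In two dimensions the same decomposition applies along both height and width.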
- Generates very efficient code for Stride2x2 convolutions (in particular, Filter7x7 Stride2x2)
  - Fourier convolution with a 16x16 tile; four 8x8 FFTs per tile, interleaved modulo the stride
  - Fourier-domain data is packed for persistence in L2 cache
  - Fourier-domain weights are streamed in half-precision to conserve memory bandwidth
  - The 16x16 tiles are deinterleaved, multiplied, and accumulated (yielding 8x8 tiles)
  - The inverse transform (IFFT) operates on 8x8 tiles, four at a time
  - Example: Filter3x3 Stride2x2
  - Example: Filter4x3 Stride2x2
  - Example: Filter5x5 Stride2x2
  - Example: Filter7x7 Stride2x2
  - Example: Filter8x9 Stride2x2
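The principle behind Fourier convolution (transform, multiply pointwise, transform back) can be shown on a toy scale. The sketch below uses a 4-point, 1-D DFT and computes a circular convolution; it does not reproduce the 16x16 tiles, the stride interleaving, the cache packing, or the half-precision weights listed above. The type `cpx` and the functions `rotate_quarter`, `dft4`, and `circular_conv4` are hypothetical names chosen for this illustration.

```c
#include <assert.h>

typedef struct { float re, im; } cpx;

/* Multiply z by (-i)^n when sign is -1, or by (+i)^n when sign is +1.
   These are the only twiddle factors a 4-point DFT needs. */
cpx rotate_quarter(cpx z, int n, int sign) {
    assert(sign == 1 || sign == -1);
    int turns = ((n * sign) % 4 + 4) % 4;  /* multiplications by +i */
    cpx r = z;
    for (int j = 0; j < turns; j++) {
        cpx s = { -r.im, r.re };           /* (re + i*im) * i */
        r = s;
    }
    return r;
}

/* 4-point DFT (sign = -1) or unnormalized inverse DFT (sign = +1). */
void dft4(const cpx in[4], cpx out[4], int sign) {
    for (int k = 0; k < 4; k++) {
        cpx acc = { 0.0f, 0.0f };
        for (int j = 0; j < 4; j++) {
            cpx t = rotate_quarter(in[j], j * k, sign);
            acc.re += t.re;
            acc.im += t.im;
        }
        out[k] = acc;
    }
}

/* Circular convolution via the convolution theorem. */
void circular_conv4(const float a[4], const float b[4], float out[4]) {
    cpx ta[4], tb[4], fa[4], fb[4], prod[4], res[4];
    for (int j = 0; j < 4; j++) {
        ta[j].re = a[j]; ta[j].im = 0.0f;
        tb[j].re = b[j]; tb[j].im = 0.0f;
    }
    dft4(ta, fa, -1);
    dft4(tb, fb, -1);
    for (int k = 0; k < 4; k++) {          /* pointwise complex product */
        prod[k].re = fa[k].re * fb[k].re - fa[k].im * fb[k].im;
        prod[k].im = fa[k].re * fb[k].im + fa[k].im * fb[k].re;
    }
    dft4(prod, res, +1);
    for (int j = 0; j < 4; j++) out[j] = res[j].re / 4.0f;  /* normalize */
}
```

In a real tiled implementation the weight transform is computed once ahead of inference, and the pointwise products are accumulated across input channels before a single inverse transform.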
- Generates very efficient code for Filter3x3 convolutions
  - Winograd-Cook-Toom-Lavin convolution with an 8x8 tile
  - 4-way tile transforms fully utilize 512-bit vector registers (and amortize transposition costs)
  - Winograd-domain data is packed for persistence in L2 cache
  - Winograd-domain weights are streamed in half-precision to conserve memory bandwidth
  - Example: Filter3x3 small tensor
  - Example: Filter3x3 medium tensor
  - Example: Filter3x3 large tensor
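For readers unfamiliar with Winograd-style convolution, the smallest 1-D instance, F(2,3), shows the structure: a filter transform, a data transform, elementwise products, and an output transform. The 8x8 tile mentioned above is a larger 2-D case of the same scheme; the sketch below is only the textbook F(2,3) algorithm, and the function names `winograd_f23` and `direct_f23` are hypothetical.

```c
#include <assert.h>

/* Winograd F(2,3): two outputs of a 3-tap filter from four inputs,
   using 4 multiplications in the transformed domain instead of 6. */
void winograd_f23(const float d[4], const float g[3], float y[2]) {
    assert(d && g && y);
    /* Filter transform (can be computed once, ahead of inference). */
    float u0 = g[0];
    float u1 = (g[0] + g[1] + g[2]) * 0.5f;
    float u2 = (g[0] - g[1] + g[2]) * 0.5f;
    float u3 = g[2];
    /* Data transform. */
    float v0 = d[0] - d[2];
    float v1 = d[1] + d[2];
    float v2 = d[2] - d[1];
    float v3 = d[1] - d[3];
    /* Elementwise products in the Winograd domain. */
    float m0 = u0 * v0, m1 = u1 * v1, m2 = u2 * v2, m3 = u3 * v3;
    /* Output transform. */
    y[0] = m0 + m1 + m2;
    y[1] = m1 - m2 - m3;
}

/* Reference: direct sliding dot product. */
void direct_f23(const float d[4], const float g[3], float y[2]) {
    for (int o = 0; o < 2; o++)
        y[o] = d[o] * g[0] + d[o + 1] * g[1] + d[o + 2] * g[2];
}
```

Because the filter transform depends only on the weights, it can be stored in transformed form, which is consistent with the note above that Winograd-domain weights are streamed.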
- Generates very efficient code for Filter1x1 convolutions (including those that are not Stride1x1)
  - Single-precision matrix multiplication (making full use of the large vector register file)
  - The input data tensor is subsampled and packed for persistence in L2 cache
  - The weight tensor is packed and streamed for broadcast multiplication
  - Example: Filter1x1 Stride1x1
  - Example: Filter1x1 Stride2x2
  - Example: Filter1x1 Stride3x3
  - Example: Filter1x1 Stride4x4
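A Filter1x1 convolution reduces to a matrix multiplication once the input has been subsampled by the stride. The scalar C99 sketch below illustrates that reduction under assumed conditions: a channel-major (CHW) layout, no padding, no bias, and hypothetical function names `subsample_pack` and `matmul_1x1`. It contains no vectorization or cache blocking.

```c
#include <assert.h>

/* Gather every stride-th pixel of a CHW tensor into a dense
   [channels][out_h * out_w] matrix. */
void subsample_pack(const float *in, int channels, int h, int w,
                    int stride, float *packed) {
    assert(stride > 0);
    int out_h = (h - 1) / stride + 1;
    int out_w = (w - 1) / stride + 1;
    for (int c = 0; c < channels; c++)
        for (int y = 0; y < out_h; y++)
            for (int x = 0; x < out_w; x++)
                packed[(c * out_h + y) * out_w + x] =
                    in[(c * h + y * stride) * w + x * stride];
}

/* out[k][p] = sum over c of weights[k][c] * packed[c][p].
   Each weight is a scalar applied across a whole row of pixels, which is
   the access pattern that maps onto a broadcast multiply-accumulate. */
void matmul_1x1(const float *weights, const float *packed,
                int out_channels, int in_channels, int pixels, float *out) {
    for (int k = 0; k < out_channels; k++) {
        for (int p = 0; p < pixels; p++) out[k * pixels + p] = 0.0f;
        for (int c = 0; c < in_channels; c++) {
            float wkc = weights[k * in_channels + c];  /* broadcast operand */
            for (int p = 0; p < pixels; p++)
                out[k * pixels + p] += wkc * packed[c * pixels + p];
        }
    }
}
```

For example, a 2-channel 3x3 input with stride 2 packs to a 2x4 matrix, and a single output channel with weights {1, 2} then yields 4 output pixels.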
- Generates efficient code for various other tensor operations
  - Example: Fully connected (half-precision weights)
  - Example: Max pooling (3x3 window, 2x2 stride)
  - Example: Avg pooling (3x3 window, 2x2 stride)
  - Example: Max pooling (2x2 window, 2x2 stride)
  - Example: Avg pooling (2x2 window, 2x2 stride)
  - Example: Global max pooling
  - Example: Global avg pooling
  - Example: Softmax (small channel)
  - Example: Softmax (large channel)
- Integrates elementwise operations directly into the code generated for more complex tensor operations
  - Batch normalization is implemented by modifying weights and biases during packing (if possible)
  - Remaining elementwise operations are applied to data already present in registers (during inference)
  - Example: Convolution (Filter1x1 Stride1x1)
  - Example: Convolution (Filter3x3 Stride1x1)
  - Example: Convolution (Filter7x7 Stride2x2)
  - Example: Convolution (Filter4x4 Stride1x1)
  - Example: Pooling (3x3 window)
  - Example: Pooling (2x2 window)
  - Example: Pooling (global)
  - Example: Fully connected
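The two forms of elementwise integration described above can be sketched as follows. The code assumes that batch normalization has already been reduced to a per-channel affine map `y = scale * x + shift` (with `scale = gamma / sqrt(variance + epsilon)` and `shift = beta - mean * scale`) and that it directly follows a linear layer. The names `fold_batch_norm` and `linear_relu` are hypothetical, and ReLU stands in for any remaining elementwise operation.

```c
#include <assert.h>

/* Fold a per-channel affine batch norm (y = scale * x + shift) into the
   weights and bias of the preceding linear layer, in place. This is done
   once, before inference. */
void fold_batch_norm(float *weights, float *bias, int out_channels,
                     int in_channels, const float *scale, const float *shift) {
    assert(out_channels > 0 && in_channels > 0);
    for (int k = 0; k < out_channels; k++) {
        for (int c = 0; c < in_channels; c++)
            weights[k * in_channels + c] *= scale[k];
        bias[k] = bias[k] * scale[k] + shift[k];
    }
}

/* Linear layer with ReLU applied to the accumulator before it is stored,
   so no separate elementwise pass over the output is needed. */
void linear_relu(const float *weights, const float *bias, const float *in,
                 int out_channels, int in_channels, float *out) {
    for (int k = 0; k < out_channels; k++) {
        float acc = bias[k];
        for (int c = 0; c < in_channels; c++)
            acc += weights[k * in_channels + c] * in[c];
        out[k] = acc > 0.0f ? acc : 0.0f;  /* fused ReLU */
    }
}
```

After folding, inference runs only `linear_relu`; the batch normalization costs nothing at run time. Folding in this way is not valid when a nonlinearity sits between the linear layer and the normalization, which is one reading of the "if possible" qualifier above.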
Download (Version 35)
Browse Source Code
Documentation
Email
37ef.ced3@gmail.com
License
NN-512 (https://NN-512.com)
Copyright (C) 2019 [
37ef ced3 3727 60b4
3c29 f9c6 dc30 d518
f4f3 4106 6964 cab4
a06f c1a3 83fd 090e
]
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the
distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.