Innovative Workflows Astro & Particle Physics (IWAPP)

# APEIRON

Abstract Processing Environment for Intelligent Readout systems based On Neural networks

### **Neural Network on FPGA with High Level Synthesis**



Matteo Turisini - APE group - INFN Roma

2021 March 9th

## Hardware "programming"



### Agenda:

### **1. Introduction to High Level Synthesis (HLS)**



HLS is an FPGA design methodology well suited to describing complex functions, such as Neural Networks (NN), that are difficult or impractical to write with the traditional hardware design flow.

## Write software, think hardware





HLS flow representation by Xilinx

### HLS: C based design entry

High-level instructions are translated into hardware components, e.g. an array becomes an on-chip RAM.

Code annotations (pragmas) can "guide" the compiler, e.g. array partitioning lets stored data be accessed in parallel.

- Build tailored/custom processors
- Fast verification of the algorithm
- Detailed report about generated digital circuit
- Pragmas determine circuit topology
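The effect of array partitioning on parallel access can be illustrated with a toy cycle-count model. This is a minimal sketch in plain Python, not HLS code: the function names are hypothetical, and the figure of two read ports per memory bank is an assumption (typical of dual-port on-chip RAM), not a number taken from the slides.

```python
from math import ceil


def cyclic_partition(data, factor):
    """Split `data` into `factor` banks; element i goes to bank i % factor."""
    return [data[bank::factor] for bank in range(factor)]


def read_cycles(banks, ports_per_bank=2):
    """Clock ticks needed to read every element once.

    All banks are read in parallel; each bank serves at most
    `ports_per_bank` reads per tick (assumed value, see lead-in).
    """
    return max(ceil(len(bank) / ports_per_bank) for bank in banks)


# A 64-entry array, the same size as hit_list[64] in the example slides.
hit_list = list(range(64))

single_bank = read_cycles(cyclic_partition(hit_list, 1))      # one RAM
eight_banks = read_cycles(cyclic_partition(hit_list, 8))      # 8 small RAMs
fully_split = read_cycles(cyclic_partition(hit_list, 64))     # 64 registers
```

In this model a single RAM needs 32 ticks to deliver all 64 values, eight banks need 4 ticks, and a complete partition needs 1 tick, at the cost of more memory blocks or registers.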





# Example: inference on FPGA with HLS



## RiNNgs

INPUT

hit\_list[64]



The next slides show HLS results for two different algorithms implementing the same NN:

### HLS4ML and custom

- Accuracy
- Fixed precision

### HLS REPORT

- Resources
- Timing

FC - Fully Connected


Plots showing the NN confusion matrix of the hardware implementation obtained with HLS.

Deviation from TensorFlow in white text.
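How such a plot could be built is sketched below: a confusion matrix for each implementation, plus the element-wise deviation of the hardware result from the reference. This is only an illustration. The three-class labels are hypothetical toy data, not RiNNgs results, and row normalization to percent is an assumption about how the plotted matrices are scaled.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix in percent (rows: true class)."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for truth, pred in zip(y_true, y_pred):
        counts[truth][pred] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([100.0 * c / total if total else 0.0 for c in row])
    return matrix


def deviation(hardware, reference):
    """Element-wise difference: hardware matrix minus reference matrix."""
    return [
        [h - r for h, r in zip(hw_row, ref_row)]
        for hw_row, ref_row in zip(hardware, reference)
    ]


# Hypothetical toy labels (NOT data from the talk).
y_true = [0, 0, 1, 1, 2, 2]
y_reference = [0, 0, 1, 1, 2, 1]   # stands in for the TensorFlow model
y_hardware = [0, 1, 1, 1, 2, 1]    # stands in for the HLS implementation

cm_reference = confusion_matrix(y_true, y_reference, 3)
cm_hardware = confusion_matrix(y_true, y_hardware, 3)
cm_deviation = deviation(cm_hardware, cm_reference)
```

With these toy labels only the first row deviates: the hardware matrix loses 50 points on the diagonal and gains 50 in the neighbouring cell.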



HLS4ML

custom

Model features reproduced nicely with fixed precision





Reduced precision arithmetic (fixed point on FPGA) instead of floating point (in TensorFlow).

Each point in the plane represents a combination of bit-width and fractional part.

Accuracy normalized to TensorFlow (74%).



The brighter the yellow, the better (i.e. the closer to the TensorFlow model performance)
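What one point of the (bit-width, fractional part) plane means for a single value can be sketched as follows. This is a minimal illustration assuming a signed two's-complement format with round-to-nearest and saturation; those choices, and the function name `to_fixed`, are assumptions made for the example, and the fixed-point types used in an actual HLS design may round and overflow differently.

```python
def to_fixed(x, total_bits, frac_bits):
    """Quantize `x` to a signed fixed-point value.

    total_bits: full word width, sign bit included.
    frac_bits:  number of bits after the binary point.
    Rounding to nearest and saturation are assumed choices.
    """
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    code = round(x * scale)                    # integer code word
    code = max(lowest, min(highest, code))     # saturate on overflow
    return code / scale


weight = 0.3
coarse = to_fixed(weight, total_bits=8, frac_bits=4)    # step of 1/16
fine = to_fixed(weight, total_bits=16, frac_bits=12)    # step of 1/4096
clipped = to_fixed(100.0, total_bits=8, frac_bits=4)    # out of range
```

Fewer fractional bits give a coarser grid (0.3 becomes 0.3125 with 4 fractional bits), while too few integer bits clip large values (100.0 saturates at 7.9375 in the 8-bit example).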

HLS4ML

custom


HLS normalized to Tensorflow [%]


## Different code, different hardware


DSP - Digital Signal Processor, hardware unit for arithmetic (e.g. multiply and accumulate)

Resources expressed in % of available in Xilinx VCU118

### HLS4ML





The darker the blue, the better (i.e. smaller circuit area and faster design)

Reports for the other resources are on the backup slides



# **Timing Performance**



HLS timing report (of the digital circuit):

- operating conditions (clock period and frequency)
- time delay to produce the output (latency)
- minimum time interval between consecutive inputs (interval, which sets the throughput)

|                                                | Clock [ns] | Frequency [MHz] | Latency [clock ticks] | Interval [clock ticks] |
|------------------------------------------------|------------|-----------------|-----------------------|------------------------|
| HLS4ML (reuse 8)                               | >7.1       | <140            | 18-25                 | 11                     |
| CUSTOM                                         | >2.5       | <400            | 50-160                | 4                      |
| NA62 requirements (satisfied by both designs)  |            |                 | << 1 ms               | < 100 ns               |
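The comparison with the NA62 requirements can be checked by converting clock ticks to nanoseconds. The sketch below copies the figures from the table and assumes each design runs at the shortest clock period listed and at its worst-case latency. The 1 ms latency bound is taken literally although the slide states "<< 1 ms", and the names used are hypothetical.

```python
# Values copied from the timing table; the worst-case latency is used.
DESIGNS = {
    "HLS4ML (reuse 8)": {"clock_ns": 7.1, "latency_ticks": 25, "interval_ticks": 11},
    "CUSTOM": {"clock_ns": 2.5, "latency_ticks": 160, "interval_ticks": 4},
}

MAX_LATENCY_NS = 1_000_000.0   # 1 ms, upper bound assumed from "<< 1 ms"
MAX_INTERVAL_NS = 100.0        # from "< 100 ns"


def ticks_to_ns(ticks, clock_ns):
    """Convert a number of clock ticks to nanoseconds."""
    return ticks * clock_ns


def timing_summary(design):
    """Latency and interval in ns, plus a pass/fail flag for the bounds."""
    latency_ns = ticks_to_ns(design["latency_ticks"], design["clock_ns"])
    interval_ns = ticks_to_ns(design["interval_ticks"], design["clock_ns"])
    return {
        "latency_ns": latency_ns,
        "interval_ns": interval_ns,
        "meets_na62": latency_ns < MAX_LATENCY_NS and interval_ns < MAX_INTERVAL_NS,
    }


SUMMARY = {name: timing_summary(design) for name, design in DESIGNS.items()}
```

Under these assumptions the intervals come out at roughly 78 ns (HLS4ML) and 10 ns (custom), and the worst-case latencies at roughly 178 ns and 400 ns. A slower clock than the one assumed would lengthen all four figures proportionally.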



- HLS is a fast C-based FPGA design methodology
- Can be used for **NN** but requires hardware-aware programming skills
- RiNNgs dense model implementation fulfills NA62 **online** requirements
- Next steps are on-board integration and convolutional model development

# APEIRON

Abstract Processing Environment for Intelligent Readout systems based On Neural networks



### More info at <https://apegate.roma1.infn.it>

## Back up 0



RiNNgs dense model confusion matrix from TensorFlow - Keras




BRAM - Block RAM, on-chip random access memory

### HLS4ML





## Back up 1

Resources expressed in % of available in Xilinx VCU118

not shown if accuracy < 50% (custom only)


Resource utilization [%]



custom

Design space exploration - BRAM (4320 units available)

FF - Flip Flop, elementary register

HLS4ML



## Back up 2

Resources expressed in % of available in Xilinx VCU118

not shown if accuracy < 50% (custom only)




custom

LUT - Lookup table, configurable logic elements

### HLS4ML





## Back up 3

Resources expressed in % of available in Xilinx VCU118

not shown if accuracy < 50% (custom only)




custom

Design space exploration - LUT (1.2M units available)



## **RiNNgs CONV**



8 filters




Custom algorithm inspired by video streaming (data reuse, pipeline processing).

Design space exploration with a random data set (no weights loaded yet).



### Speed as a function of image size

### **Preliminary indications:**

- Lightweight design
- Minimal dependency on image size
- Interval = Latency = Reading time
- Not fast enough for NA62

(e.g. 2500 clock ticks for a 50 x 50 pixel image; at 100 MHz the interval is 25 µs)

Ongoing exploration of a more parallel algorithm (processing multiple pixels per clock tick)
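The arithmetic behind the 25 µs figure, and the gain expected from reading several pixels per tick, can be sketched as follows. The model assumes the interval equals the image reading time, as stated in the list above, and that parallel reads scale ideally. The function name and the `pixels_per_tick` parameter are hypothetical, and the factor of 4 is only an example value.

```python
from math import ceil


def frame_interval_us(width, height, clock_mhz, pixels_per_tick=1):
    """Interval in microseconds when it equals the image reading time.

    Ticks needed = number of pixels / pixels read per tick (rounded up);
    dividing ticks by the clock frequency in MHz yields microseconds.
    """
    ticks = ceil(width * height / pixels_per_tick)
    return ticks / clock_mhz


serial = frame_interval_us(50, 50, clock_mhz=100)                       # 1 pixel per tick
parallel4 = frame_interval_us(50, 50, clock_mhz=100, pixels_per_tick=4)  # example factor
```

One pixel per tick reproduces the 2500 ticks and 25 µs quoted above; in this idealized model four pixels per tick would bring the interval down to 6.25 µs.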