## A 58.6mW Real-Time Programmable Object Detector with Multi-Scale Multi-Object Support Using Deformable Parts Model on 1920x1080 Video at 30fps

#### <u>Amr Suleiman</u>, Zhengdong Zhang, and Vivienne Sze



**Massachusetts Institute of Technology** 

Symposia on VLSI Technology and Circuits

#### **Why Object Detection?**









## **Object Detection System Requirements**

#### **High Image Resolution**



## Outline

- Detection with Deformable Parts Models (DPM)
- Chip Architecture
- Main Contributions
- Chip Specifications and Comparisons
- Summary

#### **General Object Detection Methodology**

Localization (Where?)

Classification (True or False?)



#### **Localization: 3D Search**
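The three search dimensions (x, y, and scale) can be illustrated with a minimal sliding-window sketch. The 128-pixel window matches the maximum template size quoted later in the deck; the stride of 8 pixels and the scale step of 1.2 are hypothetical values chosen for illustration, not figures from the slides.

```python
def search_positions(width, height, window, stride, scale_step, levels):
    """Yield (level, x, y) for every window position in a simple image pyramid.

    Each pyramid level shrinks the image by `scale_step` relative to the
    previous one, so a fixed-size window covers larger objects at higher levels.
    """
    for level in range(levels):
        scale = scale_step ** level
        w = int(width / scale)
        h = int(height / scale)
        if w < window or h < window:
            break  # the window no longer fits at this scale
        for y in range(0, h - window + 1, stride):
            for x in range(0, w - window + 1, stride):
                yield level, x, y


# Hypothetical parameters: stride and scale_step are assumptions.
num_windows = sum(1 for _ in search_positions(1920, 1080, 128, 8, 1.2, 12))
```

Every position produced here needs its own classification, which is why the number of scales has such a direct effect on the compute budget.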



## **Classification with DPM Templates**



HOG: Histogram of Oriented Gradients

P. F. Felzenszwalb et al., TPAMI 2010

#### **How Does DPM Work?**



P. F. Felzenszwalb et al., TPAMI 2010
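As a rough sketch of the scoring rule in the cited DPM formulation: a detection score is the root filter response plus, for each part, the best part response after subtracting a quadratic deformation cost. The function name, the dictionary layout, and the example numbers below are assumptions made for illustration.

```python
def dpm_score(root_score, part_responses, deform_weights, bias=0.0):
    """Combine root and part scores for one root location.

    part_responses[i] maps a displacement (dx, dy) from the part's anchor
    to that part's filter response at the displaced position.
    deform_weights[i] = (a, b, c, d) gives the deformation cost
    a*dx + b*dy + c*dx**2 + d*dy**2 for part i.
    """
    total = root_score + bias
    for responses, (a, b, c, d) in zip(part_responses, deform_weights):
        # Each part picks the displacement with the best response-minus-cost.
        total += max(
            resp - (a * dx + b * dy + c * dx * dx + d * dy * dy)
            for (dx, dy), resp in responses.items()
        )
    return total


# Hypothetical example: one part, two candidate displacements.
example = dpm_score(
    root_score=1.0,
    part_responses=[{(0, 0): 0.5, (1, 0): 0.9}],
    deform_weights=[(0.0, 0.0, 0.1, 0.1)],
)
```

The inner maximization over displacements is what lets parts shift relative to the root, and it is also where the extra computation compared to a rigid template comes from.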

#### **Detection Accuracy**




#### **Deformable Parts are More Accurate**



#### **Detecting parts enhances the accuracy by 2x**

Measured on INRIA person dataset\*

**<u>Challenge</u>**: DPM requires **35x** more computation than rigid-body detection (without parts)

\*[http://pascal.inrialpes.fr/data/human/]


#### **12-level Feature Pyramid**



## **2** Programmable Detectors



Programmable DPM model with a maximum template size of **128x128 pixels** 





## **Optimizations for Energy Efficiency**



#### <u>Goal:</u> Reducing the parts classification overhead

#### Methods:

- Reduce the number of classifications (Pruning & Vector Quantization)
- Reduce the cost of each classification (Basis Projection)

# Method 1

## Reduce the number of classifications

#### **Parts Classification in Regions of Interest**
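One way to picture the pruning step is sketched below, assuming that regions of interest are chosen by thresholding the root score so that parts are classified only around promising root locations. The threshold rule, the function names, and the scores are assumptions; the slides state only the resulting pruning rate.

```python
def select_rois(root_scores, threshold):
    """Return the positions whose root score passes the threshold."""
    return [pos for pos, score in root_scores.items() if score >= threshold]


def pruned_fraction(root_scores, threshold):
    """Fraction of positions where parts classification is skipped."""
    kept = len(select_rois(root_scores, threshold))
    return 1.0 - kept / len(root_scores)


# Hypothetical root scores for four window positions.
scores = {(0, 0): -1.2, (8, 0): 0.7, (0, 8): -0.4, (8, 8): -2.0}
rois = select_rois(scores, threshold=0.0)
skipped = pruned_fraction(scores, threshold=0.0)
```

Because the part filters run only on the surviving positions, the cost of the parts stage scales with the number of regions kept rather than with the full image.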




#### **Feature Storage for Parts Classification**

- Store features for reuse by parts to avoid re-computation



#### **Vector Quantization**

**16x** reduction in memory size (520 KB vs. 32 KB)

2x reduction in overall chip area
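The memory saving comes from storing a short codebook index in place of each full feature vector (520 KB / 32 KB is about 16x). A minimal sketch of nearest-codeword quantization follows; the codebook, vector size, and distance metric are hypothetical and are not taken from the chip.

```python
def nearest_codeword(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vector, codebook[k])),
    )


def quantize(features, codebook):
    """Replace each feature vector by the index of its nearest codeword."""
    return [nearest_codeword(f, codebook) for f in features]


def reconstruct(indices, codebook):
    """Look the approximate feature vectors back up from stored indices."""
    return [codebook[k] for k in indices]


# Hypothetical 2-entry codebook and two feature vectors.
codebook = [[0.0, 0.0], [1.0, 1.0]]
indices = quantize([[0.1, 0.2], [0.9, 0.8]], codebook)
```

Only `indices` needs to live in the feature storage; the reconstruction is a table lookup at the time the parts are classified.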

# Method 2

## Reduce the cost of each classification

## **Multiplication by Zero Can be Skipped**

Classification = Dot product



- Dot product $\rightarrow$ 3 K multiplications
- HD image $\rightarrow$ 88 M multiplications
- HD pyramid $\rightarrow$ 235 M multiplications

#### With more zero weights:

- Fewer multiplications
- Smaller weight memory size and bandwidth (BW)
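The effect of zero weights can be sketched with a dot product that skips them and counts the multiplications actually performed. This is an illustrative software model, not the chip's datapath; the weights and features are made up.

```python
def sparse_dot(weights, features):
    """Dot product that skips zero weights.

    Returns (result, number_of_multiplications_performed).
    """
    total = 0.0
    multiplications = 0
    for w, x in zip(weights, features):
        if w == 0:
            continue  # a zero weight contributes nothing, so skip the multiply
        total += w * x
        multiplications += 1
    return total, multiplications


# Hypothetical template with half of its weights equal to zero.
result, mults = sparse_dot([0, 2, 0, 3], [5, 1, 7, 2])
```

In hardware the same idea also means the zero weights never need to be stored or fetched, which is the memory size and bandwidth benefit listed above.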

#### **Project the Classification to a Sparse Space**
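The idea can be sketched under the assumption that the projection uses an orthonormal basis: projecting both the weights and the features leaves the dot product unchanged, while a well-chosen basis concentrates the weights into fewer nonzero entries. The 2-D basis and the vectors below are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(basis, v):
    """Coordinates of v in an orthonormal basis (a list of basis vectors)."""
    return [dot(b, v) for b in basis]


r = 1 / math.sqrt(2)
basis = [[r, r], [r, -r]]   # orthonormal 2-D basis (assumed for illustration)

weights = [3.0, 3.0]        # dense in the original space
features = [2.0, 5.0]

w_proj = project(basis, weights)    # second coordinate becomes zero
x_proj = project(basis, features)

original_score = dot(weights, features)
projected_score = dot(w_proj, x_proj)   # equal up to rounding error
```

After projection one of the two weights is zero, so the classification needs half as many multiplications while producing the same score.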




## **Overall Optimizations Savings**



\*mAP: mean Average Precision, on PASCAL VOC2007 with 20 classes


### **Chip Die Photo and Specifications**



| Parameter    | Value                    |
|--------------|--------------------------|
| Technology   | 65nm CMOS                |
| Chip size    | 4.0 x 4.0 mm<sup>2</sup> |
| Logic gates  | 3283 kgates              |
| SRAM         | 280.1 KB                 |
| Supply       | 0.77 – 1.11 V            |
| Frequency    | 62.5 – 125 MHz           |
| Frame rate   | 30 – 60 fps              |
| Resolution   | 1920x1080                |
| Power        | 58.6 – 216.5 mW          |
| Energy/pixel | 0.94 – 1.74 nJ           |

Two detectors, 97% pruning.
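The energy-per-pixel range follows from dividing power by pixel throughput. The short check below assumes the low end of the power range corresponds to 30 fps and the high end to 60 fps, which is how the ranges in the specifications line up.

```python
def energy_per_pixel_nj(power_mw, width, height, fps):
    """Energy per pixel in nanojoules: power divided by pixels per second."""
    pixels_per_second = width * height * fps
    return power_mw * 1e-3 / pixels_per_second * 1e9


low = energy_per_pixel_nj(58.6, 1920, 1080, 30)    # about 0.94 nJ/pixel
high = energy_per_pixel_nj(216.5, 1920, 1080, 60)  # about 1.74 nJ/pixel
```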

## **Energy Scalability**



- With one detector, classification accounts for 15% of power and feature storage for 25%
- Adding an extra detector increases power by only **19%**

#### **Detection Examples with DPM Chip**

- Live video feed
- 1920x1080
- 30fps
- Detecting pedestrians



- Fixed frames
- 1920x1080
- Detecting cars & pedestrians



#### **Comparison with ASIC Object Detectors**

|                             | JSPS 2014 | This work | Notes              |
|-----------------------------|-----------|-----------|--------------------|
| Process                     | 65 nm     | 65 nm     |                    |
| Chip Size (mm<sup>2</sup>)  | 4.2 x 2.1 | 4.0 x 4.0 |                    |
| Voltage                     | 0.7 V     | 0.77 V    |                    |
| Resolution                  | 1920x1080 | 1920x1080 |                    |
| <b>#Object Classes</b>      | 2         | 2         |                    |
| Frame rate (fps)            | 30        | 30        |                    |
| Multi-scale                 | No        | 12 levels |                    |
| Deformable Parts            | No        | 8 parts   |                    |
| Accuracy (AP)\*             | 0.166     | 0.80      | 4.7x more accurate |
| Power (mW)                  | 84        | 58.6      |                    |
| Energy (nJ/pixel)           | 1.35      | 0.94      | 30% less energy    |

\*INRIA person dataset

#### Summary

- A 58.6mW object detection accelerator that processes 1920x1080 videos at 30 fps
  - Uses **deformable parts** for 2x increase in accuracy
  - Two programmable object detectors supporting 12 scales
- Pruning, vector quantization and feature basis projection reduce the DPM classification cost
  - Reduce power by **5x** and memory size by **3.6x**
- This accelerator enables object detection to be as energy-efficient as video compression at < 1 nJ/pixel

#### Acknowledgement

DARPA, Texas Instruments and TSMC University Shuttle