Digital AI Accelerator Hardware

Project motivation: Much effort has gone into maximizing the peak compute efficiency (TFLOPS and TFLOPS/W) of AI accelerator cores. However, sustaining high compute utilization (Fig. 2) in large-scale deep learning (DL) systems remains a critical challenge, and achieving sustainable scalability (Fig. 2) is non-trivial, especially in many-core systems (Fig. 1). At the same time, the AI accelerator core must be flexible enough to support various bit precisions and a wide variety of DL models (Fig. 3), which differ in their key computing kernels and data flows. This project aims to meet all of these goals across the full range of the DL system stack without sacrificing peak compute efficiency.
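The two system-level metrics above are commonly defined as ratios: compute utilization is the throughput actually sustained on a workload divided by the peak throughput, and scaling efficiency compares an N-core system's throughput with N times that of a single core (1.0 means ideal linear scaling). The short Python sketch below illustrates only this arithmetic; the function names and the numbers in the usage lines are hypothetical and are not measurements from this project.

```python
def compute_utilization(achieved_tflops: float, peak_tflops: float) -> float:
    """Fraction of the peak compute throughput that a workload actually sustains."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return achieved_tflops / peak_tflops


def scaling_efficiency(single_core_tflops: float, n_cores: int,
                       multi_core_tflops: float) -> float:
    """Measured N-core throughput relative to ideal linear scaling (N x one core)."""
    if n_cores < 1 or single_core_tflops <= 0:
        raise ValueError("need n_cores >= 1 and a positive single-core throughput")
    return multi_core_tflops / (n_cores * single_core_tflops)


# Hypothetical numbers, chosen only to show the arithmetic.
print(compute_utilization(3.0, 4.0))      # 0.75
print(scaling_efficiency(3.0, 4, 10.5))   # 0.875
```

A core can therefore have excellent peak TFLOPS/W and still deliver poor system-level efficiency if either ratio is low.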

Project description: We employ a 2D systolic array (Fig. 4) with heterogeneous compute engines and architectural programmability to provide this flexibility. Within each core, we aim to maximize data reuse, and each core connects to its neighboring cores (and to memory) to achieve near-ideal scaling in multi-core systems. We are also actively researching how to exploit data sparsity while preserving the regular structure of the original 2D systolic array. In addition, we aim to maximize the utilization of the processing elements (PEs) for two representative kernels mapped onto the 2D systolic array: 1) convolution and 2) fully-connected layers.
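To illustrate how a 2D systolic array reuses data and why PE utilization depends on the kernel, here is a minimal Python sketch under stated assumptions: an output-stationary dataflow (each PE keeps one output partial sum), one-cycle links between neighboring PEs, and a simple tiling model in which output tiles run back to back with no overlap. Convolution is mapped to a matrix multiply through the standard im2col transformation. The function names, the 8 x 8 array size, and the layer shapes in the usage lines are hypothetical illustrations, not the actual dataflow or results of the cores described in the references.

```python
import math


def systolic_matmul(A, B):
    """Cycle-level model of an output-stationary M x N systolic array computing A @ B.

    PE(i, j) owns output C[i][j]. Row i of A enters at the left edge delayed by i
    cycles, column j of B enters at the top edge delayed by j cycles, and every PE
    forwards its operands right/down one hop per cycle. Each A element is thus
    reused by N PEs and each B element by M PEs without re-reading memory.
    """
    M, K, N = len(A), len(B), len(B[0])
    assert all(len(row) == K for row in A), "inner dimensions must match"
    acc = [[0] * N for _ in range(M)]
    a_reg = [[0] * N for _ in range(M)]  # operand latched by each PE, moving right
    b_reg = [[0] * N for _ in range(M)]  # operand latched by each PE, moving down
    cycles = M + N + K - 2  # the last MAC happens at cycle (M-1)+(N-1)+(K-1)
    for t in range(cycles):
        new_a = [[0] * N for _ in range(M)]
        new_b = [[0] * N for _ in range(M)]
        for i in range(M):
            for j in range(N):
                if j == 0:  # left edge: skewed feed of A[i][t - i]
                    k = t - i
                    a_in = A[i][k] if 0 <= k < K else 0
                else:  # value the left neighbor latched last cycle
                    a_in = a_reg[i][j - 1]
                if i == 0:  # top edge: skewed feed of B[t - j][j]
                    k = t - j
                    b_in = B[k][j] if 0 <= k < K else 0
                else:  # value the upper neighbor latched last cycle
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in  # one multiply-accumulate per PE per cycle
                new_a[i][j], new_b[i][j] = a_in, b_in
        a_reg, b_reg = new_a, new_b
    return acc, cycles


def gemm_utilization(M, K, N, rows, cols):
    """PE utilization when an (M x K) @ (K x N) GEMM is tiled onto a rows x cols array.

    Simplified model: output tiles run back to back with no overlap, and each tile
    costs K + rows + cols - 2 cycles (pipeline fill and drain included).
    """
    tiles = math.ceil(M / rows) * math.ceil(N / cols)
    cycles = tiles * (K + rows + cols - 2)
    return (M * K * N) / (rows * cols * cycles)


def conv_as_gemm(h_out, w_out, c_in, c_out, kh, kw):
    """im2col view of a convolution as GEMM dimensions (M, K, N)."""
    return h_out * w_out, c_in * kh * kw, c_out


C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, cycles)  # [[19, 22], [43, 50]] 4

# Hypothetical 8 x 8 array: batch-1 fully-connected layer vs. a 3x3 convolution.
print(f"{gemm_utilization(1, 1024, 1024, 8, 8):.3f}")                    # 0.123
print(f"{gemm_utilization(*conv_as_gemm(56, 56, 64, 64, 3, 3), 8, 8):.3f}")  # 0.976
```

Under this simplified model, a batch-1 fully-connected layer occupies only one of the eight PE rows (utilization about 0.12), whereas the im2col-mapped convolution keeps the array about 98% busy; batching the fully-connected layer to M = 64 would raise it to about 0.99 in the same model. This gap is why the two kernels call for different mapping strategies on the same array.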

[1] J. Oh, S. Lee, M. Kang, et al., “A 3.0 TFLOPS 0.62V Scalable Processor Core for High Compute Utilization AI Training and Inference,” IEEE Symposium on VLSI Circuits (VLSI Symposium), Jun. 2020.

[2] A. Agrawal, S. K. Lee, J. Silberman, M. Ziegler, M. Kang, et al., “A 7nm 4-Core AI Chip with 25.6TFLOPS Hybrid FP8 Training, 102.4TOPS INT4 Inference and Workload-Aware Throttling,” IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Feb. 2021, to appear.

[3] S. Venkataramani, X. Sun, N. Wang, C.-Y. Chen, J. Choi, M. Kang, et al., “Efficient AI System Design with Cross-layer Approximate Computing,” Proceedings of the IEEE (invited), vol. 108, no. 12, pp. 2232–2250, Dec. 2020.

[4] S. Venkataramani, et al., M. Kang, et al., K. Gopalakrishnan, “RaPiD: AI Accelerator for Ultra-low Precision Training and Inference,” ACM/IEEE International Symposium on Computer Architecture (ISCA), Jun. 2021.

The contents are adapted from IEEE publications © 2019–2021 IEEE.

UCSD Electrical and Computer Engineering (ECE) Department