M. Kang, M. S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz, “An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM,” IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2014, pp. 8326-8330.
Motivation: Traditional von Neumann architectures suffer from an explicit separation between memory and computation, which imposes serious bottlenecks on energy efficiency and throughput, especially for big-data processing. As a result, the energy and delay efficiency of ML hardware is often limited by memory accesses. We have been tackling this challenge via an unconventional route by proposing the deep in-memory architecture (DIMA), which overcomes the memory wall by deeply embedding analog computation into a conventional memory (SRAM) array. Both memory reads and computations are performed with low-voltage-swing, low signal-to-noise ratio (SNR) analog operations, so processing at the component level is noisy and stochastic. This relaxed SNR budget is the source of DIMA's aggressive energy and throughput benefits. The low-SNR operations at the component level do not degrade system accuracy because of 1) the inherent error resiliency of ML algorithms, 2) DIMA's robustness, which comes from its unique signal processing flow, and 3) error-resiliency techniques such as on-chip training and ensemble classification. Since presenting the first concept of DIMA in 2014, we have validated DIMA's efficacy with three silicon prototypes in a 65 nm CMOS process, demonstrating up to 100x reduction in energy-delay product with negligible accuracy degradation.
 M. Kang, S. K. Gonugondla, and N. R. Shanbhag, “Deep In-memory Architectures for Machine Learning,” Vol. 1, pp. 1–197, Springer, first print in Dec. 2019.
 M. Kang, S. Gonugondla, and N. R. Shanbhag, “Deep In-memory Architectures: A Shannon-inspired Approach to Approximate Computing,” Proceedings of the IEEE, [Invited], Dec. 2020.
1. Multi-functional Deep In-memory Architecture (DIMA)
Project description: After validating the deep in-memory architecture (DIMA) for various algorithms in signal-processing-level simulations, we decided to validate the concept in a prototype IC. We followed two ground rules: do not alter the original SRAM bitcell, so that storage density is maintained, and preserve normal memory read and write functionality without any penalty. In addition, multiple circuit-level innovations were introduced to enable analog computation within the tight pitch constraints imposed by the memory bitcell and thus support column-wise parallel operation. Because DIMA employs non-traditional read and computing in a radically low-SNR regime, we had to build 14 test modes to check the functionality of each processing stage. We also needed to design probe circuitry to observe highly sensitive floating analog values, which can easily be disturbed by the observation itself. We then analyzed the commonality in the processing steps across various ML algorithms. Based on this analysis, four ML algorithms (and applications) were implemented with reconfiguration in a single IC in a 65 nm process: support vector machine (SVM) for face detection, k-nearest neighbor (k-NN) for handwritten number recognition, matched filter for gunshot sound detection, and template matching for face recognition. The chip was fabricated with an area of 1.2×1.2 mm², a bitcell size of 2.1×0.9 μm², four analog-to-digital converters (ADCs), and an embedded 16 KB SRAM. It achieves 9.7x energy and 5.6x throughput benefits simultaneously, at 480 pJ/decision and 3.4M decisions/s.
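The commonality across the four algorithms can be illustrated with a short Python sketch. It assumes, for illustration only, that each task reduces to one of two row-parallel kernels (a dot product or a Manhattan distance) followed by a simple decision step. The function names, the assignment of kernels to algorithms, and the ideal noise-free arithmetic are assumptions of this sketch, not details of the fabricated chip.

```python
import numpy as np

def kernel(stored, query, mode):
    """Compare every stored row against one query vector.

    mode="dot"       -> dot product per row
    mode="manhattan" -> sum of absolute differences per row
    (Both modes are assumptions made for this illustration.)
    """
    if mode == "dot":
        return stored @ query
    if mode == "manhattan":
        return np.abs(stored - query).sum(axis=1)
    raise ValueError(f"unknown mode: {mode}")

def svm_decide(weights, bias, x):
    """Linear SVM: sign of (w . x + b)."""
    score = kernel(weights[None, :], x, "dot")[0] + bias
    return 1 if score >= 0 else -1

def matched_filter_detect(template, x, threshold):
    """Matched filter: correlation against a template, then a threshold."""
    return bool(kernel(template[None, :], x, "dot")[0] >= threshold)

def template_match(templates, x):
    """Template matching: index of the closest stored template."""
    return int(np.argmin(kernel(templates, x, "manhattan")))

def knn_classify(samples, labels, x, k):
    """k-NN: majority label among the k closest stored samples."""
    distances = kernel(samples, x, "manhattan")
    nearest = np.argsort(distances, kind="stable")[:k]
    return int(np.bincount(labels[nearest]).argmax())

# Hypothetical toy data, chosen only to exercise each mode.
templates = np.array([[0, 0, 0], [10, 10, 10]])
samples = np.array([[0, 0], [1, 1], [9, 9], [10, 10]])
labels = np.array([0, 0, 1, 1])
```

In this toy setting, switching algorithms only changes the kernel mode and the final decision step, which is the kind of reuse that makes a single reconfigurable datapath plausible.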
 M. Kang, S. Gonugondla, A. Patil, and N. R. Shanbhag, “A Multi-Functional In-Memory Inference Processor Using a Standard 6T SRAM Array,” IEEE Journal of Solid-State Circuits, Vol. 53, Issue 2, pp. 642-655, Jan. 2018.
 M. Kang, S. Gonugondla, A. D. Patil, and N. R. Shanbhag, “A 481pJ/decision 3.4M decision/s Multifunctional Deep In-memory Inference Processor,” ArXiv, https://arxiv.org/abs/1610.07501, Oct. 2016.
2. In-memory Computing for Random Forest
Project description: After the successful implementation of the multi-functional deep in-memory architecture (DIMA), I extended the concept of DIMA to a more complex ML algorithm, random forest (RF). Many ML IC realizations have focused on efficient implementation of deep neural network (DNN) algorithms because of their state-of-the-art performance in various decision-making tasks. However, the high complexity of DNNs, with an irregular data flow across multiple layers, limits the achievable energy efficiency and throughput, making them unsuitable for severely resource-limited embedded platforms. In contrast, the RF algorithm is an attractive alternative because of the simplicity of its computations (mainly comparisons), its applicability to multi-class problems, and its high accuracy. In addition, the RF algorithm (Fig. 1) is an ensemble classifier that forms a global decision by collecting multiple local decisions from weak classifiers (binary decision trees). This ensemble nature provides strong resiliency to hardware errors, so DIMA can be operated in an aggressively low-swing, low-SNR regime for further energy savings. This IC (Fig. 2) was fabricated in a 65 nm CMOS process and tested on the Belgium traffic sign recognition dataset (Fig. 3), achieving 94% classification accuracy on eight traffic-sign classes with a 6.8x reduction in energy-delay product. To the best of my knowledge, this is the world's first ASIC implementation of the RF algorithm.
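The error-resiliency argument for ensembles can be made concrete with a toy Monte Carlo sketch. It assumes each tree independently returns the correct class except with some probability `flip_prob`, in which case it votes for a uniformly random wrong class. This independence model, the error rate, and the tree count are hypothetical choices for illustration and are not measurements from the IC; only the eight-class setting follows the text above.

```python
import numpy as np

def noisy_tree_votes(true_label, n_trees, n_classes, flip_prob, rng):
    """Votes of n_trees weak classifiers under a toy hardware-error model.

    Each tree votes for true_label, except that with probability flip_prob
    it votes for a uniformly chosen wrong class (an assumption of this sketch).
    """
    votes = np.full(n_trees, true_label)
    flipped = rng.random(n_trees) < flip_prob
    # Offsets in 1..n_classes-1 guarantee the flipped vote is a wrong class.
    wrong = (true_label + rng.integers(1, n_classes, size=n_trees)) % n_classes
    votes[flipped] = wrong[flipped]
    return votes

def majority_vote(votes, n_classes):
    """Global decision: the class receiving the most local votes."""
    return int(np.bincount(votes, minlength=n_classes).argmax())

def ensemble_accuracy(n_trees, n_classes, flip_prob, trials, seed=0):
    """Fraction of trials in which the majority vote recovers the true class."""
    rng = np.random.default_rng(seed)
    correct = 0
    for _ in range(trials):
        label = int(rng.integers(n_classes))
        votes = noisy_tree_votes(label, n_trees, n_classes, flip_prob, rng)
        correct += majority_vote(votes, n_classes) == label
    return correct / trials

# Hypothetical settings: 8 classes, 30% per-tree error rate.
single_tree = ensemble_accuracy(n_trees=1, n_classes=8, flip_prob=0.3, trials=2000)
forest = ensemble_accuracy(n_trees=64, n_classes=8, flip_prob=0.3, trials=2000)
```

Under this model a single tree is right only about as often as `1 - flip_prob`, while the majority vote over many trees is rarely wrong, because the errors are spread across several wrong classes. Real trees are correlated, so the benefit on hardware would be smaller than this idealized model suggests.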
 M. Kang, S. K. Gonugondla, and N. R. Shanbhag, “A 19.4 nJ/decision 364 K decisions/s in-memory random forest classifier in 6T SRAM array,” IEEE European Solid-State Circuits Conference (ESSCIRC), Sep. 2017, pp. 263–266.
 M. Kang, S. Gonugondla, S. Lim, and N. R. Shanbhag, “A 19.4 nJ/decision, 364K decisions/s, In-memory Random Forest Multi-class Inference Accelerator,” IEEE Journal of Solid-State Circuits, [Invited], July 2018.
 R. Timofte, K. Zimmermann, and L. Van Gool, “Multi-view traffic sign detection, recognition, and 3D localisation,” Journal of Machine Vision and Applications (MVA 2011), Springer-Verlag, Dec. 2011, DOI 10.1007/s00138-011-0391-3.
3. In-memory Computing with On-chip Training
 S. K. Gonugondla, M. Kang, and N. R. Shanbhag, “A 42pJ/Decision 3.12TOPS/W Robust In-Memory Machine Learning Classifier with On-Chip Training,” in IEEE International Solid-State Circuits Conference (ISSCC), Feb. 2018.
Research description: Previous DIMA realizations have achieved significant gains, e.g., ranging from 50x to 175x in energy-delay product (EDP) over conventional (SRAM + fixed-function digital processor) architectures. However, these gains are limited by the die-specific nature of process/voltage/temperature (PVT) variations and by the potential for data statistics to vary over time in sensory systems. This research studies the use of an on-chip stochastic gradient descent (SGD)-based trainer (eq. 1) to adapt to and track variations in PVT and data statistics, thereby realizing a robust deep in-memory support vector machine (SVM) classifier IC using a standard 6T SRAM array in 65 nm CMOS (Fig. 1). Measurement results show that on-chip learned weights enable accurate inference in the presence of scaled voltage swing, leading to 2.4x greater energy savings than an off-chip-trained DIMA implementation. Compared to a conventional fixed-function digital architecture with an identical SRAM array, the prototype IC achieves up to 21x lower energy and 4.7x smaller delay, which multiply to a roughly 100x reduction in EDP (21 × 4.7 ≈ 99), by overcoming the variations in PVT (Fig. 2) and data statistics (Fig. 3).
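The idea of training through the hardware can be sketched in Python. Since eq. 1 is not reproduced in this text, the sketch assumes the textbook SGD update for a linear SVM with hinge loss and L2 regularization. The die-specific variation is modeled, hypothetically, as a fixed gain and offset applied to the computed score, and all names, constants, and the synthetic data are assumptions of this illustration.

```python
import numpy as np

def make_forward(gain=1.0, offset=0.0):
    """Score function; gain/offset stand in for die-specific variation (assumed model)."""
    def forward(w, b, x):
        return gain * float(w @ x) + b + offset
    return forward

def sgd_svm_train(X, y, forward, lr=0.01, reg=1e-3, epochs=20, seed=0):
    """Hinge-loss SGD where the margin is observed through `forward`.

    The update direction uses the nominal gradient (y * x); only the
    margin test sees the distorted score, so the weights adapt to it.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            margin = y[i] * forward(w, b, X[i])
            w *= 1.0 - lr * reg            # L2 weight decay
            if margin < 1.0:               # sample violates the margin
                w += lr * y[i] * X[i]
                b += lr * y[i]
    return w, b

def accuracy(X, y, w, b, forward):
    """Fraction of samples whose predicted sign matches the label."""
    preds = np.array([1 if forward(w, b, x) >= 0 else -1 for x in X])
    return float(np.mean(preds == y))

def make_data(n=200, seed=1):
    """Two well-separated 2-D clusters with labels +1 / -1 (synthetic)."""
    rng = np.random.default_rng(seed)
    y = np.where(rng.random(n) < 0.5, 1, -1)
    X = y[:, None] * 2.0 + 0.5 * rng.standard_normal((n, 2))
    return X, y

X, y = make_data()
ideal = make_forward()                       # variation-free model
chip = make_forward(gain=0.6, offset=1.5)    # hypothetical distorted die

w_off, b_off = sgd_svm_train(X, y, ideal)    # "off-chip" training
w_on, b_on = sgd_svm_train(X, y, chip)       # training through the distorted path

acc_off_ideal = accuracy(X, y, w_off, b_off, ideal)
acc_on_chip = accuracy(X, y, w_on, b_on, chip)
acc_off_on_chip = accuracy(X, y, w_off, b_off, chip)  # for comparison
```

Comparing `acc_on_chip` with `acc_off_on_chip` shows how weights learned through the distorted path can compensate for an offset that weights trained on the ideal model never saw. The numbers depend entirely on the assumed distortion and data.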
 S. Gonugondla, M. Kang, and N. R. Shanbhag, “A Variation-Tolerant In-Memory Machine Learning Classifier via On-Chip Training,” IEEE Journal of Solid-State Circuits, [Invited], Sept. 2018.