In our previous deep in-memory architecture (DIMA) research, we achieved significant (up to 100×) gains in energy efficiency and throughput by operating the hardware in an aggressively low signal-to-noise ratio (SNR) regime and exploiting the inherent noise tolerance of the algorithms. We strongly believe that circuits and architectures based on emerging devices can provide a breakthrough, yielding an additional order-of-magnitude improvement in system efficiency. We also believe that such research must seriously consider the non-idealities that emerging devices typically exhibit. These non-idealities will be overcome by leveraging error-resiliency techniques or energy vs. quality control mechanisms.
Research description: There is much interest in incorporating inference capabilities into sensor-rich embedded platforms such as autonomous vehicles and wearables. This calls for energy-efficient realizations of machine learning systems that process sensory data, i.e., sensory embedded systems. A typical sensory embedded system physically separates its sensing and computational subsystems, a separation mandated by the differing requirements of the sensing and computational functions. As a consequence, energy consumption in such systems tends to be dominated by the energy consumed in transferring data over the sensor-processor interface (communication energy) and the energy consumed in processing the data in the digital processor (computational energy). In this research, we proposed an in-sensor computing architecture (Fig. 1) that (mostly) eliminates the sensor-processor interface by embedding inference computations in the noisy sensor fabric in the analog domain and retraining the hyperparameters to compensate for non-ideal computations. The architecture supports two modes: 1) in-sensor computing mode and 2) normal mode. In in-sensor computing mode, a Compute Sensor performs both feature extraction and classification in the analog domain in close proximity to the CMOS active pixel sensor (APS) array, achieving 17× energy savings. The original functionality of the CMOS image sensor is also preserved, so a high-quality image can still be generated in normal mode when it is needed. We fabricated a silicon prototype (Fig. 2) of the CMOS image sensor (without embedded computing) to test the quality of the captured images.
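The idea of retraining to compensate for non-ideal analog computation can be illustrated with a small simulation. The sketch below is not the Compute Sensor design: it assumes a hypothetical non-ideality model (a fixed per-element gain error plus Gaussian noise on the analog sum) and uses a plain perceptron update on the classifier weights as a stand-in for the retraining step. All names (analog_dot, retrain, gains, noise_sigma) are illustrative.

```python
import random

def analog_dot(weights, pixels, gains, noise_sigma, rng):
    """Assumed model of a non-ideal analog dot product: each product is
    scaled by a fixed per-element gain error, then noise is added to the sum."""
    total = sum(g * w * p for g, w, p in zip(gains, weights, pixels))
    return total + rng.gauss(0.0, noise_sigma)

def accuracy(weights, data, gains, noise_sigma, rng):
    """Fraction of samples whose sign is classified correctly by the noisy hardware model."""
    hits = 0
    for pixels, label in data:
        pred = 1 if analog_dot(weights, pixels, gains, noise_sigma, rng) >= 0 else -1
        hits += pred == label
    return hits / len(data)

def retrain(weights, data, gains, noise_sigma, rng, epochs=20, lr=0.1):
    """Perceptron-style retraining with the non-ideal model in the forward pass,
    so the learned weights absorb the fixed gain errors."""
    w = list(weights)
    for _ in range(epochs):
        for pixels, label in data:
            pred = 1 if analog_dot(w, pixels, gains, noise_sigma, rng) >= 0 else -1
            if pred != label:
                w = [wi + lr * label * p for wi, p in zip(w, pixels)]
    return w

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 8                   # number of "pixels" per sample (illustrative)

# Synthetic, linearly separable data labeled by an ideal weight vector.
true_w = [rng.uniform(-1, 1) for _ in range(n)]
data = []
for _ in range(200):
    x = [rng.uniform(-1, 1) for _ in range(n)]
    label = 1 if sum(t * v for t, v in zip(true_w, x)) >= 0 else -1
    data.append((x, label))

# Hypothetical fixed gain mismatch of the analog fabric (up to +/-50%).
gains = [rng.uniform(0.5, 1.5) for _ in range(n)]

acc_before = accuracy(true_w, data, gains, 0.05, rng)
w_new = retrain(true_w, data, gains, 0.05, rng)
acc_after = accuracy(w_new, data, gains, 0.05, rng)
print("accuracy with ideal weights on non-ideal hardware:", acc_before)
print("accuracy after retraining on non-ideal hardware:  ", acc_after)
```

Keeping the non-ideal model inside the training loop is the point of the sketch: the weights are adapted to the hardware as it actually behaves instead of the hardware being made ideal, which is the trade that allows the analog circuits to run at low energy.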
 S. Zhang, M. Kang, C. Sakr, and N. R. Shanbhag, “Reducing the Energy Cost of Inference via In-sensor Information Processing,” arXiv, https://arxiv.org/abs/1607.00667, July 2016.
Compute in NAND Flash
Research description: Previous in-memory research on SRAM is useful for tackling problems at the kB scale. However, addressing large-scale machine learning problems (at the GB to TB scale) requires in-memory computing architectures for high-density storage technologies. NAND flash memories are an industry standard for large-scale storage, but their throughput and energy consumption are primarily limited by the off-chip I/O interface and its bandwidth (limited to 800 MB/s). Furthermore, the external bandwidth of a typical SSD is 16× smaller than its internal bandwidth. Techniques that enable on-chip processing would therefore achieve large throughput and energy savings by minimizing the need to transfer data off-chip. This makes in-memory computing on NAND flash an attractive proposition. However, various challenges must be overcome to enable such techniques. These challenges arise primarily from large threshold voltage variation, small pitch, and technology limitations. This research proposes a deep in-memory architecture for NAND flash (DIMA-F) (Fig. 1), which brings computing functionality into NAND flash memories. DIMA-F reads the stored data and computes highly parallel dot products on single-level cell (SLC) NAND flash memories in the analog domain. This architecture can be used to perform classification, compression, filtering, and feature extraction. DIMA-F was evaluated on face detection and face recognition using the Caltech 101 database and the Extended Yale B database, respectively. System-level simulations that include models of circuit non-idealities and variations show marginal degradation in accuracy compared to fixed-point implementations, while achieving 8× to 23× energy savings, 9× to 15× throughput gain, and 72× to 345× improvement in energy-delay product (EDP) compared to a conventional NAND flash architecture that incorporates an external digital ASIC for computation.
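The reported EDP range is consistent with the reported energy and throughput ranges. EDP is energy multiplied by delay, so if the delay per operation is assumed to scale inversely with throughput, the EDP improvement is the product of the energy savings and the throughput gain. The short check below makes that arithmetic explicit; the function name and the pairing of the low ends (8×, 9×) and the high ends (23×, 15×) are assumptions made for illustration, not details taken from the paper.

```python
def edp_improvement(energy_savings, throughput_gain):
    """EDP = energy * delay. Assuming delay per operation is proportional to
    1 / throughput, the improvement factors multiply."""
    return energy_savings * throughput_gain

# Assumed pairing: low end with low end, high end with high end.
low = edp_improvement(8, 9)
high = edp_improvement(23, 15)
print(low, high)  # 72 345
```

Under these assumptions the two products reproduce the 72× and 345× endpoints quoted above.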
 S. K. Gonugondla, M. Kang, and N. R. Shanbhag, “Energy-Efficient Deep In-memory Architecture for NAND Flash Memories,” IEEE International Symposium on Circuits and Systems (ISCAS), Best Paper Award in “Neural Systems and Applications,” May 2018.