Neural Networks on Emerging Devices

The amount of data in our world is exploding at an astounding rate; we have entered the era of Big Data. Large-scale neural networks, also known as deep neural networks (DNNs) or deep learning, have demonstrated great promise in processing Big Data, with state-of-the-art performance reported on many unstructured data processing tasks, ranging from visual object classification and speech recognition to natural language processing and information retrieval. Processing big data with large-scale neural networks involves two phases: the training phase and the operation phase. The training phase demands huge computing power, while energy (power) efficiency is one of the major considerations in the operation phase. We explore the computing power of GPUs for big data analytics and demonstrate an efficient GPU implementation of the training phase of large-scale neural networks. We also introduce a promising, ultra-high energy efficient implementation of the operation phase that takes advantage of emerging RRAM devices.

fig1.png

Energy Efficient Neural Network for Big Data Analytics

Neural Networks on Emerging Devices

RRAM-based Approximate Computation

Approximate computing provides a promising solution to close the power efficiency gap between current capabilities and future requirements. In this work, we introduce an RRAM-based power efficient framework for analog approximate computing. We first introduce a programmable RRAM-based approximate computing unit (RRAM-ACU) to accelerate numeric computation, and then propose a scalable approximate computing framework on top of it. We also introduce a complete scheme for RRAM-ACU configuration, including neural approximator training, mapping of approximator parameters to RRAM states, and an RRAM writing scheme. A predictive compact model is developed to analyze the configuration overhead. Simulation results on a set of diverse benchmarks show that the RRAM-ACU achieves 10.26~491.02× speedup and power efficiency of 24.59~567.98 GFLOPS/W with an average quality loss of 8.72%. In addition, a system-level simulation of the HMAX application atop our proposed RRAM-based approximate computing framework demonstrates more than 12.8× power efficiency improvement over its pure digital counterparts (CPU, GPU, and FPGA).
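To give a flavor of the parameter-to-RRAM-state mapping step in the configuration flow, the sketch below quantizes trained approximator weights onto a discrete set of conductance levels. The function name, conductance range, and level count are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def map_weights_to_conductance(weights, g_min=1e-6, g_max=1e-4, levels=64):
    """Map trained approximator weights onto discrete RRAM conductance states.

    Hypothetical illustration only: a real flow also needs sign handling
    (e.g., differential pairs of cells) and a write-verify programming scheme.
    """
    w_max = np.max(np.abs(weights))
    # Normalize magnitudes to [0, 1], then quantize to the available levels.
    normalized = np.abs(weights) / w_max
    step = 1.0 / (levels - 1)
    quantized = np.round(normalized / step) * step
    # Linearly map quantized magnitudes into the device conductance range.
    conductance = g_min + quantized * (g_max - g_min)
    signs = np.sign(weights)  # sign realized by a differential cell pair
    return conductance, signs, w_max

# Example: map a small random weight matrix onto 64 conductance levels.
rng = np.random.default_rng(0)
g, s, scale = map_weights_to_conductance(rng.standard_normal((4, 4)))
```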

fig2.png

RRAM-based Approximate Computation

 

Spiking Neural Network with RRAM for Real-World Applications

Inspired by the human brain's function and efficiency, neuromorphic computing offers a promising solution for a wide range of tasks, from brain-machine interfaces to real-time classification. The spiking neural network (SNN), which encodes and processes information with bionic spikes, is an emerging neuromorphic model with great potential to drastically improve the performance and efficiency of computing systems. However, the lack of an energy efficient hardware implementation, together with the difficulty of training the model, significantly limits the application of spiking neural networks. In this work, we address these issues by building an SNN-based energy efficient system for real-time classification with metal-oxide resistive switching random-access memory (RRAM) devices. We implement different SNN training algorithms, including Spike-Timing-Dependent Plasticity (STDP) and the Neural Sampling method. Our RRAM SNN systems for these two training algorithms show good power efficiency and recognition performance on real-time classification tasks such as MNIST digit recognition. Finally, we propose a possible direction to further improve classification accuracy by boosting multiple SNNs.
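As a toy illustration of one of the training rules we implement, the sketch below applies a pairwise STDP update to a single synaptic weight. The function, time constants, and clipping bounds are illustrative assumptions rather than the parameters used in our RRAM system.

```python
import numpy as np

def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012,
                tau_plus=20.0, tau_minus=20.0, w_min=0.0, w_max=1.0):
    """Pairwise STDP: potentiate when the pre-spike precedes the post-spike,
    depress otherwise. Constants are illustrative, not taken from the paper.
    """
    dt = t_post - t_pre  # spike-time difference in ms
    if dt > 0:   # pre before post -> long-term potentiation
        dw = a_plus * np.exp(-dt / tau_plus)
    else:        # post before (or with) pre -> long-term depression
        dw = -a_minus * np.exp(dt / tau_minus)
    return float(np.clip(w + dw, w_min, w_max))

# Example: a pre-spike at 10 ms followed by a post-spike at 15 ms strengthens w.
w = stdp_update(0.5, t_pre=10.0, t_post=15.0)
```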

structure.png

Spiking Neural Network with RRAM

Large-Scale Neural Networks on GPU and FPGA

Training Neural Networks with GPU

Large-scale artificial neural networks (ANNs) have been widely used in data processing applications, and the training phase is their most computationally demanding stage. In recent years, graphics processing units (GPUs) have become a significant means of speeding up the training of large-scale neural networks by exploiting their massive parallelism. In our work, we study efficient parallel neural network training on servers equipped with multiple GPUs. Our early work includes an efficient GPU implementation of large-scale recurrent neural networks. The recurrent neural network (RNN) is a special type of neural network equipped with additional recurrent connections; however, its large computational complexity makes it difficult to train effectively and has significantly limited RNN research over the last two decades. In this work, we explore the potential parallelism of the recurrent neural network and propose a fine-grained, two-stage pipeline implementation. Experimental results show that the proposed GPU implementation achieves a 2~11× speedup over a basic CPU implementation using the Intel Math Kernel Library.
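To make the parallelization idea concrete, here is a minimal NumPy sketch that organizes an RNN forward pass in two stages: one large, GPU-friendly matrix multiply covering all input projections, followed by the inherently sequential recurrent updates. This is only a simplified model of the approach; names and sizes are illustrative, and the actual implementation pipelines these stages on the GPU at a fine granularity.

```python
import numpy as np

def rnn_forward(x_seq, w_in, w_rec, b):
    """Simple RNN forward pass organized in two stages.

    Stage 1 batches the input projections of all timesteps into one
    large matrix multiply (highly parallel, ideal for a GPU), while
    stage 2 performs the sequential recurrent updates. A NumPy sketch
    of the idea only, not the paper's fine-grained GPU pipeline.
    """
    t_steps, _ = x_seq.shape
    hidden = np.zeros(w_rec.shape[0])
    # Stage 1: one GEMM covering every timestep's input projection.
    input_proj = x_seq @ w_in.T  # shape: (t_steps, hidden_size)
    outputs = []
    # Stage 2: sequential recurrence over timesteps.
    for t in range(t_steps):
        hidden = np.tanh(input_proj[t] + w_rec @ hidden + b)
        outputs.append(hidden)
    return np.stack(outputs)

# Example: 8 timesteps of 16-dim input, 32 hidden units.
rng = np.random.default_rng(0)
out = rnn_forward(rng.standard_normal((8, 16)),
                  rng.standard_normal((32, 16)),
                  rng.standard_normal((32, 32)) * 0.1,
                  np.zeros(32))
```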

Embedded CNN on FPGA

Convolutional Neural Networks (CNNs) have shown significant performance improvement in recent years. In ImageNet 2014, one of the largest and most challenging computer vision challenges in the world, GoogLeNet from Google won the classification competition with a 6.65% top-5 error rate. These promising results point to a wide range of near-future application scenarios for CNNs. Recent CNN models, despite their remarkable performance improvement, are becoming more and more complex in terms of model architecture, making it difficult to achieve real-time performance on embedded systems, which are constrained in energy and computation resources. Our group aims to use CNNs to perform object classification tasks on FPGAs, achieving real-time performance and high energy efficiency. Our contributions can be summarized as follows:

  • We cut down the computation workload of CNN models at the arithmetic level, while keeping the accuracy degradation as small as possible.
  • We use fixed-point numbers with configurable precision instead of floating-point numbers in the hardware system due to limited bandwidth; the precision of the fixed-point numbers is carefully studied to minimize accuracy degradation (see the sketch after this list).
  • We design a customizable hardware architecture to fully utilize the computation resources and bandwidth of the available FPGA platform.
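As a rough illustration of the second point, the sketch below quantizes floating-point values to signed fixed-point with a configurable split between integer and fractional bits. The function and bit widths are illustrative assumptions; the precision actually used on the FPGA is chosen per design after studying the accuracy impact.

```python
import numpy as np

def to_fixed_point(x, int_bits=2, frac_bits=5):
    """Quantize float values to signed fixed-point with a configurable
    integer/fraction split (8 bits total here: 1 sign + 2 int + 5 frac).
    Illustrative only; the bit widths in our design are selected per layer.
    """
    scale = 2 ** frac_bits
    total_bits = 1 + int_bits + frac_bits  # sign + integer + fraction
    q_max = 2 ** (total_bits - 1) - 1
    q_min = -2 ** (total_bits - 1)
    # Round to the nearest representable value, then saturate.
    q = np.clip(np.round(x * scale), q_min, q_max)
    return q / scale  # dequantized value used for accuracy simulation

# Example: values near the representable range saturate cleanly.
w = np.array([0.013, -1.74, 3.9, -4.2])
print(to_fixed_point(w))  # -> [ 0.      -1.75     3.90625 -4.     ]
```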

Publications

  • Lixue Xia, Wenqin Huangfu, Tianqi Tang, Xiling Yin, Krishnendu Chakrabarty, Yuan Xie, Yu Wang, Huazhong Yang, Stuck-at Fault Tolerance in RRAM Computing Systems, in IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS), vol.8, No.1, 2018, pp.102-115.
  • Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Jincheng Yu, Junbin Wang, Song Yao, Song Han, Yu Wang, Huazhong Yang, Angel-Eye: A Complete Design Flow for Mapping CNN onto Embedded FPGA, in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol.37, No.1, 2018, pp.35-47. pdf
  • Lixue Xia, Boxun Li, Tianqi Tang, Peng Gu, Pai-Yu Chen, Shimeng Yu, Yu Cao, Yu Wang, Yuan Xie, Huazhong Yang, MNSIM: Simulation Platform for Memristor-based Neuromorphic Computing System, in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol.37, No.5, 2018, pp.1009-1022. pdf
  • Yi Cai, Yujun Lin, Lixue Xia, Xiaoming Chen, Song Han, Yu Wang, Huazhong Yang, Long Live TIME: Improving Lifetime for Training-In-Memory Engines by Structured Gradient Sparsification, in Design Automation Conference (DAC), 2018.
  • Jincheng Yu, Kaiyuan Guo, Yiming Hu, Xuefei Ning, Jiantao Qiu, Huizi Mao, Song Yao, Tianqi Tang, Boxun Li, Yu Wang, Huazhong Yang, Real-time Object Detection towards High Power Efficiency, in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018, pp.704-708. pdf
  • Yi Cai, Tianqi Tang, Lixue Xia, Ming Cheng, Zhenhua Zhu, Yu Wang, Huazhong Yang, Training Low Bitwidth Convolutional Neural Networks on RRAM, in Proceedings of the 23rd Asia and South Pacific Design Automation Conference (ASP-DAC), 2018, pp.117-122. pdf
  • Keni Qiu, Weiwen Chen, Yuanchao Xu, Lixue Xia, Yu Wang, Zili Shao, A Low Power Design Enabled by a Peripheral Circuit Reuse Structure Integrated with a Retimed Data Flow for RRAM Crossbar-based Convolutional Neural Network, in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018.
  • Jilan Lin, Lixue Xia, Zhenhua Zhu, Hanbo Sun, Yi Cai, Hui Gao, Ming Cheng, Xiaoming Chen, Yu Wang, Huazhong Yang, Rescuing Memristor-based Computing with Non-linear Resistance Levels, in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018, pp.407-412.
  • Mengyun Liu, Lixue Xia, Yu Wang, Krishnendu Chakrabarty, Design of Fault-Tolerant Neuromorphic Computing Systems, in European Test Symposium (ETS), 2018.
  • Yuanhui Ni, Keni Qiu, Weiwen Chen, Lixue Xia, Yu Wang, Low Power Driven and Multi-CLP Aware Loop Tiling for RRAM Crossbar-based CNN, in ACM/SIGAPP Symposium On Applied Computing (SAC), 2018.
  • Kaiyuan Guo, Song Han, Song Yao, Yu Wang, Yuan Xie, Huazhong Yang, Software-Hardware Codesign for Efficient Neural Network Acceleration, in IEEE Micro, vol.37, No.2, 2017, pp.18-25. pdf
  • Miao Hu, Yiran Chen, J. Joshua Yang, Yu Wang, Hai Helen Li, A Compact Memristor-Based Dynamic Synapse for Spiking Neural Networks, in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol.36, No.8, 2017. pdf
  • Wenqin Huangfu, Lixue Xia, Ming Cheng, Xiling Yin, Tianqi Tang, Boxun Li, Krishnendu Chakrabarty, Yuan Xie, Yu Wang, Huazhong Yang, Computation-Oriented Fault-Tolerance Schemes for RRAM Computing Systems, in Proceedings of the 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), 2017, pp.794-799. pdf slide
  • Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang, Binary Convolutional Neural Network on RRAM, in Proceedings of the 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), 2017, pp.782-787. pdf slide
  • Lixue Xia, Mengyun Liu, Xuefei Ning, Krishnendu Chakrabarty, Yu Wang, Fault-Tolerant Training with On-Line Fault Detection for RRAM-Based Neural Computing Systems, in Design Automation Conference (DAC), 2017. pdf
  • Ming Cheng, Lixue Xia, Zhenhua Zhu, Yi Cai, Yuan Xie, Yu Wang, Huazhong Yang, TIME: A Training-in-memory Architecture for Memristor-based Deep Neural Network, in Design Automation Conference (DAC), 2017, pp.26:1-26:6. pdf slide
  • Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, William J. Dally, ESE: Efficient Speech Recognition Engine with Compressed LSTM on FPGA, in ACM International Symposium on FPGA, 2017, pp.75-84. pdf
  • Wei-Hao Chen, Win-San Khwa, Jun-Yi Li, Wei-Yu Lin, Huan-Ting Lin, Yongpan Liu, Yu Wang, Huaqiang Wu, Huazhong Yang, Meng-Fan Chang, Circuit Design for Beyond Von Neumann Applications Using Emerging Memory: From Nonvolatile Logics to Neuromorphic Computing, in International Symposium on Quality Electronic Design (ISQED), 2017, pp.23-28. pdf
  • Fang Su, Wei-Hao Chen, Lixue Xia, Chieh-Pu Lo, Tianqi Tang, Zhibo Wang, Kuo-Hsiang Hsu, Ming Cheng, Jun-Yi Li, Yuan Xie, Yu Wang, Meng-Fan Chang, Huazhong Yang, Yongpan Liu, A 462GOPs/J RRAM-Based Nonvolatile Intelligent Processor for Energy Harvesting IoE System Featuring Nonvolatile Logics and Processing-In-Memory, in IEEE Symposium on VLSI Circuits (VLSIC), 2017. pdf
  • Boxun Li, Peng Gu, Yu Wang, Huazhong Yang, Exploring the Precision Limitation for RRAM-based Analog Approximate Computing, in IEEE Design & Test (D&T), vol.33, No.1, 2016, pp.51-58. pdf
  • Xiaoxiao Liu, Mengjie Mao, Beiye Liu, Boxun Li, Yu Wang, Hao Jiang, Mark Barnell, Qing Wu, Jianhua Yang, Hai Li, Yiran Chen, Harmonica: A Framework of Heterogeneous Computing Systems With Memristor-Based Neuromorphic Computing Accelerators, in IEEE Transactions on Circuits and Systems I: Regular Papers, 2016. pdf
  • Huizi Mao, Song Yao, Tianqi Tang, Boxun Li, Jun Yao, Yu Wang, Towards Real-Time Object Detection on Embedded Systems, in IEEE Transactions on Emerging Topics in Computing, 2016. pdf
  • Lixue Xia, Peng Gu, Boxun Li, Tianqi Tang, Xiling Yin, Wenqin Huangfu, Shimeng Yu, Yu Cao, Yu Wang, Huazhong Yang, Technological Exploration of RRAM Crossbar Array for Matrix-Vector Multiplication, in Journal of Computer Science and Technology (JCST), vol.31, No.1, 2016, pp.3-19. pdf
  • Deming Zhang, Lang Zeng, Mengxing Wang, Youguang Zhang, Jacques-Olivier Klein, Yu Wang, Weisheng Zhao, All-Spin Artificial Neural Network Based on Compound Spintronic Synapse and Neuron, in IEEE Transactions on Biomedical Circuits and Systems, 2016. pdf
  • Yu Wang, Lixue Xia, Ming Cheng, Tianqi Tang, Boxun Li, Huazhong Yang, RRAM Based Learning Acceleration, in Compilers, Architecture, and Synthesis for Embedded Systems (CASES), invited talk, 2016, pp.1-2. pdf
  • Lixue Xia, Tianqi Tang, Wenqin Huangfu, Ming Cheng, Xiling Yin, Boxun Li, Yu Wang, Huazhong Yang, Switched by Input: Power Efficient Structure for RRAM-based Convolutional Neural Network, in Design Automation Conference (DAC), 2016, pp.125:1-125:6. pdf slide
  • Lixue Xia, Boxun Li, Tianqi Tang, Peng Gu, Xiling Yin, Wenqin Huangfu, Pai-Yu Chen, Shimeng Yu, Yu Cao, Yu Wang, Yuan Xie, Huazhong Yang, MNSIM: Simulation Platform for Memristor-based Neuromorphic Computing System, in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2016, pp.469-474. pdf slide
  • Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, Huazhong Yang, Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, in ACM International Symposium on FPGA, 2016, pp.26-35. pdf slide
  • Sicheng Li, Yu Wang, Hai Li, A Data Locality-aware Design Framework for Reconfigurable Sparse Matrix-Vector Multiplication Kernel, in International Conference On Computer Aided Design (ICCAD), 2016, pp.1-8. pdf
  • Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, Yuan Xie, A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory, in the 43rd ACM/IEEE International Symposium on Computer Architecture (ISCA), 2016, pp.1-14. pdf
  • Yu Wang, Lixue Xia, Tianqi Tang, Boxun Li, Song Yao, Ming Cheng, Huazhong Yang, Low Power Convolutional Neural Networks on a Chip, in ISCAS, 2016, pp.129-132. pdf slide
  • Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Song Yao, Song Han, Yu Wang, Huazhong Yang, Angel-Eye: A Complete Design Flow for Mapping CNN onto Customized Hardware, in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2016, pp.24-29. pdf
  • Boxun Li, Peng Gu, Yi Shan, Yu Wang, Yiran Chen, Huazhong Yang, RRAM-based Analog Approximate Computing, in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol.34, No.12, 2015, pp.1905-1917. pdf
  • Peng Gu, Boxun Li, Tianqi Tang, Shimeng Yu, Yu Cao, Yu Wang, Huazhong Yang, Technological Exploration of RRAM Crossbar Array for Matrix-Vector Multiplication, in Proceedings of the 20th Asia and South Pacific Design Automation Conference (ASP-DAC), 2015, pp.106-111. pdf
  • Boxun Li, Lixue Xia, Peng Gu, Yu Wang, Huazhong Yang, Merging the Interface: Power, Area and Accuracy Co-optimization for RRAM Crossbar-based Mixed-Signal Computing System, in the 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), 2015, pp.13:1-13:6. pdf
  • Xiaoxiao Liu, Mengjie Mao, Beiye Liu, Boxun Li, Hao Jiang, Yu Wang, Mark Barnell, Qing Wu, J. Joshua Yang, RENO: A Highly-efficient Reconfigurable Neuromorphic Computing Accelerator Design, in the 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), 2015, pp.1-6. pdf
  • Tianqi Tang, Lixue Xia, Boxun Li, Rong Luo, Yu Wang, Yiran Chen, Huazhong Yang, Spiking Neural Network with RRAM: Can We Use It for Real-World Application?, in Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015, pp.860-865. pdf
  • Sicheng Li, Chunpeng Wu, Boxun Li, Yu Wang, Qinru Qiu, Hai Li, FPGA Acceleration for Recurrent Neural Network Language Model, in Proceedings of the IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2015, pp.111-118. pdf
  • Yu Wang, Tianqi Tang, Lixue Xia, Boxun Li, Peng Gu, Hai Li, Yuan Xie, Huazhong Yang, Energy Efficient RRAM Spiking Neural Network for Real Time Classification, in Proceedings of the 25th Edition on Great Lakes Symposium on VLSI (GLSVLSI), 2015, pp.189-194. pdf
  • Yung-Hsiang Lu, Alan M. Kadin, Alexander C. Berg, Thomas M. Conte, Erik P. DeBenedictis, Rachit Garg, Ganesh Gingade, Bichlien Hoang, Yongzhen Huang, Boxun Li, Jingyu Liu, Wei Liu, Huizi Mao, Junran Peng, Tianqi Tang, Elie K. Track, Jingqiu Wang, Tao Wang, Yu Wang, Jun Yao, Rebooting Computing and Low-Power Image Recognition Challenge, in IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2015, pp.927-932. pdf
  • Shimeng Yu, Pai-Yu Chen, Yu Cao, Lixue Xia, Yu Wang, Huaqiang Wu, Scaling-up Resistive Synaptic Arrays for Neuro-inspired Architecture: Challenges and Prospect, in IEEE International Electron Devices Meeting (IEDM), 2015, pp.451-454. pdf
  • Hong Zhang, Xue Feng, Boxun Li, Yu Wang, Kaiyu Cui, Fang Liu, Weibei Dou, Yidong Huang, Integrated Photonic Reservoir Computing Based on Hierarchical Time-Multiplexing Structure, in Optics Express, vol.22, No.25, 2014, pp.31356-31370. pdf
  • Boxun Li, Yuzhi Wang, Yu Wang, Yiran Chen, Huazhong Yang, Training Itself: Mixed-signal Training Acceleration for Memristor-based Neural Network, in Proceedings of the 19th Asia and South Pacific Design Automation Conference (ASP-DAC), 2014, pp.361-366. pdf
  • Miao Hu, Yu Wang, Qinru Qiu, Yiran Chen, Hai Li, The Stochastic Modeling of TiO2 Memristor and Its Usage in Neuromorphic System Design, in Proceedings of the 19th Asia and South Pacific Design Automation Conference (ASP-DAC), 2014, pp.831-836. pdf
  • Boxun Li, Yu Wang, Yiran Chen, Hai Helen Li, Huazhong Yang, ICE: Inline Calibration for Memristor Crossbar-based Computing Engine, in Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), 2014, pp.1-4. pdf
  • Yu Wang, Boxun Li, Rong Luo, Yiran Chen, Ningyi Xu, Huazhong Yang, Energy Efficient Neural Networks for Big Data Analytics, in Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), 2014, pp.1-2. pdf
  • Boxun Li, Erjin Zhou, Bo Huang, Jiayi Duan, Yu Wang, Ningyi Xu, Jiaxing Zhang, Huazhong Yang, Large Scale Recurrent Neural Network on GPU, in Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2014, pp.4062-4069. pdf
  • Tianqi Tang, Rong Luo, Boxun Li, Hai Li, Yu Wang, Huazhong Yang, Energy Efficient Spiking Neural Network Design with RRAM Devices, in Proceedings of the 14th International Symposium on Integrated Circuits (ISIC), 2014, pp.268-271. pdf
  • Boxun Li, Yi Shan, Miao Hu, Yu Wang, Yiran Chen, Huazhong Yang, Memristor-based Approximated Computation, in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2013, pp.242-247. pdf

copyright 2019 © NICS Lab of Tsinghua University