2016: DianNao Family: Energy-Efficient Hardware Accelerators for Machine Learning


  • This was published in
  • https://cacm.acm.org/
  • Communications of the ACM

  • The reference is
Chen Y, Chen T, Xu Z, et al.
DianNao Family: Energy-Efficient Hardware Accelerators for Machine Learning[J].
Communications of the ACM, 2016, 59(11): 105-112
  • https://cacm.acm.org/magazines/2016/11/209123-diannao-family/fulltext
  • Found it there, as expected

  • and yes, I downloaded it
The original version of this paper is entitled "DianNao: A
Small-Footprint, High-Throughput Accelerator for Ubiquitous
Machine Learning" and was published in Proceedings
of the International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS)
49, 4 (March 2014), ACM, New York, NY, 269-284.

Abstract

  • ML is pervasive
    • broad range of applications
      • broad range of systems (from embedded to data centers)

  • computers are moving
    • toward heterogeneous multi-cores
    • a mix of cores and hardware accelerators
  • designing hardware accelerators for ML
    • can achieve both high efficiency and broad application scope

Paragraph 2

  • efficient computational primitives are
    • important for a hardware accelerator,
  • but inefficient memory transfers can
    • potentially void the throughput, energy, or cost advantages of accelerators,
  • an Amdahl's law effect
  • so memory should become a first-order concern,

  • just like in processors,
    • rather than an element factored into accelerator design as a second step

  • a series of hardware accelerators
    • designed for ML (neural networks),
    • studying the impact of memory on accelerator design, performance, and energy.

  • on representative neural network layers:
  • 450.65x speedup over a GPU
  • energy reduced by 150.31x on average
    • for the 64-chip DaDianNao (a member of the DianNao family)

1 INTRODUCTION

  • designing hardware accelerators which realize the best possible tradeoff between flexibility and efficiency is becoming a prominent
    issue.

  • The first question is for which category of applications one should primarily design accelerators?
  • Together with the architecture trend towards accelerators, a second simultaneous and significant trend in high-performance and embedded applications is developing: many of the emerging high-performance and embedded applications, from image/video/audio recognition to automatic translation, business analytics, and robotics rely on machine learning
    techniques.
  • This trend in applications comes together with a third trend in machine learning (ML) where a small number
    of techniques, based on neural networks (especially deep learning techniques [16, 26]), have proven in the past few
    years to be state-of-the-art across a broad range of applications.
  • As a result, there is a unique opportunity to design accelerators having significant application scope as well as
    high performance and efficiency. [4]

Paragraph 2

  • Currently, ML workloads are
  • mostly executed on
    • multicores using SIMD [44]
    • on GPUs [7]
    • or on FPGAs [2]

  • the aforementioned trends
    • have already been identified
    • by researchers who have proposed accelerators implementing
  • CNNs [2]
  • Multi-Layer Perceptrons [43]

  • accelerators focusing on other domains,
    • such as image processing,
    • also propose efficient implementations of some of the computational primitives used
    • by machine-learning techniques, such as convolutions [37]

  • There are also ASIC implementations of ML techniques
    • such as Support Vector Machines and CNNs.

  • these works focus on
    • efficiently implementing the computational primitives, and either
      • ignore memory transfers for the sake of simplicity [37, 43], or
      • plug their computational accelerator into memory via a more or less sophisticated DMA [2, 12, 19]

Paragraph 3

  • While efficient implementation of computational primitives is a first and important step with promising results,
    inefficient memory transfers can potentially void the throughput, energy, or cost advantages of accelerators, that is, an
    Amdahl's law effect, and thus, they should become a first-
    order concern, just like in processors, rather than an element
    factored into accelerator design as a second step.

  • Unlike in processors though, one can factor in the specific nature of
    memory transfers in target algorithms, just as is done for accelerating computations.

  • This is especially important in the domain of ML where there is a clear trend towards scaling up the size of learning models in order to achieve better accuracy and more functionality. [16, 24]

Paragraph 4

  • In this article, we introduce a series of hardware accelerators designed for ML (especially neural networks), including
    DianNao, DaDianNao, ShiDianNao, and PuDianNao as listed in Table 1.
  • We focus our study on memory usage, and we investigate the accelerator architecture to minimize memory
    transfers and to perform them as efficiently as possible.

2 DIANNAO: A NN ACCELERATOR

  • DianNao
    • the first of the DianNao accelerator family,
  • accommodates state-of-the-art neural network techniques (deep learning),
  • and inherits the broad application scope of neural networks.

2.1 Architecture

  • DianNao
    • input buffer for input neurons (NBin)
    • output buffer for output neurons (NBout)
    • buffer for synaptic weights (SB)
    • connected to a computational block (performing both synapse and neuron computations)
    • the Neural Functional Unit (NFU), plus a Control Processor (CP); see Figure 1

NBin holds the input neurons,
SB holds the synaptic weights,
and NBout holds the output neurons.

I read the figure this way: 2 input neurons and 2 synapses, multiply them pairwise, and the output is 1 neuron. But this NFU is impressive: it can compute two output neurons in one go.
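
To make the three buffers concrete, here is a minimal C sketch of their geometry. The parameters (Ti = Tn = 16, 64-entry buffers, 16-bit fixed-point data) are the design point I understand the paper to report; treat them as assumptions of this sketch, not something the code derives.

#include <stdint.h>

#define Ti 16        /* inputs/synapses consumed per cycle (assumed) */
#define Tn 16        /* output neurons produced per cycle (assumed)  */
#define ENTRIES 64   /* buffer depth (assumed)                       */

/* The three split on-chip buffers, one per data type. */
typedef struct {
    int16_t nbin [ENTRIES][Ti];      /* NBin : input neurons, Ti-wide reads       */
    int16_t sb   [ENTRIES][Tn][Ti];  /* SB   : synaptic weights, Tn*Ti-wide reads */
    int16_t nbout[ENTRIES][Tn];      /* NBout: output neurons, Tn-wide writes     */
} diannao_buffers;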

NFU

  • a functional block of $T_i$ inputs/synapses
    • and $T_n$ output neurons,
  • time-shared by different algorithmic blocks of neurons.

So the NFU turns $T_i$ inputs and synapses into $T_n$ output neurons. But shouldn't that take $T_i \times T_n$ synapses?? (It should, and it does: SB's read port is $T_n \times T_i$ weights wide, while NBin supplies only $T_i$ inputs per cycle.)
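
A one-cycle software model of the NFU makes that concrete. This is a sketch using the Ti/Tn constants from the buffer sketch above, not the paper's RTL: each of the Tn lanes consumes its own row of Ti weights, which is why SB must feed Ti*Tn synapses per cycle while NBin feeds only Ti inputs.

/* One NFU cycle in plain C (sketch): NFU-1 = Ti*Tn multipliers,
 * NFU-2 = Tn adder trees accumulating into partial sums. */
void nfu_cycle(const int16_t in[Ti],     /* from NBin: Ti inputs      */
               const int16_t w[Tn][Ti],  /* from SB:   Ti*Tn synapses */
               int32_t partial[Tn])      /* toward NBout after NFU-3  */
{
    for (int n = 0; n < Tn; n++) {       /* Tn lanes, parallel in HW  */
        int32_t acc = 0;
        for (int i = 0; i < Ti; i++)     /* NFU-1: multiplications    */
            acc += (int32_t)w[n][i] * in[i];
        partial[n] += acc;               /* NFU-2: adder tree + accumulate */
    }
}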

  • Depending on the layer type,
    • computations at the NFU can be decomposed into either two or three stages

  • For classifier and convolutional layers:
    • multiplication of synapses $\times$ inputs: NFU-1
    • additions of all multiplications: NFU-2
    • sigmoid: NFU-3

For a classifier or convolutional layer, it is simply synapses $\times$ inputs, summed up, then a sigmoid. This I can follow; that case is just a convolution.

If it is a classifier layer, then the input is …

  • the last stage (sigmoid or another nonlinear function) can vary.

  • For pooling layers there is no multiplication (no synapses),
    • and the pooling can be average or max.

  • the adders have multiple inputs,
    • so they are in fact adder trees,

  • the second stage also contains
    • shifters and max operators for pooling.

Why would shifters be needed??
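
My guess (the paper only lists "shifters and max operators"): the shifters implement the division in average pooling. When the pooling window has a power-of-two area, dividing the accumulated sum by the area is just a right shift, as in this sketch:

/* Average pooling of a 2x2 window without a divider:
 * area = 4 = 2^2, so sum >> 2 divides by the window size. */
int16_t avg_pool_2x2(const int16_t win[2][2])
{
    int32_t sum = win[0][0] + win[0][1] + win[1][0] + win[1][1];
    return (int16_t)(sum >> 2);
}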

  • the sigmoid function (for classifier and convolutional layers) can be efficiently implemented by piecewise linear interpolation, $f(x) = a_i x + b_i$ for $x \in [x_i, x_{i+1}]$ (16 segments are sufficient)
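
A quick sketch of that piecewise-linear trick in C. Here the 16 segments cover [-8, 8] and the $a_i$, $b_i$ coefficients are recomputed from the exact sigmoid at the segment endpoints; the accelerator would instead read them from a small table, and the paper's actual breakpoints may differ.

#include <math.h>

static double sigmoid_exact(double x) { return 1.0 / (1.0 + exp(-x)); }

/* f(x) = a_i * x + b_i on [x_i, x_{i+1}], 16 segments over [-8, 8]. */
double sigmoid_pwl(double x)
{
    const double lo = -8.0, hi = 8.0, step = (hi - lo) / 16.0;
    if (x <= lo) return 0.0;
    if (x >= hi) return 1.0;
    int i = (int)((x - lo) / step);        /* segment index 0..15    */
    double x0 = lo + i * step, x1 = x0 + step;
    double y0 = sigmoid_exact(x0), y1 = sigmoid_exact(x1);
    double a  = (y1 - y0) / (x1 - x0);     /* a_i: segment slope     */
    double b  = y0 - a * x0;               /* b_i: segment intercept */
    return a * x + b;
}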

On-chip Storage

  • the on-chip storage structures of DianNao
    • can be construed as modified buffers or scratchpads.

  • While a cache is an excellent storage structure for a general-purpose processor, it is a sub-optimal way to exploit reuse because of the cache access overhead (tag check, associativity, line size, speculative read, etc.) and cache conflicts.
  • The efficient alternative, scratchpad, is used in VLIW processors but it is known to be very difficult to compile for.
  • However a scratchpad in a dedicated accelerator realizes the best of both worlds: efficient
    storage, and both efficient and easy exploitation of locality because only a few algorithms have to be manually adapted.
Paragraph 2
  • on-chip storage is split into three structures (NBin, NBout, and SB), because there are three types of data (input neurons, output neurons, and synapses) with different characteristics (read width and reuse distance).

  • The first benefit of splitting structures is to tailor the SRAMs to the appropriate
    read/write width,
  • and the second benefit of splitting storage structures is to avoid conflicts, as would occur in a cache.
  • Moreover, we implement three DMAs to exploit spatial locality of data, one for each buffer (two load DMAs for inputs, one store DMA for outputs).
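
The three DMA channels amount to streaming tiles between DRAM and the split buffers. Below is a sketch of one tile's round trip, with memcpy standing in for the hardware DMAs and reusing the diannao_buffers struct from the earlier sketch; tile sizes are illustrative.

#include <string.h>

/* Stream one tile: two load DMAs (NBin, SB), compute, one store DMA (NBout). */
void run_tile(const int16_t *dram_in, const int16_t *dram_w,
              int16_t *dram_out, diannao_buffers *buf, int rows)
{
    memcpy(buf->nbin, dram_in, rows * Ti * sizeof(int16_t));       /* load DMA 1 */
    memcpy(buf->sb,   dram_w,  rows * Tn * Ti * sizeof(int16_t));  /* load DMA 2 */
    /* ... NFU consumes buf->nbin and buf->sb, fills buf->nbout ... */
    memcpy(dram_out, buf->nbout, rows * Tn * sizeof(int16_t));     /* store DMA  */
}

In the real design the DMAs are asynchronous, so loads for the next tile can overlap compute on the current one; this sketch is sequential for clarity.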

2.2 Loop tiling

  • DianNao uses loop tiling to reduce memory accesses
    • so it can accommodate large neural networks
  • Example
    • a classifier layer
      • $N_n$ output neurons
      • fully connected to $N_i$ inputs
      • as in the figure below

With $N_n$ outputs and $N_i$ inputs, the synapse matrix is $N_n \times N_i$; multiplying this matrix by the $N_i$-element input vector gives the result.

  • First fetch one tile
    • this puzzled me at first
    • what if the first output element depends on all of the inputs
    • how do you compute it then?
    • well, when computing the first output
    • you only need one row of the synapse matrix!
    • so the huge synapse matrix is simply brought in row-block by row-block
  • Below are the original code and
    • the tiled code
    • which maps the classifier layer onto DianNao


for (int n = 0; n < Nn; n++)
    sum[n] = 0;
for (int n = 0; n < Nn; n++)        // output neurons
    for (int i = 0; i < Ni; i++)    // input neurons
        sum[n] += synapse[n][i] * neuron[i];
for (int n = 0; n < Nn; n++)
    neuron[n] = Sigmoid(sum[n]);
  • My take:
    • fetch Tnn outputs at a time
    • and Tii inputs
    • but that is still too big for the hardware
    • so split again
    • into Tn and Ti
    • and that's it
for (int nnn = 0; nnn < Nn; nnn += Tnn) {
    // tiling for output neurons:
    // the first loop stages Tnn outputs at a time
    for (int iii = 0; iii < Ni; iii += Tii) {
        // tiling for input neurons:
        // the second loop brings in Tii inputs at a time;
        // everything below works on these two tiles

        for (int nn = nnn; nn < nnn + Tnn; nn += Tn) {
            // the third loop splits Tnn further (it is still too
            // large for the hardware) into chunks of Tn,
            // each starting at position nn

            for (int n = nn; n < nn + Tn; n++)
                // step one: zero all the partial sums
                sum[n] = 0;
            // to get sum[n]: sum[n] = row n of synapse times all of neuron
            for (int ii = iii; ii < iii + Tii; ii += Ti)
                // this loop splits Tii into chunks of Ti
                for (int n = nn; n < nn + Tn; n++)
                    for (int i = ii; i < ii + Ti; i++)
                        sum[n] += synapse[n][i] * neuron[i];

            for (int nn = nnn; nn < nnn + Tnn; nn += Tn)
                neuron[n] = sigmoid(sum[n]);   // (bug: n is out of scope here)
        }
    }
}
  • In the tiled code, loops ii and nn
    • reflect the fact that the NFU has $T_i$ inputs/synapses
      • and $T_n$ output neurons
  • input neurons need to be reused by every output neuron
    • but the input vector is far too large
    • to fit into NBin
    • so loop ii is also tiled, with factor $T_{ii}$

The code above definitely has bugs; the correct version is as follows:

for (int nnn = 0; nnn < Nn; nnn += Tnn) {
    for (int nn = nnn; nn < nnn + Tnn; nn += Tn) {
        for (int n = nn; n < nn + Tn; n++)
            sum[n] = 0;
        for (int iii = 0; iii < Ni; iii += Tii) {
            for (int ii = iii; ii < iii + Tii; ii += Ti)
                for (int n = nn; n < nn + Tn; n++)
                    for (int i = ii; i < ii + Ti; i++)
                        sum[n] += synapse[n][i] * neuron[i];
        }
        for (int n = nn; n < nn + Tn; n++)
            printf("s%ds ", sum[n]);   // verification print in place of the sigmoid step
    }
}
for (int index = 0; index < Nn; index++)
    printf("%d ", sum[index]);
Copyright notice: this is an original post by the blogger, licensed under CC 4.0 BY-SA; include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/zhoutianzi12/article/details/110244427
