![]() These results indicate an opportunity for energy management of the space that is used for storing spilled variables, with the framework improving the performance by a geometric mean of 8.3%, 16.3%, and 29.2% for general shader processors in which variables are spilled to memory with 5-, 10-, and 20-cycle access latencies, respectively. Experimental results on a cycle-approximate simulator showed that the proposed framework decreased 92% of register spills in total and made 91.7% of 14 common shader programs spill-free. The framework involves two major components: (1) element-based register allocation that allocates registers based on the element requirement of variables and (2) register packing that rearranges elements of registers in order to increase the number of contiguous free elements, thereby keeping more live variables in registers. In this paper we present a vector-aware register allocation framework to improve register utilization on shader architectures. Programs that run with a low number of registers may suffer from high register pressure if register allocation is not properly designed, especially on a GPU in which a register is divided into four elements and each element can be accessed separately, because allocating a register for a vector-type variable that does not contain values in all elements wastes register spaces. The number of registers is one of the common limiting factors in an embedded GPU design. However, many em- bedded systems have to manage highly restricted hardware resources in order to achieve high performance or energy efficiency. Graphics processing units (GPUs) are now widely used in embedded systems for manipulating computer graphics and even for general-purpose computation. DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less memory and computation to achieve high performance. ![]() We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. Whereas traditional convolutional networks with L layers have L connections - one between each layer and its subsequent layer - our network has L(L+1)/2 direct connections. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. Experimental results on the memory requests of a representative NN accelerator demonstrate that the proposed approach can significantly reduce the memory consumption by 30% at most compared with state‐of‐the‐art SPM management techniques, and the memory usage is only 2% larger than that of the theoretical optimal allocation. In addition, by exploiting the variable‐length feature of SPM, we propose to divide the allocation process into two passes: the space assignment and the address assignment pass, which are simultaneously (and implicitly) performed in traditional one‐pass allocation techniques. The basic intuition is that NN computation/memory behaviors are predictable and relatively regular compared with traditional applications, and thus most information can be determined at compile time. In this article, we propose a novel SPM management approach for NN accelerators. The main reason is that traditional techniques are originally designed for managing fixed‐length registers rather than variable‐length memory blocks. However, traditional SPM management techniques cause memory fragmentation for NN accelerators, and thus lead to low utilization of precious SPM. ![]() Deep neural networks have been demonstrated to be useful in varieties of intelligent tasks, and various specialized NN accelerators have been proposed recently to improve the hardware efficiency, which are typically equipped with software‐managed scratchpad memory (SPM) for high performance and energy efficiency.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |