NUCA-2A: A New Adaptive and Behavior Aware Block Placement Process

Author(s): Mohamed Salah Souahi*, Mohamed Ben Mohammed.

Journal Name: Recent Patents on Computer Science

Volume 12, Issue 2, 2019



Abstract:

Background: The last three decades have seen a spectacular evolution of CPUs. Both the number of cores per chip and the size of the shared Last Level Cache (LLC) keep increasing, which makes the LLC the system's bottleneck. A major weakness of future cache memory hierarchies is that they make memory blocks available for vertical requests without considering horizontal proximity to the cores. Simulations show that some LLC accesses cost more latency cycles than off-chip accesses.

Objective: This paper presents a new adaptive, block-behavior-aware placement process called NUCA-2A. It manages blocks in the LLC in order to reduce the LLC's latency and its inner bandwidth, by studying each block's behavior and placing the block in the most suitable location among the LLC banks.

Methods: LLC accesses are classified according to their specific behavior. The authors also establish a two-level horizontal hierarchy in the LLC. Blocks are then placed in the zones that best match their behavior.
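
The placement idea in the Methods paragraph can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: a 4x4 mesh with one core and one LLC bank per tile, a "near" zone defined as the banks within one hop of a core, two behavior classes ("private" and "shared"), and the function names. It is not the NUCA-2A algorithm itself, only one way a behavior-aware placement over a two-level horizontal hierarchy could look.

```python
from collections import Counter

MESH_DIM = 4       # assumption: 4x4 tiled CMP, one core and one LLC bank per tile
NEAR_RADIUS = 1    # assumption: banks within 1 hop form a core's "near" zone


def tile_xy(tile_id):
    """Map a tile id (0..15) to (x, y) mesh coordinates."""
    return tile_id % MESH_DIM, tile_id // MESH_DIM


def hops(a, b):
    """Manhattan distance between two tiles on the mesh."""
    ax, ay = tile_xy(a)
    bx, by = tile_xy(b)
    return abs(ax - bx) + abs(ay - by)


def classify(accessors):
    """Hypothetical behavior classes: 'private' if a single core
    touched the block, 'shared' otherwise."""
    return "private" if len(set(accessors)) == 1 else "shared"


def static_bank(block_addr):
    """S-NUCA-style baseline: the bank is fixed by address bits only."""
    return block_addr % (MESH_DIM * MESH_DIM)


def place(block_addr, accessors):
    """Pick a bank for a block, given the core ids that accessed it."""
    n_banks = MESH_DIM * MESH_DIM
    if classify(accessors) == "private":
        owner = accessors[0]
        near = [b for b in range(n_banks) if hops(owner, b) <= NEAR_RADIUS]
        # Spread private blocks over the owner's near zone using address bits.
        return near[block_addr % len(near)]
    # Shared block: choose the bank with the lowest access-weighted hop count
    # (ties broken by the lowest bank id).
    weights = Counter(accessors)
    return min(
        range(n_banks),
        key=lambda b: (sum(w * hops(c, b) for c, w in weights.items()), b),
    )
```

In this toy model, a block touched only by core 5 lands within one hop of core 5, whereas the address-only baseline may put it anywhere on the mesh; comparing `hops(core, place(...))` with `hops(core, static_bank(...))` over a trace gives a rough feel for the distance saved.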

Results: Compared with the classic S-NUCA scheme, NUCA-2A achieves a reduction of up to 60.39% in global LLC latency and 40.74% in average inner traffic. It also achieves an average speedup of 17.89% in terms of the number of instructions executed per cycle.

Conclusion: Studying block behavior gives encouraging results. Several methods are used in different fields to forecast behavior based on previous observations. We are working on a prefetching model that allows block migration to and from privileged banks.

Keywords: Processor, multicore, cache memory, LLC, block migration, latency, NUCA, CMP.




Article Details

VOLUME: 12
ISSUE: 2
Year: 2019
Page: [101 - 109]
Pages: 9
DOI: 10.2174/2213275911666181114113340
Price: $58
