survey

Approximate Communication: Techniques for Reducing Communication Bottlenecks in Large-Scale Parallel Systems

Authors:
Filipe Betzel (ORCID: 0000-0001-6180-8490), University of Minnesota, Minneapolis, MN
Karen Khatamifard, University of Minnesota, Minneapolis, MN
Harini Suresh, University of Minnesota, Minneapolis, MN
David J. Lilja, University of Minnesota, Minneapolis, MN
John Sartori, University of Minnesota, Minneapolis, MN
Ulya Karpuzcu, University of Minnesota, Minneapolis, MN

ACM Computing Surveys, Volume 51, Issue 1, Article No. 1, pp. 1–32. https://doi.org/10.1145/3145812

Published: 10 January 2018

Abstract

Approximate computing has gained research attention recently as a way to increase energy efficiency and/or performance by exploiting some applications’ intrinsic error resiliency. However, little attention has been given to its potential for tackling the communication bottleneck, which remains one of the looming challenges to efficient parallelism. This article explores the potential benefits of approximate computing for communication reduction by surveying three promising techniques for approximate communication: compression, relaxed synchronization, and value prediction. The techniques are compared based on an evaluation framework composed of communication cost reduction, performance, energy reduction, applicability, overheads, and output degradation. Comparison results demonstrate that lossy link compression and approximate value prediction show great promise for reducing the communication bottleneck in bandwidth-constrained applications. Meanwhile, relaxed synchronization is found to provide large speedups for select error-tolerant applications, but suffers from limited general applicability and unreliable output degradation guarantees. Finally, this article concludes with several suggestions for future research on approximate communication techniques.

References

Tor M. Aamodt and Paul Chow. 2008. Compile-time and instruction-set methods for improving floating-to fixed-point conversion accuracy. ACM Transactions on Embedded Computing Systems 7, 3, 26.Google ScholarDigital Library
Bülent Abali, Hubertus Franke, Dan E. Poff, Robert A. Saccone, Jr., Charles O. Schulz, Lorraine M. Herger, and T. Basil Smith. 2001. Memory expansion technology (MXT): software support and performance. IBM Journal of Research and Development 45, 2, 287--301.Google ScholarDigital Library
Don Adams. 1993. CRAY T3D System Architecture Overview Manual. Retrieved November 29, 2017 from ftp://ftp.cray.com/product-info/mpp/T3D_Architecture_Over/T3D.overview.html.Google Scholar
Ismail Akturk, Karen Khatamifard, and Ulya R. Karpuzcu. 2015. On quantification of accuracy loss in approximate computing. In Workshop on Duplicating, Deconstructing and Debunking (WDDD’15). 15.Google Scholar
Alaa R. Alameldeen and David A. Wood. 2004. Adaptive cache compression for high-performance processors. In Proceedings of the 31st Annual International Symposium on Computer Architecture. IEEE, 212--223.Google Scholar
Alaa R. Alameldeen and David A. Wood. 2007. Interactions between compression and prefetching in chip multiprocessors. In IEEE 13th International Symposium on High Performance Computer Architecture (HPCA’07). IEEE, 228--239.Google Scholar
George Almási, Philip Heidelberger, Charles J. Archer, Xavier Martorell, C. Chris Erway, José E. Moreira, B. Steinmacher-Burow, and Yili Zheng. 2005. Optimization of MPI collective communication on BlueGene/L systems. In Proceedings of the 19th Annual International Conference on Supercomputing (ICS’05). ACM, New York, NY, 253--262. DOI:http://dx.doi.org/10.1145/1088149.1088183Google ScholarDigital Library
Carlos Alvarez, Jesus Corbal, and Mateo Valero. 2005. Fuzzy memoization for floating-point multimedia applications. IEEE Transactions on Computers 54, 7, 922--927.Google ScholarDigital Library
Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference. ACM, 483--485.Google ScholarDigital Library
Baik Song An, Manhee Lee, Ki Hwan Yum, and Eun Jung Kim. 2012. Efficient data packet compression for cache coherent multiprocessor systems. In Data Compression Conference (DCC’12). IEEE, 129--138.Google ScholarDigital Library
Mohammad Ashraful Anam, Paul N. Whatmough, and Yiannis Andreopoulos. 2013. Precision-energy-throughput scaling of generic matrix multiplication and discrete convolution kernels via linear projections. In IEEE 11th Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia’13). IEEE, 21--30.Google ScholarCross Ref
Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, and Saman Amarasinghe. 2009. PetaBricks: A Language and Compiler for Algorithmic Choice. Vol. 44. ACM.Google Scholar
Jason Ansel, Yee Lok Wong, Cy Chan, Marek Olszewski, Alan Edelman, and Saman Amarasinghe. 2011. Language and compiler support for auto-tuning variable-accuracy algorithms. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization. IEEE Computer Society, 85--96.Google ScholarDigital Library
Woongki Baek and Trishul M. Chilimbi. 2010. Green: A framework for supporting energy-conscious programming using controlled approximation. In ACM SIGPLAN Notices, Vol. 45. ACM, 198--209.Google Scholar
Arnab Banerjee, Pascal T. Wolkotte, Robert D. Mullins, Simon W. Moore, and Gerard J. M. Smit. 2009. An energy and performance exploration of network-on-chip architectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 17, 3, 319--329.Google ScholarDigital Library
Carl J. Beckmann and Constantine D. Polychronopoulos. 1990. Fast barrier synchronization hardware. In Proceedings of the 1990 ACM/IEEE Conference on Supercomputing (Supercomputing’90). IEEE Computer Society, Washington, DC, USA, 180--189.Google Scholar
K. Bergman and others. 2008. Exascale computing study: Technology challenges in achieving exascale systems. Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep 15 (2008).Google Scholar
Tekin Bicer, Jian Yin, Dereck Chiu, Gagan Agrawal, and Karen Schuchardt. 2013. Integrating online compression to accelerate large-scale data analytics applications. In IEEE 27th International Symposium on Parallel 8 Distributed Processing (IPDPS’13). IEEE, 1205--1216.Google ScholarDigital Library
Mark Buckler, Wayne Burleson, and Greg Sadowski. 2013. Low-power networks-on-chip: Progress and remaining challenges. In 2013 IEEE International Symposium on Low Power Electronics and Design (ISLPED’13). IEEE, 132--134.Google ScholarCross Ref
Huy Bui, Hal Finkel, Venkatram Vishwanath, Salman Habib, Katrin Heitmann, Jason Leigh, Michael Papka, and Kevin Harms. 2014. Scalable parallel I/O on a blue gene/Q supercomputer using compression, topology-aware data aggregation, and subfiling. In 22nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP’14). IEEE, 107--111.Google ScholarDigital Library
Surendra Byna, Jiayuan Meng, Anand Raghunathan, Srimat Chakradhar, and Srihari Cadambi. 2010. Best-effort semantic document search on GPUs. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units. ACM, 86--93.Google ScholarDigital Library
Brad Calder, Glenn Reinman, and Dean M. Tullsen. 1999. Selective value prediction. In Proceedings of the 26th International Symposium on Computer Architecture. IEEE, 64--74.Google Scholar
Ramon Canal, Antonio González, and James E. Smith. 2000. Very low power pipelines using significance compression. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture. ACM, 181--190.Google Scholar
Vito Cappellini. 1985. Data Compression and Error Control Techniques with Applications. Academic Press, Inc., Cambridge, MA.Google Scholar
Michael Carbin, Sasa Misailovic, and Martin C. Rinard. 2013. Verifying quantitative reliability for programs that execute on unreliable hardware. In ACM SIGPLAN Notices, Vol. 48. ACM, 33--52.Google Scholar
Srimat T. Chakradhar and Anand Raghunathan. 2010. Best-effort computing: Re-thinking parallel software and hardware. In 47th ACM/IEEE Design Automation Conference (DAC’10). IEEE, 865--870.Google Scholar
Jie Chen and W. Watson. 2008. Software barrier performance on dual quad-core opterons. International Conference on Networking, Architecture, and Storage, 2008 (NAS’08). 303--309.Google ScholarDigital Library
Yen-Kuang Chen, Jatin Chhugani, Pradeep Dubey, Christopher J. Hughes, Daehyun Kim, Sanjeev Kumar, Victor W. Lee, Anthony D. Nguyen, and Mikhail Smelyanskiy. 2008. Convergence of recognition, mining, and synthesis workloads and its implications. Proceedings of IEEE 96, 5, 790--807.Google ScholarCross Ref
Vinay K. Chippa, Hrishikesh Jayakumar, Debabrata Mohapatra, Kaushik Roy, and Anand Raghunathan. 2013. Energy-efficient recognition and mining processor using scalable effort design. In IEEE Custom Integrated Circuits Conference (CICC’13). IEEE, 1--4.Google ScholarCross Ref
Vinay Kumar Chippa, Debabrata Mohapatra, Kaushik Roy, Srimat T. Chakradhar, and Anand Raghunathan. 2014. Scalable effort hardware design. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 22, 9, 2004--2016.Google ScholarCross Ref
Vinay K. Chippa, Swagath Venkataramani, Srimat T. Chakradhar, Kaushik Roy, and Anand Raghunathan. 2013. Approximate computing: An integrated hardware approach. In Asilomar Conference on Signals, Systems and Computers. IEEE, 111--117.Google ScholarCross Ref
Vinay K. Chippa, Swagath Venkataramani, Kaushik Roy, and Anand Raghunathan. 2014. StoRM: A stochastic recognition and mining processor. In Proceedings of the 2014 International Symposium on Low Power Electronics and Design. ACM, 39--44.Google ScholarDigital Library
Kyungsang Cho, Yongjun Lee, Young H. Oh, Gyoo-cheol Hwang, and Jae W. Lee. 2014. eDRAM-based tiered-reliability memory with applications to low-power frame buffers. In IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED’14). IEEE, 333--338.Google Scholar
Marcelo Cintra and Josep Torrellas. 2002. Eliminating squashes through learning cross-thread violations in speculative parallelization for multiprocessors. In Proceedings of the 8th International Symposium on High-Performance Computer Architecture. IEEE, 43--54.Google ScholarDigital Library
R. J. Cintra. 2011. An integer approximation method for discrete sinusoidal transforms. Circuits, Systems, and Signal Processing 30, 6, 1481--1501.Google ScholarDigital Library
Renato J. Cintra and Fábio M. Bayer. 2011. A DCT approximation for image compression. IEEE Signal Processing Letters 18, 10, 579--582.Google ScholarCross Ref
Daniel Citron and Larry Rudolph. 1995. Creating a wider bus using caching techniques. In Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture. IEEE, 90--99.Google ScholarDigital Library
Paul Coteus, H. Randall Bickford, Thomas M. Cipolla, Paul Crumley, Alan Gara, Shawn Hall, Gerard V. Kopcsay, Alphonso P. Lanzetta, Lawrence S. Mok, Rick A. Rand, Richard A. Swetz, Todd Takken, Paul La Rocca, Christopher Marroquin, Philip R. Germann, and Mark J. Jeanson. 2005. Packaging the blue gene/L supercomputer. IBM Journal of Research and Development 49, 2--3, 213--248.Google ScholarDigital Library
David E. Culler, Jaswinder Pal Singh, and Anoop Gupta. 1999. Parallel Computer Architecture: A Hardware/Software Approach. Gulf Professional Publishing, Houston, TX.Google Scholar
William J. Dally and Brian Towles. 2001. Route packets, not wires: On-chip interconnection networks. In Proceedings of the Design Automation Conference. IEEE, 684--689.Google Scholar
Reetuparna Das, Asit K. Mishra, Chrysostomos Nicopoulos, Dongkook Park, Vijaykrishnan Narayanan, Ravishankar Iyer, Mazin S. Yousif, and Chita R. Das. 2008. Performance and power optimization through data compression in network-on-chip architectures. In IEEE 14th International Symposium on High Performance Computer Architecture (HPCA’08). IEEE, 215--225.Google Scholar
Marc De Kruijf, Shuou Nomura, and Karthikeyan Sankaralingam. 2010. Relax: An architectural framework for software recovery of hardware faults. In ACM SIGARCH Computer Architecture News, Vol. 38. ACM, 497--508.Google ScholarDigital Library
Li Deng and Douglas O’Shaughnessy. 2003. Speech Processing: A Dynamic and Optimization-Oriented Approach. CRC Press, Boca Raton, FL.Google ScholarCross Ref
J. Dongarra, P. Luszczek, and A. Petitet. 2003. The LINPACK benchmark: Past, present, and future. Concurrency and Computation: Practice and Experience 15, 9, 803--820.Google ScholarCross Ref
Zidong Du, Avinash Lingamneni, Yunji Chen, Krishna Palem, Olivier Temam, and Chengyong Wu. 2014. Leveraging the error resilience of machine-learning applications for designing highly energy efficient accelerators. In 19th Asia and South Pacific Design Automation Conference (ASP-DAC’14). IEEE, 201--206.Google Scholar
Peter Düben, Jeremy Schlachter, Sreelatha Yenugula, John Augustine, Christian Enz, K. Palem, T. N. Palmer, and others. 2015. Opportunities for energy efficient computing: A study of inexact general purpose processors for high-performance and big-data applications. In Proceedings of the 2015 Design, Automation 8 Test in Europe Conference 8 Exhibition. EDA Consortium, 764--769.Google ScholarCross Ref
Pradeep Dubey. 2005. Recognition, mining and synthesis moves computers to the era of Tera. Technology@ Intel Magazine 9, 2, 1--10.Google Scholar
Peter Elias. 1955. Predictive coding--I. IRE Transactions on Information Theory 1, 1, 16--24.Google ScholarCross Ref
Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In International Symposium on Computer Architecture.Google ScholarDigital Library
Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. 2012. Architecture support for disciplined approximate programming. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems.Google ScholarDigital Library
Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. 2012. Neural acceleration for general-purpose approximate programs. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 449--460.Google ScholarDigital Library
Marius Evers, Po-Yung Chang, and Yale N. Patt. 1996. Using hybrid branch predictors to improve branch prediction accuracy in the presence of context switches. In ACM SIGARCH Computer Architecture News, Vol. 24. ACM, 3--11.Google Scholar
Yuntan Fang, Huawei Li, and Xiaowei Li. 2012. SoftPCM: Enhancing energy efficiency and lifetime of phase change memory in video applications via approximate write. In IEEE 21st Asian Test Symposium (ATS’12). IEEE, 131--136.Google ScholarDigital Library
Eric Freudenthal and Olivier Peze. 1988. Efficient Synchronization Algorithms Using Fetch-and-Add on Multiple Bitfield Integers. Ultracomputer Note 148.Google Scholar
Shrikanth Ganapathy, Georgios Karakonstantis, Adam Teman, and Andreas Burg. 2015. Mitigating the impact of faults in unreliable memories for error-resilient applications. In Proceedings of the 52nd Annual Design Automation Conference. ACM, 102.Google ScholarDigital Library
Bart Goeman, Hans Vandierendonck, and Koen De Bosschere. 2001. Differential FCM: Increasing value prediction accuracy by improving table usage efficiency. In 7th International Symposium on High-Performance Computer Architecture (HPCA’01). IEEE, 207--216.Google ScholarCross Ref
Inigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D. Nguyen. 2015. Approxhadoop: Bringing approximations to mapreduce frameworks. In ACM SIGARCH Computer Architecture News, Vol. 43. ACM, 383--397.Google Scholar
Jill R. Goldschneider. 1997. Lossy Compression of Scientific Data Via Wavelets and Vector Quantization. Ph.D. thesis, University of Washington, Seattle, WA. https://digital.lib.washington.edu/researchworks/handle/1773/5881?show=full.Google Scholar
Beayna Grigorian and Glenn Reinman. 2015. Accelerating divergent applications on SIMD architectures using neural networks. ACM Transactions on Architecture and Code Optimization 12, 1, 2.Google ScholarDigital Library
Vaibhav Gupta, Debabrata Mohapatra, Sang Phill Park, Anand Raghunathan, and Kaushik Roy. 2011. IMPACT: Imprecise adders for low-power approximate computing. In Proceedings of the 17th IEEE/ACM International Symposium on Low-power Electronics and Design. IEEE Press, 409--414.Google ScholarDigital Library
Erik G. Hallnor and Steven K. Reinhardt. 2004. A compressed memory hierarchy using an indirect index cache. In Proceedings of the 3rd Workshop on Memory Performance Issues: In Conjunction with the 31st International Symposium on Computer Architecture. ACM, 9--15.Google Scholar
Maurice Herlihy, J. Eliot, and B. Moss. 1993. Transactional Memory: Architectural Support for Lock-Free Data Structures. Vol. 21. ACM.Google ScholarDigital Library
T. Hoefler, T. Mehlan, F. Mietke, and W. Rehm. 2004. A survey of barrier algorithms for coarse grained supercomputers. Chemnitzer Informatik Berichte 4, 3 (2004).Google Scholar
Henry Hoffmann, Sasa Misailovic, Stelios Sidiroglou, Anant Agarwal, and Martin Rinard. 2009. Using code perforation to improve performance, reduce energy consumption, and respond to failures. Technical Report MIT-CSAIL-TR-2209-037, EECS, MIT.Google Scholar
Henry Hoffmann, Stelios Sidiroglou, Michael Carbin, Sasa Misailovic, Anant Agarwal, and Martin Rinard. 2011. Dynamic knobs for responsive power-aware computing. In ACM SIGPLAN Notices, Vol. 46. ACM, 199--212.Google ScholarDigital Library
Chih-Chieh Hsiao, Slo-Li Chu, and Chen-Yu Chen. 2013. Energy-aware hybrid precision selection framework for mobile GPUs. Computers 8 Graphics 37, 5, 431--444.Google Scholar
Jiawei Huang, John Lach, and Gabriel Robins. 2012. A methodology for energy-quality tradeoff using imprecise hardware. In Proceedings of the 49th Annual Design Automation Conference. ACM, 504--509.Google ScholarDigital Library
Jeremy Iverson, Chandrika Kamath, and George Karypis. 2012. Fast and effective lossy compression algorithms for scientific datasets. In Euro-Par 2012 Parallel Processing. Springer, 843--856.Google Scholar
Yuho Jin, Ki Hwan Yum, and Eun Jung Kim. 2008. Adaptive data compression for high-performance low-power on-chip networks. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 354--363.Google Scholar
Andrew B. Kahng and Seokhyeong Kang. 2012. Accuracy-configurable adder for approximate arithmetic designs. In Proceedings of the 49th Annual Design Automation Conference. ACM, 820--825.Google Scholar
Georgios Karakonstantis, Debabrata Mohapatra, and Kaushik Roy. 2012. Logic and memory design based on unequal error protection for voltage-scalable, robust and adaptive DSP systems. Journal of Signal Processing Systems 68, 3, 415--431.Google ScholarDigital Library
Georgios Keramidas, Chrysa Kokkala, and Iakovos Stamoulis. 2015. Clumsy value cache: An approximate memoization technique for mobile GPU fragment shaders. In Workshop on Approximate Computing (WAPCO’15).Google Scholar
Daya Shanker Khudia and Scott Mahlke. 2014. Harnessing soft computations for low-budget fault tolerance. In 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’14). IEEE, 319--330.Google ScholarDigital Library
Hyungjun Kim, Pritha Ghoshal, Boris Grot, Paul V. Gratz, and Daniel A. Jiménez. 2011. Reducing network-on-chip energy consumption through spatial locality speculation. In Proceedings of the 5th ACM/IEEE International Symposium on Networks-on-Chip. ACM, 233--240.Google Scholar
Chandra Krintz and Sezgin Sucu. 2006. Adaptive on-the-fly compression. IEEE Transactions on Parallel and Distributed Systems 17, 1, 15--24.Google ScholarDigital Library
Parag Kulkarni, Puneet Gupta, and Milos Ercegovac. 2011. Trading accuracy for power with an underdesigned multiplier architecture. In 24th International Conference on VLSI Design (VLSI Design’11). IEEE, 346--351.Google ScholarDigital Library
Didier Le Gall. 1991. MPEG: A video compression standard for multimedia applications. Communications of the ACM 34, 4, 46--58.Google ScholarDigital Library
Jae Bum Lee and Chu Shik Jhon. 1998. Reducing coherence overhead of barrier synchronization in software DSMs. In Supercomputing’98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing (CDROM). IEEE Computer Society, Washington, DC, USA, 1--18.Google Scholar
Jang-Soo Lee, Won-Kee Hong, and Shin-Dug Kim. 1999. Design and evaluation of a selective compressed memory system. In International Conference on Computer Design (ICCD’99). IEEE, 184--191.Google Scholar
Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo. 2006. Low-power network-on-chip for high-performance SoC design. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 14, 2, 148--160.Google ScholarDigital Library
Moon-Sang Lee, Young-Jae Kang, Joon-Won Lee, and Seung-Ryoul Maeng. 2002. OPTS: Increasing branch prediction accuracy under context switch. Microprocessors and Microsystems 26, 6, 291--300.Google ScholarCross Ref
Sungju Lee, Heegon Kim, Yongwha Chung, and Daihee Park. 2012. Energy efficient image/video data transmission on commercial multi-core processors. Sensors 12, 11, 14647--14670.Google ScholarCross Ref
Sangpil Lee, Keunsoo Kim, Gunjae Koo, Hyeran Jeon, Won Woo Ro, and Murali Annavaram. 2015. Warped-compression: Enabling power efficient GPUs through register compression. In Proceedings of the 42nd Annual International Symposium on Computer Architecture. ACM, 502--514.Google ScholarDigital Library
Larkhoon Leem, Hyungmin Cho, Jason Bau, Quinn A. Jacobson, and Subhasish Mitra. 2010. ERSA: Error resilient system architecture for probabilistic applications. In Design, Automation 8 Test in Europe Conference 8 Exhibition (DATE’10). IEEE, 1560--1565.Google Scholar
Debra A. Lelewer and Daniel S. Hirschberg. 1987. Data compression. ACM Computing Surveys 19, 3, 261--296.Google ScholarDigital Library
Krisda Lengwehasatit and Antonio Ortega. 2004. Scalable variable complexity approximate forward DCT. IEEE Transactions on Circuits and Systems for Video Technology 14, 11, 1236--1248.Google ScholarDigital Library
Mikko H. Lipasti and John Paul Shen. 1996. Exceeding the dataflow limit via value prediction. In Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, 226--237.Google Scholar
Mikko H. Lipasti, Christopher B. Wilkerson, and John Paul Shen. 1996. Value locality and load value prediction. ACM SIGOPS Operating Systems Review 30, 5, 138--147.Google ScholarDigital Library
Shaoshan Liu, Christine Eisenbeis, and Jean-Luc Gaudiot. 2010. A theoretical framework for value prediction in parallel systems. In 39th International Conference on Parallel Processing (ICPP’10). IEEE, 11--20.Google ScholarDigital Library
Song Liu, Karthik Pattabiraman, Thomas Moscibroda, and Benjamin G. Zorn. 2009. Flicker: Saving refresh-power in mobile devices through critical data partitioning. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’09).Google Scholar
Gabriel H. Loh, Nuwan Jayasena, M. Oskin, Mark Nutter, David Roberts, Mitesh Meswani, Dong Ping Zhang, and Mike Ignatowski. 2013. A processing in memory taxonomy and a case for studying fixed-function pim. In Workshop on Near-Data Processing (WoNDP’13).Google Scholar
Enrico Magli and Gabriella Olmo. 2003. Lossy predictive coding of SAR raw data. IEEE Transactions on Geoscience and Remote Sensing 41, 5, 977--987.Google ScholarCross Ref
Milo M. K. Martin, Daniel J. Sorin, Harold W. Cain, Mark D. Hill, and Mikko H. Lipasti. 2001. Correctly implementing value prediction in microprocessors that support multithreading or multiprocessing. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, 328--337.Google Scholar
John M. Mellor-Crummey and Michael L. Scott. 1991. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems 9, 1, 21--65. DOI:http://dx.doi.org/10.1145/103727.103729Google ScholarDigital Library
Jiayuan Meng, Srimat Chakradhar, and Anand Raghunathan. 2009. Best-effort parallel execution framework for recognition and mining applications. In IEEE International Symposium on Parallel 8 Distributed Processing (IPDPS’09). IEEE, 1--12.Google Scholar
Jiayuan Mengte, Anand Raghunathan, Srimat Chakradhar, and Surendra Byna. 2010. Exploiting the forgiving nature of applications for scalable parallel execution. IEEE International Symposium on Parallel 8 Distributed Processing (IPDPS). IEEE.Google ScholarCross Ref
Joshua San Miguel, Mario Badr, and Natalie Enright Jerger. 2014. Load value approximation. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 127--139.Google ScholarDigital Library
Sasa Misailovic, Michael Carbin, Sara Achour, Zichao Qi, and Martin C. Rinard. 2014. Chisel: Reliability-and accuracy-aware optimization of approximate computational kernels. In ACM SIGPLAN Notices, Vol. 49. ACM, 309--328.Google Scholar
Sasa Misailovic, Deokhwan Kim, and Martin Rinard. 2013. Parallelizing sequential programs with statistical accuracy tests. ACM Transactions on Embedded Computing Systems 12, 2s, 88.Google ScholarDigital Library
Sasa Misailovic, Stelios Sidiroglou, Henry Hoffmann, and Martin Rinard. 2010. Quality of service profiling. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 1. ACM, 25--34.Google ScholarDigital Library
Sasa Misailovic, Stelios Sidiroglou, and Martin C. Rinard. 2012. Dancing with uncertainty. In Proceedings of the 2012 ACM Workshop on Relaxing Synchronization for Multicore and Manycore Scalability. ACM, 51--60.Google Scholar
Sparsh Mittal. 2016. A survey of techniques for approximate computing. ACM Computing Surveys 48, 4, 62.Google ScholarDigital Library
Debabrata Mohapatra, Vinay K. Chippa, Anand Raghunathan, and Kaushik Roy. 2011. Design of voltage-scalable meta-functions for approximate computing. In Design, Automation 8 Test in Europe Conference 8 Exhibition (DATE’11). IEEE, 1--6.Google Scholar
Debabrata Mohapatra, Georgios Karakonstantis, and Kaushik Roy. 2009. Significance driven computation: A voltage-scalable, variation-aware, quality-tuning motion estimator. In Proceedings of the 2009 ACM/IEEE International Symposium on Low Power Electronics and Design. ACM, 195--200.Google ScholarDigital Library
Thierry Moreau, Mark Wyse, Jacob Nelson, Adrian Sampson, Hadi Esmaeilzadeh, Luis Ceze, and Mark Oskin. 2015. SNNAP: Approximate computing on programmable socs via neural acceleration. In IEEE 21st International Symposium on High Performance Computer Architecture (HPCA’15). IEEE, 603--614.Google ScholarCross Ref
Michel Mouly, Marie-Bernadette Pautet, and Thomas Foreword By-Haug. 1992. The GSM System for Mobile Communications. Telecom Publishing.Google Scholar
Tarun Nakra, Rajiv Gupta, and Mary Lou Soffa. 1999. Global context-based value prediction. In Proceedings of the 5th International Symposium on High-Performance Computer Architecture. IEEE, 4--12.Google ScholarCross Ref
Sriram Narayanan, John Sartori, Rakesh Kumar, and Douglas L. Jones. 2010. Scalable stochastic processors. In Proceedings of the Conference on Design, Automation and Test in Europe. European Design and Automation Association, 335--338.Google Scholar
D. Nikolopoulos and T. Papatheodorou. 2000. Fast synchronization on scalable cache-coherent multiprocessors using hybrid primitives. In Proceedings of the 14th International Symposium on Parallel and Distributed Processing (IPDPS’00). IEEE Computer Society, Washington, DC, USA, 711.Google Scholar
Peter Noll. 1997. MPEG digital audio coding. IEEE Signal Processing Magazine 14, 5, 59--81.Google ScholarCross Ref
NVIDIA. 2014. NVIDIA GTX 980 Whitepaper. Retrieved November 29, 2017 from https://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF.Google Scholar
Simon Ogg and Bashir Al-Hashimi. 2006. Improved data compression for serial interconnected network on chip through unused significant bit removal. In 19th International Conference on VLSI Design. Held jointly with 5th International Conference on Embedded Systems and Design. IEEE, 5 pp.Google ScholarDigital Library
Soontorn Oraintara, Ying-Jui Chen, and Truong Q. Nguyen. 2002. Integer fast Fourier transform. IEEE Transactions on Signal Processing 50, 3, 607--618.Google ScholarDigital Library
David J. Palframan, Nam Sung Kim, and Mikko H. Lipasti. 2014. Precision-aware soft error protection for GPUs. In IEEE 20th International Symposium on High Performance Computer Architecture (HPCA’14). IEEE, 49--59.Google Scholar
J. Thomas Pawlowski. 2011. Hybrid memory cube (HMC). In Hot Chips, Vol. 23.Google ScholarCross Ref
Gennady Pekhimenko, Evgeny Bolotin, Mike O’Connor, Onur Mutlu, Todd C. Mowry, and Steve Keckler. 2015. Toggle-aware compression for GPUs. In IEEE Computer Architecture Letters. 14, 2 (2015), 164--168. DOI:10.1109/LCA.2015.2430853Google ScholarDigital Library
Gennady Pekhimenko, Evgeny Bolotin, Nandita Vijaykumar, Onur Mutlu, Todd C. Mowry, and Stephen W. Keckler. 2016. A case for toggle-aware compression for GPU systems. In IEEE International Symposium on High Performance Computer Architecture (HPCA’16). IEEE, 188--200.Google Scholar
Arthur Perais and André Seznec. 2014. EOLE: Paving the way for an effective implementation of value prediction. In ACM SIGARCH Computer Architecture News, Vol. 42. IEEE Press, 481--492.Google ScholarDigital Library
Arthur Perais and André Seznec. 2014. Practical data value speculation for future high-end processors. In IEEE 20th International Symposium on High Performance Computer Architecture (HPCA’14). IEEE, 428--439.Google ScholarCross Ref
Arthur Perais and André Seznec. 2015. BeBoP: A cost effective predictor infrastructure for superscalar value prediction. In IEEE 21st International Symposium on High Performance Computer Architecture (HPCA’15). IEEE, 13--25.Google ScholarCross Ref
Calton Pu and Lenin Singaravelu. 2005. Fine-grain adaptive compression in dynamically variable networks. In Proceedings of the 25th IEEE International Conference on Distributed Computing Systems (ICDCS’05). IEEE, 685--694.Google ScholarDigital Library
Abbas Rahimi, Amirali Ghofrani, Kwang-Ting Cheng, Luca Benini, and Rajesh K. Gupta. 2015. Approximate associative memristive memory for energy-efficient GPUs. In Proceedings of the 2015 Design, Automation 8 Test in Europe Conference 8 Exhibition. EDA Consortium, 1497--1502.Google Scholar
Vara Ramakrishnan and Isaac D. Scherson. 1999. Efficient techniques for nested and disjoint barrier synchronization. Journal of Parallel and Distributed Computing 58, 2, 333--356. DOI:http://dx.doi.org/10.1006/jpdc.1999.1556
Easwaran Raman, Ram Rangan, David I. August, and others. 2008. Spice: Speculative parallel iteration chunk execution. In Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, 175--184.
Ashish Ranjan, Swagath Venkataramani, Xuanyao Fong, Kaushik Roy, and Anand Raghunathan. 2015. Approximate storage for energy efficient spintronic memories. In Proceedings of the 52nd Annual Design Automation Conference. ACM, 195.
Lakshminarayanan Renganarayana, Vijayalakshmi Srinivasan, Ravi Nair, and Daniel Prener. 2012. Programming with relaxed synchronization. In Proceedings of the 2012 ACM Workshop on Relaxing Synchronization for Multicore and Manycore Scalability. ACM, 41--50.
Martin Rinard. 2013. Parallel synchronization-free approximate data structure construction. In HotPar.
Martin C. Rinard. 2012. Unsynchronized techniques for approximate parallel computing. In RACES Workshop.
Antonio Roldao-Lopes, Amir Shahzad, George A. Constantinides, and Eric C. Kerrigan. 2009. More flops or more precision? Accuracy parameterizable linear equation solvers for model predictive control. In 17th IEEE Symposium on Field Programmable Custom Computing Machines (FCCM’09). IEEE, 209--216.
Cindy Rubio-González, Cuong Nguyen, Hong Diep Nguyen, James Demmel, William Kahan, Koushik Sen, David H. Bailey, Costin Iancu, and David Hough. 2013. Precimonious: Tuning assistant for floating-point precision. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 27.
David Salomon. 2004. Data Compression: The Complete Reference. Springer Science & Business Media, New York, NY.
Mehrzad Samadi, Davoud Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke. 2014. Paraprox: Pattern-based approximation for data parallel applications. In ACM SIGARCH Computer Architecture News, Vol. 42. ACM, 35--50.
Mehrzad Samadi, Janghaeng Lee, D. Anoushe Jamshidi, Amir Hormati, and Scott Mahlke. 2013. Sage: Self-tuning approximation for graphics engines. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 13--24.
Adrian Sampson, Werner Dietl, Emily Fortuna, Danushen Gnanapragasam, Luis Ceze, and Dan Grossman. 2011. EnerJ: Approximate data types for safe and general low-power computation. In ACM SIGPLAN Notices, Vol. 46. ACM, 164--174.
Adrian Sampson, Jacob Nelson, Karin Strauss, and Luis Ceze. 2014. Approximate storage in solid-state memories. ACM Transactions on Computer Systems 32, 3, 9.
Jack Sampson, Ruben Gonzalez, Jean-Francois Collard, Norman P. Jouppi, Mike Schlansker, and Brad Calder. 2006. Exploiting fine-grained data parallelism with chip multiprocessors and fast barriers. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, Washington, DC, USA, 235--246. DOI:http://dx.doi.org/10.1109/MICRO.2006.23
Joshua San Miguel and N. Enright Jerger. 2014. Load value approximation: Approaching the ideal memory access latency. In Workshop on Approximate Computing Across the System Stack.
J. Sartori and R. Kumar. 2010. Low-overhead, high-speed multi-core barrier synchronization. In HiPEAC. 18--34.
John Sartori and Rakesh Kumar. 2013. Branch and data herding: Reducing control and memory divergence for error-tolerant GPU applications. IEEE Transactions on Multimedia 15, 2, 279--290.
Vijay Sathish, Michael J. Schulte, and Nam Sung Kim. 2012. Lossless and lossy memory I/O link compression for improving performance of GPGPU workloads. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. ACM, 325--334.
Yiannakis Sazeides and James E. Smith. 1997. The predictability of data values. In Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 248--258.
Steven L. Scott. 1996. Synchronization and communication in the T3E multiprocessor. SIGOPS Operating Systems Review 30, 5, 26--36. DOI:http://dx.doi.org/10.1145/248208.237144
Marko Scrbak, Mahzabeen Islam, Krishna M. Kavi, Mike Ignatowski, and Nuwan Jayasena. 2015. Processing-in-memory: Exploring the design space. In Architecture of Computing Systems (ARCS’15). Springer, 43--54.
André Seznec. 2011. A new case for the TAGE branch predictor. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 117--127.
Ali Shafiee, Meysam Taassori, Rajeev Balasubramonian, and A. K. Davis. 2014. MemZip: Exploring unconventional benefits from memory compression. In IEEE 20th International Symposium on High Performance Computer Architecture (HPCA’14). IEEE, 638--649.
Li Shang, Li-Shiuan Peh, and Niraj K. Jha. 2003. Dynamic voltage scaling with links for power optimization of interconnection networks. In Proceedings of the 9th International Symposium on High-Performance Computer Architecture (HPCA’03). IEEE, 91--102.
Shisheng Shang and Kai Hwang. 1995. Distributed hardwired barrier synchronization for scalable multiprocessor clusters. IEEE Transactions on Parallel and Distributed Systems 6, 6, 591--605. DOI:http://dx.doi.org/10.1109/71.388040
Majid Shoushtari, Abbas BanaiyanMofrad, and Nikil Dutt. 2015. Exploiting partially-forgetful memories for approximate computing. IEEE Embedded Systems Letters 7, 1, 19--22.
Stelios Sidiroglou-Douskos, Sasa Misailovic, Henry Hoffmann, and Martin Rinard. 2011. Managing performance vs. accuracy trade-offs with loop perforation. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. ACM, 124--134.
María Soler and José Flich. 2013. Power saving by NoC traffic compression. In European Conference on Parallel Processing. Springer, 465--476.
Renée St. Amant, Amir Yazdanbakhsh, Jongse Park, Bradley Thwaites, Hadi Esmaeilzadeh, Arjang Hassibi, Luis Ceze, and Doug Burger. 2014. General-purpose code acceleration with limited-precision analog computation. ACM SIGARCH Computer Architecture News 42, 3, 505--516.
J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry. 2002. Improving value communication for thread-level speculation. In Proceedings of the 8th International Symposium on High-Performance Computer Architecture. IEEE, 65--75.
Ayswarya Sundaram, Ameen Aakel, Derek Lockhart, Darshan Thaker, and Diana Franklin. 2008. Efficient fault tolerance in multi-media applications through selective instruction replication. In Proceedings of the 2008 Workshop on Radiation Effects and Fault Tolerance in Nanometer Technologies. ACM, 339--346.
Mark Sutherland, Joshua San Miguel, and Natalie Enright Jerger. 2015. Texture cache approximation on GPUs. In Workshop on Approximate Computing Across the Stack.
M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference.
Bradley Thwaites, Gennady Pekhimenko, Hadi Esmaeilzadeh, Amir Yazdanbakhsh, Onur Mutlu, Jongse Park, Girish Mururu, and Todd Mowry. 2014. Rollback-free value prediction with approximate loads. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation. ACM, 493--494.
Ye Tian, Qian Zhang, Ting Wang, Feng Yuan, and Qiang Xu. 2015. ApproxMA: Approximate memory access for dynamic precision scaling. In Proceedings of the 25th Edition on Great Lakes Symposium on VLSI. ACM, 337--342.
Vassilis Vassiliadis, Konstantinos Parasyris, Charalambos Chalios, Christos D. Antonopoulos, Spyros Lalis, Nikolaos Bellas, Hans Vandierendonck, and Dimitrios S. Nikolopoulos. 2015. A programming model and runtime system for significance-aware energy-efficient computing. In ACM SIGPLAN Notices, Vol. 50. ACM, 275--276.
Swagath Venkataramani, Srimat T. Chakradhar, Kaushik Roy, and Anand Raghunathan. 2015. Approximate computing and the quest for computing efficiency. In Proceedings of the 52nd Annual Design Automation Conference. ACM, 120.
Swagath Venkataramani, Vinay K. Chippa, Srimat T. Chakradhar, Kaushik Roy, and Anand Raghunathan. 2013. Quality programmable vector processors for approximate computing. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 1--12.
Swagath Venkataramani, Ashish Ranjan, Kaushik Roy, and Anand Raghunathan. 2014. AxNN: Energy-efficient neuromorphic systems using approximate computing. In Proceedings of the 2014 International Symposium on Low Power Electronics and Design. ACM, 27--32.
Swagath Venkataramani, Kaushik Roy, and Anand Raghunathan. 2013. Substitute-and-simplify: A unified design paradigm for approximate and quality configurable circuits. In Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, 1367--1372.
Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir, Todd C. Mowry, and Onur Mutlu. 2015. A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps. In ACM SIGARCH Computer Architecture News, Vol. 43. ACM, 41--53.
Oreste Villa, Gianluca Palermo, and Cristina Silvano. 2008. Efficiency and scalability of barrier synchronization on NoC based many-core architectures. In CASES’08: Proceedings of the 2008 International Conference on Compilers, Architectures and Synthesis for Embedded Systems. ACM, New York, NY, USA, 81--90. DOI:http://dx.doi.org/10.1145/1450095.1450110
Gregory K. Wallace. 1992. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics 38, 1, xviii--xxxiv.
Kai Wang and Manoj Franklin. 1997. Highly accurate data value prediction using hybrid predictors. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, 281--290.
Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. 2003. Multiscale structural similarity for image quality assessment. In Conference Record of the 37th Asilomar Conference on Signals, Systems and Computers, Vol. 2. IEEE, 1398--1402.
Terry A. Welch. 1984. A technique for high-performance data compression. Computer 17, 6, 8--19.
Benjamin Welton, Dries Kimpe, Jason Cope, Christina M. Patrick, Kamil Iskra, and Robert Ross. 2011. Improving I/O forwarding throughput with data compression. In IEEE International Conference on Cluster Computing (CLUSTER’11). IEEE, 438--445.
Yair Wiseman, Karsten Schwan, and Patrick Widener. 2005. Efficient end to end data exchange using configurable compression. ACM SIGOPS Operating Systems Review 39, 3, 4--23.
Qiang Xu, Todd Mytkowicz, and Nam Sung Kim. 2016. Approximate computing: A survey. IEEE Design & Test 33, 1, 8--22.
Xin Xu and H. Howie Huang. 2015. Exploring data-level error tolerance in high-performance solid-state drives. IEEE Transactions on Reliability 64, 1, 15--30.
Amir Yazdanbakhsh, Gennady Pekhimenko, Bradley Thwaites, Hadi Esmaeilzadeh, Onur Mutlu, and Todd C. Mowry. 2016. RFVP: Rollback-free value prediction with safe-to-approximate loads. ACM Transactions on Architecture and Code Optimization 12, 4, 62.
Rong Ye, Ting Wang, Feng Yuan, Rakesh Kumar, and Qiang Xu. 2013. On reconfiguration-oriented approximate adder design and its application. In Proceedings of the International Conference on Computer-Aided Design. IEEE Press, 48--54.
Yavuz Yetim, Sharad Malik, and Margaret Martonosi. 2015. CommGuard: Mitigating communication errors in error-prone parallel execution. In ACM SIGPLAN Notices, Vol. 50. ACM, 311--323.
Yavuz Yetim, Margaret Martonosi, and Sharad Malik. 2013. Extracting useful computation from error-prone processors for streaming applications. In Design, Automation & Test in Europe Conference & Exhibition (DATE’13). IEEE, 202--207.
Felix Zahn, Steffen Lammel, and Holger Fröning. 2017. Early experiences with saving energy in direct interconnection networks. In IEEE 3rd International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB’17). IEEE, 33--40.
Felix Zahn, Pedro Yebenes, Steffen Lammel, Pedro J. Garcia, and Holger Fröning. 2016. Analyzing the energy (dis-)proportionality of scalable interconnection networks. In 2nd IEEE International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB’16). IEEE, 25--32.
Hang Zhang, Mateja Putic, and John Lach. 2014. Low power GPGPU computation with imprecise hardware. In Proceedings of the 51st Annual Design Automation Conference. ACM, 1--6.
Qian Zhang, Ting Wang, Ye Tian, Feng Yuan, and Qiang Xu. 2015. ApproxANN: An approximate computing framework for artificial neural network. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition. EDA Consortium, 701--706.
Huiyang Zhou and Thomas M. Conte. 2005. Enhancing memory-level parallelism via recovery-free value prediction. IEEE Transactions on Computers 54, 7, 897--912.
Qiuling Zhu, Bilal Akin, H. Ekin Sumbul, Fazle Sadi, James C. Hoe, Larry Pileggi, and Franz Franchetti. 2013. A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. In IEEE International 3D Systems Integration Conference (3DIC’13). IEEE, 1--7.
Weirong Zhu, Vugranam C. Sreedhar, Ziang Hu, and Guang R. Gao. 2007. Synchronization state buffer: Supporting efficient fine-grain synchronization on many-core architectures. In Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA’07). ACM, New York, NY, USA, 35--45. DOI:http://dx.doi.org/10.1145/1250662.1250668

Index Terms

Approximate Communication: Techniques for Reducing Communication Bottlenecks in Large-Scale Parallel Systems
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
2. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
      1. Massively parallel algorithms

Published in
ACM Computing Surveys Volume 51, Issue 1
January 2019
743 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/3177787
Editor:
Sartaj Sahni
Department of Computer and Information Science and Engineering / University of Florida / Gainesville, FL
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 10 January 2018
- Accepted: 1 September 2017
- Revised: 1 July 2017
- Received: 1 July 2016
Author Tags
Approximate communication
approximate computing
communication reduction
scalability
Qualifiers
- survey
- Research
- Refereed