survey

Approximate Communication: Techniques for Reducing Communication Bottlenecks in Large-Scale Parallel Systems

Published: 10 January 2018

Abstract

Approximate computing has recently gained research attention as a way to improve energy efficiency and/or performance by exploiting the intrinsic error resilience of some applications. However, little attention has been paid to its potential for addressing the communication bottleneck, which remains one of the major obstacles to efficient parallelism. This article explores how approximate computing can reduce communication by surveying three promising techniques for approximate communication: compression, relaxed synchronization, and value prediction. The techniques are compared using an evaluation framework with six criteria: communication cost reduction, performance, energy reduction, applicability, overheads, and output degradation. The comparison shows that lossy link compression and approximate value prediction hold great promise for reducing the communication bottleneck in bandwidth-constrained applications. Relaxed synchronization, by contrast, can provide large speedups for select error-tolerant applications, but it has limited general applicability and offers unreliable guarantees on output degradation. The article concludes with several suggestions for future research on approximate communication techniques.
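To make the idea of lossy link compression concrete, here is a minimal sketch under stated assumptions. It is not the scheme of any particular surveyed work: it simply truncates the low-order mantissa bits of IEEE 754 single-precision values before they cross a link and pads them back with zeros on the receiving side. The function names `compress`/`decompress` and the parameter `keep_mantissa_bits` are hypothetical, chosen only for illustration.

```python
import struct


def float_to_bits(x: float) -> int:
    """Round x to IEEE 754 single precision and return its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(b: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def compress(values, keep_mantissa_bits):
    """Lossy sender side (hypothetical): keep sign, exponent, and the top
    keep_mantissa_bits of the 23-bit mantissa; drop the rest.
    Each word then needs only 9 + keep_mantissa_bits bits on the link."""
    if not 0 <= keep_mantissa_bits <= 23:
        raise ValueError("keep_mantissa_bits must be in [0, 23]")
    drop = 23 - keep_mantissa_bits
    return [float_to_bits(v) >> drop for v in values]


def decompress(words, keep_mantissa_bits):
    """Receiver side: shift the truncated words back, zero-filling the
    dropped mantissa bits (truncation toward zero in magnitude)."""
    drop = 23 - keep_mantissa_bits
    return [bits_to_float(w << drop) for w in words]


if __name__ == "__main__":
    data = [3.14159, -2.5, 1e-3, 1234.5678]
    packed = compress(data, keep_mantissa_bits=7)
    print(decompress(packed, keep_mantissa_bits=7))
```

With `keep_mantissa_bits=7`, each value occupies 16 bits instead of 32 (the same bit layout as bfloat16, here produced by truncation rather than rounding), halving the data sent over the link. For normal numbers the relative error stays below 2^-7, which is the kind of bounded output degradation an error-tolerant application would have to accept. NaN payloads and other special values are not treated carefully in this sketch.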

  124. Easwaran Raman, Ram Rangan, David I. August, and others. 2008. Spice: Speculative parallel iteration chunk execution. In Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, 175--184.Google ScholarGoogle ScholarDigital LibraryDigital Library
  125. Ashish Ranjan, Swagath Venkataramani, Xuanyao Fong, Kaushik Roy, and Anand Raghunathan. 2015. Approximate storage for energy efficient spintronic memories. In Proceedings of the 52nd Annual Design Automation Conference. ACM, 195.Google ScholarGoogle ScholarDigital LibraryDigital Library
  126. Lakshminarayanan Renganarayana, Vijayalakshmi Srinivasan, Ravi Nair, and Daniel Prener. 2012. Programming with relaxed synchronization. In Proceedings of the 2012 ACM Workshop on Relaxing Synchronization for Multicore and Manycore Scalability. ACM, 41--50.Google ScholarGoogle ScholarDigital LibraryDigital Library
  127. Martin Rinard. 2013. Parallel synchronization-free approximate data structure construction. In HotPar.Google ScholarGoogle Scholar
  128. Martin C. Rinard. 2012. Unsynchronized techniques for approximate parallel computing. In RACES Workshop.Google ScholarGoogle Scholar
  129. Antonio Roldao-Lopes, Amir Shahzad, George A. Constantinides, and Eric C. Kerrigan. 2009. More flops or more precision? Accuracy parameterizable linear equation solvers for model predictive control. In 17th IEEE Symposium on Field Programmable Custom Computing Machines (FCCM’09). IEEE, 209--216.Google ScholarGoogle Scholar
  130. Cindy Rubio-González, Cuong Nguyen, Hong Diep Nguyen, James Demmel, William Kahan, Koushik Sen, David H. Bailey, Costin Iancu, and David Hough. 2013. Precimonious: Tuning assistant for floating-point precision. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 27.Google ScholarGoogle ScholarDigital LibraryDigital Library
  131. David Salomon. 2004. Data Compression: The Complete Reference. Springer Science 8 Business Media, New York, NY.Google ScholarGoogle ScholarDigital LibraryDigital Library
  132. Mehrzad Samadi, Davoud Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke. 2014. Paraprox: Pattern-based approximation for data parallel applications. In ACM SIGARCH Computer Architecture News, Vol. 42. ACM, 35--50.Google ScholarGoogle Scholar
  133. Mehrzad Samadi, Janghaeng Lee, D. Anoushe Jamshidi, Amir Hormati, and Scott Mahlke. 2013. Sage: Self-tuning approximation for graphics engines. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 13--24.Google ScholarGoogle ScholarDigital LibraryDigital Library
  134. Adrian Sampson, Werner Dietl, Emily Fortuna, Danushen Gnanapragasam, Luis Ceze, and Dan Grossman. 2011. EnerJ: Approximate data types for safe and general low-power computation. In ACM SIGPLAN Notices, Vol. 46. ACM, 164--174.Google ScholarGoogle ScholarDigital LibraryDigital Library
  135. Adrian Sampson, Jacob Nelson, Karin Strauss, and Luis Ceze. 2014. Approximate storage in solid-state memories. ACM Transactions on Computer Systems 32, 3, 9.Google ScholarGoogle ScholarDigital LibraryDigital Library
  136. Jack Sampson, Ruben Gonzalez, Jean-Francois Collard, Norman P. Jouppi, Mike Schlansker, and Brad Calder. 2006. Exploiting fine-grained data parallelism with chip multiprocessors and fast barriers. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, Washington, DC, USA, 235--246. DOI:http://dx.doi.org/10.1109/MICRO.2006.23Google ScholarGoogle ScholarDigital LibraryDigital Library
  137. Joshua San Miguel and N. Enright Jerger. 2014. Load value approximation: Approaching the ideal memory access latency. In Workshop on Approximate Computing Across the System Stack.Google ScholarGoogle Scholar
  138. J. Sartori and R. Kumar. 2010. Low-overhead, high-speed multi-core barrier synchronization. In HiPEAC. 18--34.Google ScholarGoogle Scholar
  139. John Sartori and Rakesh Kumar. 2013. Branch and data herding: Reducing control and memory divergence for error-tolerant GPU applications. IEEE Transactions on Multimedia 15, 2, 279--290.Google ScholarGoogle ScholarDigital LibraryDigital Library
  140. Vijay Sathish, Michael J. Schulte, and Nam Sung Kim. 2012. Lossless and lossy memory I/O link compression for improving performance of GPGPU workloads. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. ACM, 325--334.Google ScholarGoogle ScholarDigital LibraryDigital Library
  141. Yiannakis Sazeides and James E. Smith. 1997. The predictability of data values. In Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 248--258.Google ScholarGoogle Scholar
  142. Steven L. Scott. 1996. Synchronization and communication in the T3E multiprocessor. SIGOPS Operating Systems Review 30, 5, 26--36. DOI:http://dx.doi.org/10.1145/248208.237144Google ScholarGoogle ScholarDigital LibraryDigital Library
  143. Marko Scrbak, Mahzabeen Islam, Krishna M. Kavi, Mike Ignatowski, and Nuwan Jayasena. 2015. Processing-in-memory: Exploring the design space. In Architecture of Computing Systems (ARCS’15). Springer, 43--54.Google ScholarGoogle Scholar
  144. André Seznec. 2011. A new case for the TAGE branch predictor. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 117--127.Google ScholarGoogle ScholarDigital LibraryDigital Library
  145. Ali Shafiee, Meysam Taassori, Rajeev Balasubramonian, and A. K. Davis. 2014. MemZip: Exploring unconventional benefits from memory compression. In IEEE 20th International Symposium on High Performance Computer Architecture (HPCA’14). IEEE, 638--649.Google ScholarGoogle Scholar
  146. Li Shang, Li-Shiuan Peh, and Niraj K. Jha. 2003. Dynamic voltage scaling with links for power optimization of interconnection networks. In Proceedings of the 9th International Symposium on High-Performance Computer Architecture (HPCA’03). IEEE, 91--102.Google ScholarGoogle Scholar
  147. Shisheng Shang and Kai Hwang. 1995. Distributed hardwired barrier synchronization for scalable multiprocessor clusters. IEEE Transactions on Parallel and Distributed Systems 6, 6, 591--605. DOI:http://dx.doi.org/10.1109/71.388040
  148. Majid Shoushtari, Abbas BanaiyanMofrad, and Nikil Dutt. 2015. Exploiting partially-forgetful memories for approximate computing. IEEE Embedded Systems Letters 7, 1, 19--22.
  149. Stelios Sidiroglou-Douskos, Sasa Misailovic, Henry Hoffmann, and Martin Rinard. 2011. Managing performance vs. accuracy trade-offs with loop perforation. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. ACM, 124--134.
  150. María Soler and José Flich. 2013. Power saving by NoC traffic compression. In European Conference on Parallel Processing. Springer, 465--476.
  151. Renée St. Amant, Amir Yazdanbakhsh, Jongse Park, Bradley Thwaites, Hadi Esmaeilzadeh, Arjang Hassibi, Luis Ceze, and Doug Burger. 2014. General-purpose code acceleration with limited-precision analog computation. ACM SIGARCH Computer Architecture News 42, 3, 505--516.
  152. J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry. 2002. Improving value communication for thread-level speculation. In Proceedings of the 8th International Symposium on High-Performance Computer Architecture. IEEE, 65--75.
  153. Ayswarya Sundaram, Ameen Aakel, Derek Lockhart, Darshan Thaker, and Diana Franklin. 2008. Efficient fault tolerance in multi-media applications through selective instruction replication. In Proceedings of the 2008 Workshop on Radiation Effects and Fault Tolerance in Nanometer Technologies. ACM, 339--346.
  154. Mark Sutherland, Joshua San Miguel, and Natalie Enright Jerger. 2015. Texture cache approximation on GPUs. In Workshop on Approximate Computing Across the Stack.
  155. M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference.
  156. Bradley Thwaites, Gennady Pekhimenko, Hadi Esmaeilzadeh, Amir Yazdanbakhsh, Onur Mutlu, Jongse Park, Girish Mururu, and Todd Mowry. 2014. Rollback-free value prediction with approximate loads. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation. ACM, 493--494.
  157. Ye Tian, Qian Zhang, Ting Wang, Feng Yuan, and Qiang Xu. 2015. ApproxMA: Approximate memory access for dynamic precision scaling. In Proceedings of the 25th Edition on Great Lakes Symposium on VLSI. ACM, 337--342.
  158. Vassilis Vassiliadis, Konstantinos Parasyris, Charalambos Chalios, Christos D. Antonopoulos, Spyros Lalis, Nikolaos Bellas, Hans Vandierendonck, and Dimitrios S. Nikolopoulos. 2015. A programming model and runtime system for significance-aware energy-efficient computing. In ACM SIGPLAN Notices, Vol. 50. ACM, 275--276.
  159. Swagath Venkataramani, Srimat T. Chakradhar, Kaushik Roy, and Anand Raghunathan. 2015. Approximate computing and the quest for computing efficiency. In Proceedings of the 52nd Annual Design Automation Conference. ACM, 120.
  160. Swagath Venkataramani, Vinay K. Chippa, Srimat T. Chakradhar, Kaushik Roy, and Anand Raghunathan. 2013. Quality programmable vector processors for approximate computing. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 1--12.
  161. Swagath Venkataramani, Ashish Ranjan, Kaushik Roy, and Anand Raghunathan. 2014. AxNN: Energy-efficient neuromorphic systems using approximate computing. In Proceedings of the 2014 International Symposium on Low Power Electronics and Design. ACM, 27--32.
  162. Swagath Venkataramani, Kaushik Roy, and Anand Raghunathan. 2013. Substitute-and-simplify: A unified design paradigm for approximate and quality configurable circuits. In Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, 1367--1372.
  163. Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir, Todd C. Mowry, and Onur Mutlu. 2015. A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps. In ACM SIGARCH Computer Architecture News, Vol. 43. ACM, 41--53.
  164. Oreste Villa, Gianluca Palermo, and Cristina Silvano. 2008. Efficiency and scalability of barrier synchronization on NoC based many-core architectures. In CASES’08: Proceedings of the 2008 International Conference on Compilers, Architectures and Synthesis for Embedded Systems. ACM, New York, NY, USA, 81--90. DOI:http://dx.doi.org/10.1145/1450095.1450110
  165. Gregory K. Wallace. 1992. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics 38, 1, xviii--xxxiv.
  166. Kai Wang and Manoj Franklin. 1997. Highly accurate data value prediction using hybrid predictors. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, 281--290.
  167. Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. 2003. Multiscale structural similarity for image quality assessment. In Conference Record of the 37th Asilomar Conference on Signals, Systems and Computers, Vol. 2. IEEE, 1398--1402.
  168. Terry A. Welch. 1984. A technique for high-performance data compression. Computer 17, 6, 8--19.
  169. Benjamin Welton, Dries Kimpe, Jason Cope, Christina M. Patrick, Kamil Iskra, and Robert Ross. 2011. Improving I/O forwarding throughput with data compression. In IEEE International Conference on Cluster Computing (CLUSTER’11). IEEE, 438--445.
  170. Yair Wiseman, Karsten Schwan, and Patrick Widener. 2005. Efficient end to end data exchange using configurable compression. ACM SIGOPS Operating Systems Review 39, 3, 4--23.
  171. Qiang Xu, Todd Mytkowicz, and Nam Sung Kim. 2016. Approximate computing: A survey. IEEE Design & Test 33, 1, 8--22.
  172. Xin Xu and H. Howie Huang. 2015. Exploring data-level error tolerance in high-performance solid-state drives. IEEE Transactions on Reliability 64, 1, 15--30.
  173. Amir Yazdanbakhsh, Gennady Pekhimenko, Bradley Thwaites, Hadi Esmaeilzadeh, Onur Mutlu, and Todd C. Mowry. 2016. RFVP: Rollback-free value prediction with safe-to-approximate loads. ACM Transactions on Architecture and Code Optimization 12, 4, 62.
  174. Rong Ye, Ting Wang, Feng Yuan, Rakesh Kumar, and Qiang Xu. 2013. On reconfiguration-oriented approximate adder design and its application. In Proceedings of the International Conference on Computer-Aided Design. IEEE Press, 48--54.
  175. Yavuz Yetim, Sharad Malik, and Margaret Martonosi. 2015. CommGuard: Mitigating communication errors in error-prone parallel execution. In ACM SIGPLAN Notices, Vol. 50. ACM, 311--323.
  176. Yavuz Yetim, Margaret Martonosi, and Sharad Malik. 2013. Extracting useful computation from error-prone processors for streaming applications. In Design, Automation & Test in Europe Conference & Exhibition (DATE’13). IEEE, 202--207.
  177. Felix Zahn, Steffen Lammel, and Holger Fröning. 2017. Early experiences with saving energy in direct interconnection networks. In IEEE 3rd International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB’17). IEEE, 33--40.
  178. Felix Zahn, Pedro Yebenes, Steffen Lammel, Pedro J. Garcia, and Holger Fröning. 2016. Analyzing the energy (dis-)proportionality of scalable interconnection networks. In 2nd IEEE International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB’16). IEEE, 25--32.
  179. Hang Zhang, Mateja Putic, and John Lach. 2014. Low power GPGPU computation with imprecise hardware. In Proceedings of the 51st Annual Design Automation Conference. ACM, 1--6.
  180. Qian Zhang, Ting Wang, Ye Tian, Feng Yuan, and Qiang Xu. 2015. ApproxANN: An approximate computing framework for artificial neural network. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition. EDA Consortium, 701--706.
  181. Huiyang Zhou and Thomas M. Conte. 2005. Enhancing memory-level parallelism via recovery-free value prediction. IEEE Transactions on Computers 54, 7, 897--912.
  182. Qiuling Zhu, Bilal Akin, H. Ekin Sumbul, Fazle Sadi, James C. Hoe, Larry Pileggi, and Franz Franchetti. 2013. A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. In IEEE International 3D Systems Integration Conference (3DIC’13). IEEE, 1--7.
  183. Weirong Zhu, Vugranam C. Sreedhar, Ziang Hu, and Guang R. Gao. 2007. Synchronization state buffer: Supporting efficient fine-grain synchronization on many-core architectures. In Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA’07). ACM, New York, NY, USA, 35--45. DOI:http://dx.doi.org/10.1145/1250662.1250668

          • Published in

            ACM Computing Surveys, Volume 51, Issue 1
            January 2019
            743 pages
            ISSN: 0360-0300
            EISSN: 1557-7341
            DOI: 10.1145/3177787
            Editor: Sartaj Sahni

            Copyright © 2018 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 10 January 2018
            • Accepted: 1 September 2017
            • Revised: 1 July 2017
            • Received: 1 July 2016
            Published in csur Volume 51, Issue 1

            Qualifiers

            • survey
            • Research
            • Refereed
