Abstract
Approximate computing has gained research attention recently as a way to increase energy efficiency and/or performance by exploiting some applications’ intrinsic error resiliency. However, little attention has been given to its potential for tackling the communication bottleneck that remains one of the looming challenges to be tackled for efficient parallelism. This article explores the potential benefits of approximate computing for communication reduction by surveying three promising techniques for approximate communication: compression, relaxed synchronization, and value prediction. The techniques are compared based on an evaluation framework composed of communication cost reduction, performance, energy reduction, applicability, overheads, and output degradation. Comparison results demonstrate that lossy link compression and approximate value prediction show great promise for reducing the communication bottleneck in bandwidth-constrained applications. Meanwhile, relaxed synchronization is found to provide large speedups for select error-tolerant applications, but suffers from limited general applicability and unreliable output degradation guarantees. Finally, this article concludes with several suggestions for future research on approximate communication techniques.
- Tor M. Aamodt and Paul Chow. 2008. Compile-time and instruction-set methods for improving floating-to fixed-point conversion accuracy. ACM Transactions on Embedded Computing Systems 7, 3, 26.Google ScholarDigital Library
- Bülent Abali, Hubertus Franke, Dan E. Poff, Robert A. Saccone, Jr., Charles O. Schulz, Lorraine M. Herger, and T. Basil Smith. 2001. Memory expansion technology (MXT): software support and performance. IBM Journal of Research and Development 45, 2, 287--301.Google ScholarDigital Library
- Don Adams. 1993. CRAY T3D System Architecture Overview Manual. Retrieved November 29, 2017 from ftp://ftp.cray.com/product-info/mpp/T3D_Architecture_Over/T3D.overview.html.Google Scholar
- Ismail Akturk, Karen Khatamifard, and Ulya R. Karpuzcu. 2015. On quantification of accuracy loss in approximate computing. In Workshop on Duplicating, Deconstructing and Debunking (WDDD’15). 15.Google Scholar
- Alaa R. Alameldeen and David A. Wood. 2004. Adaptive cache compression for high-performance processors. In Proceedings of the 31st Annual International Symposium on Computer Architecture. IEEE, 212--223.Google Scholar
- Alaa R. Alameldeen and David A. Wood. 2007. Interactions between compression and prefetching in chip multiprocessors. In IEEE 13th International Symposium on High Performance Computer Architecture (HPCA’07). IEEE, 228--239.Google Scholar
- George Almási, Philip Heidelberger, Charles J. Archer, Xavier Martorell, C. Chris Erway, José E. Moreira, B. Steinmacher-Burow, and Yili Zheng. 2005. Optimization of MPI collective communication on BlueGene/L systems. In Proceedings of the 19th Annual International Conference on Supercomputing (ICS’05). ACM, New York, NY, 253--262. DOI:http://dx.doi.org/10.1145/1088149.1088183Google ScholarDigital Library
- Carlos Alvarez, Jesus Corbal, and Mateo Valero. 2005. Fuzzy memoization for floating-point multimedia applications. IEEE Transactions on Computers 54, 7, 922--927.Google ScholarDigital Library
- Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference. ACM, 483--485.Google ScholarDigital Library
- Baik Song An, Manhee Lee, Ki Hwan Yum, and Eun Jung Kim. 2012. Efficient data packet compression for cache coherent multiprocessor systems. In Data Compression Conference (DCC’12). IEEE, 129--138.Google ScholarDigital Library
- Mohammad Ashraful Anam, Paul N. Whatmough, and Yiannis Andreopoulos. 2013. Precision-energy-throughput scaling of generic matrix multiplication and discrete convolution kernels via linear projections. In IEEE 11th Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia’13). IEEE, 21--30.Google ScholarCross Ref
- Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, and Saman Amarasinghe. 2009. PetaBricks: A Language and Compiler for Algorithmic Choice. Vol. 44. ACM.Google Scholar
- Jason Ansel, Yee Lok Wong, Cy Chan, Marek Olszewski, Alan Edelman, and Saman Amarasinghe. 2011. Language and compiler support for auto-tuning variable-accuracy algorithms. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization. IEEE Computer Society, 85--96.Google ScholarDigital Library
- Woongki Baek and Trishul M. Chilimbi. 2010. Green: A framework for supporting energy-conscious programming using controlled approximation. In ACM SIGPLAN Notices, Vol. 45. ACM, 198--209.Google Scholar
- Arnab Banerjee, Pascal T. Wolkotte, Robert D. Mullins, Simon W. Moore, and Gerard J. M. Smit. 2009. An energy and performance exploration of network-on-chip architectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 17, 3, 319--329.Google ScholarDigital Library
- Carl J. Beckmann and Constantine D. Polychronopoulos. 1990. Fast barrier synchronization hardware. In Proceedings of the 1990 ACM/IEEE Conference on Supercomputing (Supercomputing’90). IEEE Computer Society, Washington, DC, USA, 180--189.Google Scholar
- K. Bergman and others. 2008. Exascale computing study: Technology challenges in achieving exascale systems. Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep 15 (2008).Google Scholar
- Tekin Bicer, Jian Yin, Dereck Chiu, Gagan Agrawal, and Karen Schuchardt. 2013. Integrating online compression to accelerate large-scale data analytics applications. In IEEE 27th International Symposium on Parallel 8 Distributed Processing (IPDPS’13). IEEE, 1205--1216.Google ScholarDigital Library
- Mark Buckler, Wayne Burleson, and Greg Sadowski. 2013. Low-power networks-on-chip: Progress and remaining challenges. In 2013 IEEE International Symposium on Low Power Electronics and Design (ISLPED’13). IEEE, 132--134.Google ScholarCross Ref
- Huy Bui, Hal Finkel, Venkatram Vishwanath, Salman Habib, Katrin Heitmann, Jason Leigh, Michael Papka, and Kevin Harms. 2014. Scalable parallel I/O on a blue gene/Q supercomputer using compression, topology-aware data aggregation, and subfiling. In 22nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP’14). IEEE, 107--111.Google ScholarDigital Library
- Surendra Byna, Jiayuan Meng, Anand Raghunathan, Srimat Chakradhar, and Srihari Cadambi. 2010. Best-effort semantic document search on GPUs. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units. ACM, 86--93.Google ScholarDigital Library
- Brad Calder, Glenn Reinman, and Dean M. Tullsen. 1999. Selective value prediction. In Proceedings of the 26th International Symposium on Computer Architecture. IEEE, 64--74.Google Scholar
- Ramon Canal, Antonio González, and James E. Smith. 2000. Very low power pipelines using significance compression. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture. ACM, 181--190.Google Scholar
- Vito Cappellini. 1985. Data Compression and Error Control Techniques with Applications. Academic Press, Inc., Cambridge, MA.Google Scholar
- Michael Carbin, Sasa Misailovic, and Martin C. Rinard. 2013. Verifying quantitative reliability for programs that execute on unreliable hardware. In ACM SIGPLAN Notices, Vol. 48. ACM, 33--52.Google Scholar
- Srimat T. Chakradhar and Anand Raghunathan. 2010. Best-effort computing: Re-thinking parallel software and hardware. In 47th ACM/IEEE Design Automation Conference (DAC’10). IEEE, 865--870.Google Scholar
- Jie Chen and W. Watson. 2008. Software barrier performance on dual quad-core opterons. International Conference on Networking, Architecture, and Storage, 2008 (NAS’08). 303--309.Google ScholarDigital Library
- Yen-Kuang Chen, Jatin Chhugani, Pradeep Dubey, Christopher J. Hughes, Daehyun Kim, Sanjeev Kumar, Victor W. Lee, Anthony D. Nguyen, and Mikhail Smelyanskiy. 2008. Convergence of recognition, mining, and synthesis workloads and its implications. Proceedings of IEEE 96, 5, 790--807.Google ScholarCross Ref
- Vinay K. Chippa, Hrishikesh Jayakumar, Debabrata Mohapatra, Kaushik Roy, and Anand Raghunathan. 2013. Energy-efficient recognition and mining processor using scalable effort design. In IEEE Custom Integrated Circuits Conference (CICC’13). IEEE, 1--4.Google ScholarCross Ref
- Vinay Kumar Chippa, Debabrata Mohapatra, Kaushik Roy, Srimat T. Chakradhar, and Anand Raghunathan. 2014. Scalable effort hardware design. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 22, 9, 2004--2016.Google ScholarCross Ref
- Vinay K. Chippa, Swagath Venkataramani, Srimat T. Chakradhar, Kaushik Roy, and Anand Raghunathan. 2013. Approximate computing: An integrated hardware approach. In Asilomar Conference on Signals, Systems and Computers. IEEE, 111--117.Google ScholarCross Ref
- Vinay K. Chippa, Swagath Venkataramani, Kaushik Roy, and Anand Raghunathan. 2014. StoRM: A stochastic recognition and mining processor. In Proceedings of the 2014 International Symposium on Low Power Electronics and Design. ACM, 39--44.Google ScholarDigital Library
- Kyungsang Cho, Yongjun Lee, Young H. Oh, Gyoo-cheol Hwang, and Jae W. Lee. 2014. eDRAM-based tiered-reliability memory with applications to low-power frame buffers. In IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED’14). IEEE, 333--338.Google Scholar
- Marcelo Cintra and Josep Torrellas. 2002. Eliminating squashes through learning cross-thread violations in speculative parallelization for multiprocessors. In Proceedings of the 8th International Symposium on High-Performance Computer Architecture. IEEE, 43--54.Google ScholarDigital Library
- R. J. Cintra. 2011. An integer approximation method for discrete sinusoidal transforms. Circuits, Systems, and Signal Processing 30, 6, 1481--1501.Google ScholarDigital Library
- Renato J. Cintra and Fábio M. Bayer. 2011. A DCT approximation for image compression. IEEE Signal Processing Letters 18, 10, 579--582.Google ScholarCross Ref
- Daniel Citron and Larry Rudolph. 1995. Creating a wider bus using caching techniques. In Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture. IEEE, 90--99.Google ScholarDigital Library
- Paul Coteus, H. Randall Bickford, Thomas M. Cipolla, Paul Crumley, Alan Gara, Shawn Hall, Gerard V. Kopcsay, Alphonso P. Lanzetta, Lawrence S. Mok, Rick A. Rand, Richard A. Swetz, Todd Takken, Paul La Rocca, Christopher Marroquin, Philip R. Germann, and Mark J. Jeanson. 2005. Packaging the blue gene/L supercomputer. IBM Journal of Research and Development 49, 2--3, 213--248.Google ScholarDigital Library
- David E. Culler, Jaswinder Pal Singh, and Anoop Gupta. 1999. Parallel Computer Architecture: A Hardware/Software Approach. Gulf Professional Publishing, Houston, TX.Google Scholar
- William J. Dally and Brian Towles. 2001. Route packets, not wires: On-chip interconnection networks. In Proceedings of the Design Automation Conference. IEEE, 684--689.Google Scholar
- Reetuparna Das, Asit K. Mishra, Chrysostomos Nicopoulos, Dongkook Park, Vijaykrishnan Narayanan, Ravishankar Iyer, Mazin S. Yousif, and Chita R. Das. 2008. Performance and power optimization through data compression in network-on-chip architectures. In IEEE 14th International Symposium on High Performance Computer Architecture (HPCA’08). IEEE, 215--225.Google Scholar
- Marc De Kruijf, Shuou Nomura, and Karthikeyan Sankaralingam. 2010. Relax: An architectural framework for software recovery of hardware faults. In ACM SIGARCH Computer Architecture News, Vol. 38. ACM, 497--508.Google ScholarDigital Library
- Li Deng and Douglas O’Shaughnessy. 2003. Speech Processing: A Dynamic and Optimization-Oriented Approach. CRC Press, Boca Raton, FL.Google ScholarCross Ref
- J. Dongarra, P. Luszczek, and A. Petitet. 2003. The LINPACK benchmark: Past, present, and future. Concurrency and Computation: Practice and Experience 15, 9, 803--820.Google ScholarCross Ref
- Zidong Du, Avinash Lingamneni, Yunji Chen, Krishna Palem, Olivier Temam, and Chengyong Wu. 2014. Leveraging the error resilience of machine-learning applications for designing highly energy efficient accelerators. In 19th Asia and South Pacific Design Automation Conference (ASP-DAC’14). IEEE, 201--206.Google Scholar
- Peter Düben, Jeremy Schlachter, Sreelatha Yenugula, John Augustine, Christian Enz, K. Palem, T. N. Palmer, and others. 2015. Opportunities for energy efficient computing: A study of inexact general purpose processors for high-performance and big-data applications. In Proceedings of the 2015 Design, Automation 8 Test in Europe Conference 8 Exhibition. EDA Consortium, 764--769.Google ScholarCross Ref
- Pradeep Dubey. 2005. Recognition, mining and synthesis moves computers to the era of Tera. Technology@ Intel Magazine 9, 2, 1--10.Google Scholar
- Peter Elias. 1955. Predictive coding--I. IRE Transactions on Information Theory 1, 1, 16--24.Google ScholarCross Ref
- Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In International Symposium on Computer Architecture.Google ScholarDigital Library
- Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. 2012. Architecture support for disciplined approximate programming. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems.Google ScholarDigital Library
- Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. 2012. Neural acceleration for general-purpose approximate programs. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 449--460.Google ScholarDigital Library
- Marius Evers, Po-Yung Chang, and Yale N. Patt. 1996. Using hybrid branch predictors to improve branch prediction accuracy in the presence of context switches. In ACM SIGARCH Computer Architecture News, Vol. 24. ACM, 3--11.Google Scholar
- Yuntan Fang, Huawei Li, and Xiaowei Li. 2012. SoftPCM: Enhancing energy efficiency and lifetime of phase change memory in video applications via approximate write. In IEEE 21st Asian Test Symposium (ATS’12). IEEE, 131--136.Google ScholarDigital Library
- Eric Freudenthal and Olivier Peze. 1988. Efficient Synchronization Algorithms Using Fetch-and-Add on Multiple Bitfield Integers. Ultracomputer Note 148.Google Scholar
- Shrikanth Ganapathy, Georgios Karakonstantis, Adam Teman, and Andreas Burg. 2015. Mitigating the impact of faults in unreliable memories for error-resilient applications. In Proceedings of the 52nd Annual Design Automation Conference. ACM, 102.Google ScholarDigital Library
- Bart Goeman, Hans Vandierendonck, and Koen De Bosschere. 2001. Differential FCM: Increasing value prediction accuracy by improving table usage efficiency. In 7th International Symposium on High-Performance Computer Architecture (HPCA’01). IEEE, 207--216.Google ScholarCross Ref
- Inigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D. Nguyen. 2015. Approxhadoop: Bringing approximations to mapreduce frameworks. In ACM SIGARCH Computer Architecture News, Vol. 43. ACM, 383--397.Google Scholar
- Jill R. Goldschneider. 1997. Lossy Compression of Scientific Data Via Wavelets and Vector Quantization. Ph.D. thesis, University of Washington, Seattle, WA. https://digital.lib.washington.edu/researchworks/handle/1773/5881?show=full.Google Scholar
- Beayna Grigorian and Glenn Reinman. 2015. Accelerating divergent applications on SIMD architectures using neural networks. ACM Transactions on Architecture and Code Optimization 12, 1, 2.Google ScholarDigital Library
- Vaibhav Gupta, Debabrata Mohapatra, Sang Phill Park, Anand Raghunathan, and Kaushik Roy. 2011. IMPACT: Imprecise adders for low-power approximate computing. In Proceedings of the 17th IEEE/ACM International Symposium on Low-power Electronics and Design. IEEE Press, 409--414.Google ScholarDigital Library
- Erik G. Hallnor and Steven K. Reinhardt. 2004. A compressed memory hierarchy using an indirect index cache. In Proceedings of the 3rd Workshop on Memory Performance Issues: In Conjunction with the 31st International Symposium on Computer Architecture. ACM, 9--15.Google Scholar
- Maurice Herlihy, J. Eliot, and B. Moss. 1993. Transactional Memory: Architectural Support for Lock-Free Data Structures. Vol. 21. ACM.Google ScholarDigital Library
- T. Hoefler, T. Mehlan, F. Mietke, and W. Rehm. 2004. A survey of barrier algorithms for coarse grained supercomputers. Chemnitzer Informatik Berichte 4, 3 (2004).Google Scholar
- Henry Hoffmann, Sasa Misailovic, Stelios Sidiroglou, Anant Agarwal, and Martin Rinard. 2009. Using code perforation to improve performance, reduce energy consumption, and respond to failures. Technical Report MIT-CSAIL-TR-2209-037, EECS, MIT.Google Scholar
- Henry Hoffmann, Stelios Sidiroglou, Michael Carbin, Sasa Misailovic, Anant Agarwal, and Martin Rinard. 2011. Dynamic knobs for responsive power-aware computing. In ACM SIGPLAN Notices, Vol. 46. ACM, 199--212.Google ScholarDigital Library
- Chih-Chieh Hsiao, Slo-Li Chu, and Chen-Yu Chen. 2013. Energy-aware hybrid precision selection framework for mobile GPUs. Computers 8 Graphics 37, 5, 431--444.Google Scholar
- Jiawei Huang, John Lach, and Gabriel Robins. 2012. A methodology for energy-quality tradeoff using imprecise hardware. In Proceedings of the 49th Annual Design Automation Conference. ACM, 504--509.Google ScholarDigital Library
- Jeremy Iverson, Chandrika Kamath, and George Karypis. 2012. Fast and effective lossy compression algorithms for scientific datasets. In Euro-Par 2012 Parallel Processing. Springer, 843--856.Google Scholar
- Yuho Jin, Ki Hwan Yum, and Eun Jung Kim. 2008. Adaptive data compression for high-performance low-power on-chip networks. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 354--363.Google Scholar
- Andrew B. Kahng and Seokhyeong Kang. 2012. Accuracy-configurable adder for approximate arithmetic designs. In Proceedings of the 49th Annual Design Automation Conference. ACM, 820--825.Google Scholar
- Georgios Karakonstantis, Debabrata Mohapatra, and Kaushik Roy. 2012. Logic and memory design based on unequal error protection for voltage-scalable, robust and adaptive DSP systems. Journal of Signal Processing Systems 68, 3, 415--431.Google ScholarDigital Library
- Georgios Keramidas, Chrysa Kokkala, and Iakovos Stamoulis. 2015. Clumsy value cache: An approximate memoization technique for mobile GPU fragment shaders. In Workshop on Approximate Computing (WAPCO’15).Google Scholar
- Daya Shanker Khudia and Scott Mahlke. 2014. Harnessing soft computations for low-budget fault tolerance. In 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’14). IEEE, 319--330.Google ScholarDigital Library
- Hyungjun Kim, Pritha Ghoshal, Boris Grot, Paul V. Gratz, and Daniel A. Jiménez. 2011. Reducing network-on-chip energy consumption through spatial locality speculation. In Proceedings of the 5th ACM/IEEE International Symposium on Networks-on-Chip. ACM, 233--240.Google Scholar
- Chandra Krintz and Sezgin Sucu. 2006. Adaptive on-the-fly compression. IEEE Transactions on Parallel and Distributed Systems 17, 1, 15--24.Google ScholarDigital Library
- Parag Kulkarni, Puneet Gupta, and Milos Ercegovac. 2011. Trading accuracy for power with an underdesigned multiplier architecture. In 24th International Conference on VLSI Design (VLSI Design’11). IEEE, 346--351.Google ScholarDigital Library
- Didier Le Gall. 1991. MPEG: A video compression standard for multimedia applications. Communications of the ACM 34, 4, 46--58.Google ScholarDigital Library
- Jae Bum Lee and Chu Shik Jhon. 1998. Reducing coherence overhead of barrier synchronization in software DSMs. In Supercomputing’98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing (CDROM). IEEE Computer Society, Washington, DC, USA, 1--18.Google Scholar
- Jang-Soo Lee, Won-Kee Hong, and Shin-Dug Kim. 1999. Design and evaluation of a selective compressed memory system. In International Conference on Computer Design (ICCD’99). IEEE, 184--191.Google Scholar
- Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo. 2006. Low-power network-on-chip for high-performance SoC design. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 14, 2, 148--160.Google ScholarDigital Library
- Moon-Sang Lee, Young-Jae Kang, Joon-Won Lee, and Seung-Ryoul Maeng. 2002. OPTS: Increasing branch prediction accuracy under context switch. Microprocessors and Microsystems 26, 6, 291--300.Google ScholarCross Ref
- Sungju Lee, Heegon Kim, Yongwha Chung, and Daihee Park. 2012. Energy efficient image/video data transmission on commercial multi-core processors. Sensors 12, 11, 14647--14670.Google ScholarCross Ref
- Sangpil Lee, Keunsoo Kim, Gunjae Koo, Hyeran Jeon, Won Woo Ro, and Murali Annavaram. 2015. Warped-compression: Enabling power efficient GPUs through register compression. In Proceedings of the 42nd Annual International Symposium on Computer Architecture. ACM, 502--514.Google ScholarDigital Library
- Larkhoon Leem, Hyungmin Cho, Jason Bau, Quinn A. Jacobson, and Subhasish Mitra. 2010. ERSA: Error resilient system architecture for probabilistic applications. In Design, Automation 8 Test in Europe Conference 8 Exhibition (DATE’10). IEEE, 1560--1565.Google Scholar
- Debra A. Lelewer and Daniel S. Hirschberg. 1987. Data compression. ACM Computing Surveys 19, 3, 261--296.Google ScholarDigital Library
- Krisda Lengwehasatit and Antonio Ortega. 2004. Scalable variable complexity approximate forward DCT. IEEE Transactions on Circuits and Systems for Video Technology 14, 11, 1236--1248.Google ScholarDigital Library
- Mikko H. Lipasti and John Paul Shen. 1996. Exceeding the dataflow limit via value prediction. In Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, 226--237.Google Scholar
- Mikko H. Lipasti, Christopher B. Wilkerson, and John Paul Shen. 1996. Value locality and load value prediction. ACM SIGOPS Operating Systems Review 30, 5, 138--147.Google ScholarDigital Library
- Shaoshan Liu, Christine Eisenbeis, and Jean-Luc Gaudiot. 2010. A theoretical framework for value prediction in parallel systems. In 39th International Conference on Parallel Processing (ICPP’10). IEEE, 11--20.Google ScholarDigital Library
- Song Liu, Karthik Pattabiraman, Thomas Moscibroda, and Benjamin G. Zorn. 2009. Flicker: Saving refresh-power in mobile devices through critical data partitioning. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’09).Google Scholar
- Gabriel H. Loh, Nuwan Jayasena, M. Oskin, Mark Nutter, David Roberts, Mitesh Meswani, Dong Ping Zhang, and Mike Ignatowski. 2013. A processing in memory taxonomy and a case for studying fixed-function pim. In Workshop on Near-Data Processing (WoNDP’13).Google Scholar
- Enrico Magli and Gabriella Olmo. 2003. Lossy predictive coding of SAR raw data. IEEE Transactions on Geoscience and Remote Sensing 41, 5, 977--987.Google ScholarCross Ref
- Milo M. K. Martin, Daniel J. Sorin, Harold W. Cain, Mark D. Hill, and Mikko H. Lipasti. 2001. Correctly implementing value prediction in microprocessors that support multithreading or multiprocessing. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, 328--337.Google Scholar
- John M. Mellor-Crummey and Michael L. Scott. 1991. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems 9, 1, 21--65. DOI:http://dx.doi.org/10.1145/103727.103729Google ScholarDigital Library
- Jiayuan Meng, Srimat Chakradhar, and Anand Raghunathan. 2009. Best-effort parallel execution framework for recognition and mining applications. In IEEE International Symposium on Parallel 8 Distributed Processing (IPDPS’09). IEEE, 1--12.Google Scholar
- Jiayuan Mengte, Anand Raghunathan, Srimat Chakradhar, and Surendra Byna. 2010. Exploiting the forgiving nature of applications for scalable parallel execution. IEEE International Symposium on Parallel 8 Distributed Processing (IPDPS). IEEE.Google ScholarCross Ref
- Joshua San Miguel, Mario Badr, and Natalie Enright Jerger. 2014. Load value approximation. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 127--139.Google ScholarDigital Library
- Sasa Misailovic, Michael Carbin, Sara Achour, Zichao Qi, and Martin C. Rinard. 2014. Chisel: Reliability-and accuracy-aware optimization of approximate computational kernels. In ACM SIGPLAN Notices, Vol. 49. ACM, 309--328.Google Scholar
- Sasa Misailovic, Deokhwan Kim, and Martin Rinard. 2013. Parallelizing sequential programs with statistical accuracy tests. ACM Transactions on Embedded Computing Systems 12, 2s, 88.Google ScholarDigital Library
- Sasa Misailovic, Stelios Sidiroglou, Henry Hoffmann, and Martin Rinard. 2010. Quality of service profiling. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 1. ACM, 25--34.Google ScholarDigital Library
- Sasa Misailovic, Stelios Sidiroglou, and Martin C. Rinard. 2012. Dancing with uncertainty. In Proceedings of the 2012 ACM Workshop on Relaxing Synchronization for Multicore and Manycore Scalability. ACM, 51--60.Google Scholar
- Sparsh Mittal. 2016. A survey of techniques for approximate computing. ACM Computing Surveys 48, 4, 62.Google ScholarDigital Library
- Debabrata Mohapatra, Vinay K. Chippa, Anand Raghunathan, and Kaushik Roy. 2011. Design of voltage-scalable meta-functions for approximate computing. In Design, Automation 8 Test in Europe Conference 8 Exhibition (DATE’11). IEEE, 1--6.Google Scholar
- Debabrata Mohapatra, Georgios Karakonstantis, and Kaushik Roy. 2009. Significance driven computation: A voltage-scalable, variation-aware, quality-tuning motion estimator. In Proceedings of the 2009 ACM/IEEE International Symposium on Low Power Electronics and Design. ACM, 195--200.Google ScholarDigital Library
- Thierry Moreau, Mark Wyse, Jacob Nelson, Adrian Sampson, Hadi Esmaeilzadeh, Luis Ceze, and Mark Oskin. 2015. SNNAP: Approximate computing on programmable socs via neural acceleration. In IEEE 21st International Symposium on High Performance Computer Architecture (HPCA’15). IEEE, 603--614.Google ScholarCross Ref
- Michel Mouly, Marie-Bernadette Pautet, and Thomas Foreword By-Haug. 1992. The GSM System for Mobile Communications. Telecom Publishing.Google Scholar
- Tarun Nakra, Rajiv Gupta, and Mary Lou Soffa. 1999. Global context-based value prediction. In Proceedings of the 5th International Symposium on High-Performance Computer Architecture. IEEE, 4--12.Google ScholarCross Ref
- Sriram Narayanan, John Sartori, Rakesh Kumar, and Douglas L. Jones. 2010. Scalable stochastic processors. In Proceedings of the Conference on Design, Automation and Test in Europe. European Design and Automation Association, 335--338.Google Scholar
- D. Nikolopoulos and T. Papatheodorou. 2000. Fast synchronization on scalable cache-coherent multiprocessors using hybrid primitives. In Proceedings of the 14th International Symposium on Parallel and Distributed Processing (IPDPS’00). IEEE Computer Society, Washington, DC, USA, 711.Google Scholar
- Peter Noll. 1997. MPEG digital audio coding. IEEE Signal Processing Magazine 14, 5, 59--81.Google ScholarCross Ref
- NVIDIA. 2014. NVIDIA GTX 980 Whitepaper. Retrieved November 29, 2017 from https://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF.Google Scholar
- Simon Ogg and Bashir Al-Hashimi. 2006. Improved data compression for serial interconnected network on chip through unused significant bit removal. In 19th International Conference on VLSI Design. Held jointly with 5th International Conference on Embedded Systems and Design. IEEE, 5 pp.Google ScholarDigital Library
- Soontorn Oraintara, Ying-Jui Chen, and Truong Q. Nguyen. 2002. Integer fast Fourier transform. IEEE Transactions on Signal Processing 50, 3, 607--618.Google ScholarDigital Library
- David J. Palframan, Nam Sung Kim, and Mikko H. Lipasti. 2014. Precision-aware soft error protection for GPUs. In IEEE 20th International Symposium on High Performance Computer Architecture (HPCA’14). IEEE, 49--59.Google Scholar
- J. Thomas Pawlowski. 2011. Hybrid memory cube (HMC). In Hot Chips, Vol. 23.Google ScholarCross Ref
- Gennady Pekhimenko, Evgeny Bolotin, Mike O’Connor, Onur Mutlu, Todd C. Mowry, and Steve Keckler. 2015. Toggle-aware compression for GPUs. In IEEE Computer Architecture Letters. 14, 2 (2015), 164--168. DOI:10.1109/LCA.2015.2430853Google ScholarDigital Library
- Gennady Pekhimenko, Evgeny Bolotin, Nandita Vijaykumar, Onur Mutlu, Todd C. Mowry, and Stephen W. Keckler. 2016. A case for toggle-aware compression for GPU systems. In IEEE International Symposium on High Performance Computer Architecture (HPCA’16). IEEE, 188--200.Google Scholar
- Arthur Perais and André Seznec. 2014. EOLE: Paving the way for an effective implementation of value prediction. In ACM SIGARCH Computer Architecture News, Vol. 42. IEEE Press, 481--492.Google ScholarDigital Library
- Arthur Perais and André Seznec. 2014. Practical data value speculation for future high-end processors. In IEEE 20th International Symposium on High Performance Computer Architecture (HPCA’14). IEEE, 428--439.Google ScholarCross Ref
- Arthur Perais and André Seznec. 2015. BeBoP: A cost effective predictor infrastructure for superscalar value prediction. In IEEE 21st International Symposium on High Performance Computer Architecture (HPCA’15). IEEE, 13--25.Google ScholarCross Ref
- Calton Pu and Lenin Singaravelu. 2005. Fine-grain adaptive compression in dynamically variable networks. In Proceedings of the 25th IEEE International Conference on Distributed Computing Systems (ICDCS’05). IEEE, 685--694.Google ScholarDigital Library
- Abbas Rahimi, Amirali Ghofrani, Kwang-Ting Cheng, Luca Benini, and Rajesh K. Gupta. 2015. Approximate associative memristive memory for energy-efficient GPUs. In Proceedings of the 2015 Design, Automation 8 Test in Europe Conference 8 Exhibition. EDA Consortium, 1497--1502.Google Scholar
- Vara Ramakrishnan and Isaac D. Scherson. 1999. Efficient techniques for nested and disjoint barrier synchronization. Journal of Parallel and Distributed Computing 58, 2, 333--356. DOI:http://dx.doi.org/10.1006/jpdc.1999.1556Google ScholarDigital Library
- Easwaran Raman, Ram Rangan, David I. August, and others. 2008. Spice: Speculative parallel iteration chunk execution. In Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, 175--184.Google ScholarDigital Library
- Ashish Ranjan, Swagath Venkataramani, Xuanyao Fong, Kaushik Roy, and Anand Raghunathan. 2015. Approximate storage for energy efficient spintronic memories. In Proceedings of the 52nd Annual Design Automation Conference. ACM, 195.Google ScholarDigital Library
- Lakshminarayanan Renganarayana, Vijayalakshmi Srinivasan, Ravi Nair, and Daniel Prener. 2012. Programming with relaxed synchronization. In Proceedings of the 2012 ACM Workshop on Relaxing Synchronization for Multicore and Manycore Scalability. ACM, 41--50.Google ScholarDigital Library
- Martin Rinard. 2013. Parallel synchronization-free approximate data structure construction. In HotPar.Google Scholar
- Martin C. Rinard. 2012. Unsynchronized techniques for approximate parallel computing. In RACES Workshop.Google Scholar
- Antonio Roldao-Lopes, Amir Shahzad, George A. Constantinides, and Eric C. Kerrigan. 2009. More flops or more precision? Accuracy parameterizable linear equation solvers for model predictive control. In 17th IEEE Symposium on Field Programmable Custom Computing Machines (FCCM’09). IEEE, 209--216.Google Scholar
- Cindy Rubio-González, Cuong Nguyen, Hong Diep Nguyen, James Demmel, William Kahan, Koushik Sen, David H. Bailey, Costin Iancu, and David Hough. 2013. Precimonious: Tuning assistant for floating-point precision. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 27.Google ScholarDigital Library
- David Salomon. 2004. Data Compression: The Complete Reference. Springer Science 8 Business Media, New York, NY.Google ScholarDigital Library
- Mehrzad Samadi, Davoud Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke. 2014. Paraprox: Pattern-based approximation for data parallel applications. In ACM SIGARCH Computer Architecture News, Vol. 42. ACM, 35--50.Google Scholar
- Mehrzad Samadi, Janghaeng Lee, D. Anoushe Jamshidi, Amir Hormati, and Scott Mahlke. 2013. Sage: Self-tuning approximation for graphics engines. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 13--24.Google ScholarDigital Library
- Adrian Sampson, Werner Dietl, Emily Fortuna, Danushen Gnanapragasam, Luis Ceze, and Dan Grossman. 2011. EnerJ: Approximate data types for safe and general low-power computation. In ACM SIGPLAN Notices, Vol. 46. ACM, 164--174.Google ScholarDigital Library
- Adrian Sampson, Jacob Nelson, Karin Strauss, and Luis Ceze. 2014. Approximate storage in solid-state memories. ACM Transactions on Computer Systems 32, 3, 9.Google ScholarDigital Library
- Jack Sampson, Ruben Gonzalez, Jean-Francois Collard, Norman P. Jouppi, Mike Schlansker, and Brad Calder. 2006. Exploiting fine-grained data parallelism with chip multiprocessors and fast barriers. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, Washington, DC, USA, 235--246. DOI:http://dx.doi.org/10.1109/MICRO.2006.23Google ScholarDigital Library
- Joshua San Miguel and N. Enright Jerger. 2014. Load value approximation: Approaching the ideal memory access latency. In Workshop on Approximate Computing Across the System Stack.Google Scholar
- J. Sartori and R. Kumar. 2010. Low-overhead, high-speed multi-core barrier synchronization. In HiPEAC. 18--34.Google Scholar
- John Sartori and Rakesh Kumar. 2013. Branch and data herding: Reducing control and memory divergence for error-tolerant GPU applications. IEEE Transactions on Multimedia 15, 2, 279--290.Google ScholarDigital Library
- Vijay Sathish, Michael J. Schulte, and Nam Sung Kim. 2012. Lossless and lossy memory I/O link compression for improving performance of GPGPU workloads. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. ACM, 325--334.Google ScholarDigital Library
- Yiannakis Sazeides and James E. Smith. 1997. The predictability of data values. In Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 248--258.Google Scholar
- Steven L. Scott. 1996. Synchronization and communication in the T3E multiprocessor. SIGOPS Operating Systems Review 30, 5, 26--36. DOI:http://dx.doi.org/10.1145/248208.237144Google ScholarDigital Library
- Marko Scrbak, Mahzabeen Islam, Krishna M. Kavi, Mike Ignatowski, and Nuwan Jayasena. 2015. Processing-in-memory: Exploring the design space. In Architecture of Computing Systems (ARCS’15). Springer, 43--54.Google Scholar
- André Seznec. 2011. A new case for the TAGE branch predictor. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 117--127.Google ScholarDigital Library
- Ali Shafiee, Meysam Taassori, Rajeev Balasubramonian, and A. K. Davis. 2014. MemZip: Exploring unconventional benefits from memory compression. In IEEE 20th International Symposium on High Performance Computer Architecture (HPCA’14). IEEE, 638--649.Google Scholar
- Li Shang, Li-Shiuan Peh, and Niraj K. Jha. 2003. Dynamic voltage scaling with links for power optimization of interconnection networks. In Proceedings of the 9th International Symposium on High-Performance Computer Architecture (HPCA’03). IEEE, 91--102.Google Scholar
- Shisheng Shang and Kai Hwang. 1995. Distributed hardwired barrier synchronization for scalable multiprocessor clusters. IEEE Transactions on Parallel Distributed Systems 6, 6, 591--605. DOI:http://dx.doi.org/10.1109/71.388040Google ScholarDigital Library
- Majid Shoushtari, Abbas BanaiyanMofrad, and Nikil Dutt. 2015. Exploiting partially-forgetful memories for approximate computing. IEEE Embedded Systems Letters 7, 1, 19--22.Google ScholarDigital Library
- Stelios Sidiroglou-Douskos, Sasa Misailovic, Henry Hoffmann, and Martin Rinard. 2011. Managing performance vs. accuracy trade-offs with loop perforation. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. ACM, 124--134.Google ScholarDigital Library
- María Soler and José Flich. 2013. Power saving by NoC traffic compression. In European Conference on Parallel Processing. Springer, 465--476.Google Scholar
- Renée St. Amant, Amir Yazdanbakhsh, Jongse Park, Bradley Thwaites, Hadi Esmaeilzadeh, Arjang Hassibi, Luis Ceze, and Doug Burger. 2014. General-purpose code acceleration with limited-precision analog computation. ACM SIGARCH Computer Architecture News 42, 3, 505--516.Google ScholarDigital Library
- J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry. 2002. Improving value communication for thread-level speculation. In Proceedings of the 8th International Symposium on High-Performance Computer Architecture. IEEE, 65--75.Google Scholar
- Ayswarya Sundaram, Ameen Aakel, Derek Lockhart, Darshan Thaker, and Diana Franklin. 2008. Efficient fault tolerance in multi-media applications through selective instruction replication. In Proceedings of the 2008 Workshop on Radiation Effects and Fault Tolerance in Nanometer Technologies. ACM, 339--346.Google ScholarDigital Library
- Mark Sutherland, Joshua San Miguel, and Natalie Enright Jerger. 2015. Texture cache approximation on GPUs. In Workshop on Approximate Computing Across the Stack.Google Scholar
- M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference.Google Scholar
- Bradley Thwaites, Gennady Pekhimenko, Hadi Esmaeilzadeh, Amir Yazdanbakhsh, Onur Mutlu, Jongse Park, Girish Mururu, and Todd Mowry. 2014. Rollback-free value prediction with approximate loads. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation. ACM, 493--494.Google ScholarDigital Library
- Ye Tian, Qian Zhang, Ting Wang, Feng Yuan, and Qiang Xu. 2015. Approxma: Approximate memory access for dynamic precision scaling. In Proceedings of the 25th Edition on Great Lakes Symposium on VLSI. ACM, 337--342.Google ScholarDigital Library
- Vassilis Vassiliadis, Konstantinos Parasyris, Charalambos Chalios, Christos D. Antonopoulos, Spyros Lalis, Nikolaos Bellas, Hans Vandierendonck, and Dimitrios S. Nikolopoulos. 2015. A programming model and runtime system for significance-aware energy-efficient computing. In ACM SIGPLAN Notices, Vol. 50. ACM, 275--276.Google Scholar
- Swagath Venkataramani, Srimat T. Chakradhar, Kaushik Roy, and Anand Raghunathan. 2015. Approximate computing and the quest for computing efficiency. In Proceedings of the 52nd Annual Design Automation Conference. ACM, 120.Google ScholarDigital Library
- Swagath Venkataramani, Vinay K. Chippa, Srimat T. Chakradhar, Kaushik Roy, and Anand Raghunathan. 2013. Quality programmable vector processors for approximate computing. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 1--12.Google ScholarDigital Library
- Swagath Venkataramani, Ashish Ranjan, Kaushik Roy, and Anand Raghunathan. 2014. AxNN: Energy-efficient neuromorphic systems using approximate computing. In Proceedings of the 2014 International Symposium on Low Power Electronics and Design. ACM, 27--32.Google ScholarDigital Library
- Swagath Venkataramani, Kaushik Roy, and Anand Raghunathan. 2013. Substitute-and-simplify: A unified design paradigm for approximate and quality configurable circuits. In Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, 1367--1372.Google ScholarCross Ref
- Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir, Todd C. Mowry, and Onur Mutlu. 2015. A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps. In ACM SIGARCH Computer Architecture News, Vol. 43. ACM, 41--53.Google ScholarDigital Library
- Oreste Villa, Gianluca Palermo, and Cristina Silvano. 2008. Efficiency and scalability of barrier synchronization on NoC based many-core architectures. In CASES’08: Proceedings of the 2008 International Conference on Compilers, Architectures and Synthesis for Embedded Systems. ACM, New York, NY, USA, 81--90. DOI:http://dx.doi.org/10.1145/1450095.1450110Google ScholarDigital Library
- Gregory K. Wallace. 1992. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics 38, 1, xviii--xxxiv.Google ScholarDigital Library
- Kai Wang and Manoj Franklin. 1997. Highly accurate data value prediction using hybrid predictors. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, 281--290.Google ScholarCross Ref
- Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. 2003. Multiscale structural similarity for image quality assessment. In Conference Record of the 37th Asilomar Conference on Signals, Systems and Computers, Vol. 2. IEEE, 1398--1402.Google Scholar
- Terry A. Welch. 1984. A technique for high-performance data compression. Computer 6, 17, 8--19.Google ScholarDigital Library
- Benjamin Welton, Dries Kimpe, Jason Cope, Christina M. Patrick, Kamil Iskra, and Robert Ross. 2011. Improving I/O forwarding throughput with data compression. In IEEE International Conference on Cluster Computing (CLUSTER’11). IEEE, 438--445.Google ScholarDigital Library
- Yair Wiseman, Karsten Schwan, and Patrick Widener. 2005. Efficient end to end data exchange using configurable compression. ACM SIGOPS Operating Systems Review 39, 3, 4--23.Google ScholarDigital Library
- Qiang Xu, Todd Mytkowicz, and Nam Sung Kim. 2016. Approximate computing: A survey. IEEE Design 8 Test 33, 1, 8--22.Google Scholar
- Xin Xu and H. Howie Huang. 2015. Exploring data-level error tolerance in high-performance solid-state drives. IEEE Transactions on Reliability 64, 1, 15--30.Google ScholarCross Ref
- Amir Yazdanbakhsh, Gennady Pekhimenko, Bradley Thwaites, Hadi Esmaeilzadeh, Onur Mutlu, and Todd C. Mowry. 2016. RFVP: Rollback-free value prediction with safe-to-approximate loads. ACM Transactions on Architecture and Code Optimization 12, 4, 62.Google ScholarDigital Library
- Rong Ye, Ting Wang, Feng Yuan, Rakesh Kumar, and Qiang Xu. 2013. On reconfiguration-oriented approximate adder design and its application. In Proceedings of the International Conference on Computer-Aided Design. IEEE Press, 48--54.Google ScholarCross Ref
- Yavuz Yetim, Sharad Malik, and Margaret Martonosi. 2015. CommGuard: Mitigating communication errors in error-prone parallel execution. In ACM SIGPLAN Notices, Vol. 50. ACM, 311--323.Google ScholarDigital Library
- Yavuz Yetim, Margaret Martonosi, and Sharad Malik. 2013. Extracting useful computation from error-prone processors for streaming applications. In Design, Automation 8 Test in Europe Conference 8 Exhibition (DATE’13). IEEE, 202--207.Google Scholar
- Felix Zahn, Steffen Lammel, and Holger Fröning. 2017. Early experiences with saving energy in direct interconnection networks. In IEEE 3rd International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB’17). IEEE, 33--40.Google ScholarCross Ref
- Felix Zahn, Pedro Yebenes, Steffen Lammel, Pedro J. Garcia, and Holger Fröning. 2016. Analyzing the energy (dis-)proportionality of scalable interconnection networks. In 2nd IEEE International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB’16). IEEE, 25--32.Google ScholarCross Ref
- Hang Zhang, Mateja Putic, and John Lach. 2014. Low power gpgpu computation with imprecise hardware. In Proceedings of the 51st Annual Design Automation Conference. ACM, 1--6.Google ScholarDigital Library
- Qian Zhang, Ting Wang, Ye Tian, Feng Yuan, and Qiang Xu. 2015. ApproxANN: An approximate computing framework for artificial neural network. In Proceedings of the 2015 Design, Automation 8 Test in Europe Conference 8 Exhibition. EDA Consortium, 701--706.Google ScholarDigital Library
- Huiyang Zhou and Thomas M. Conte. 2005. Enhancing memory-level parallelism via recovery-free value prediction. IEEE Transactions on Computers 54, 7, 897--912.Google ScholarDigital Library
- Qiuling Zhu, Bilal Akin, H. Ekin Sumbul, Fazle Sadi, James C. Hoe, Larry Pileggi, and Franz Franchetti. 2013. A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. In IEEE International 3D Systems Integration Conference (3DIC’13). IEEE, 1--7.Google Scholar
- Weirong Zhu, Vugranam C. Sreedhar, Ziang Hu, and Guang R. Gao. 2007. Synchronization state buffer: Supporting efficient fine-grain synchronization on many-core architectures. In Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA’07). ACM, New York, NY, USA, 35--45. DOI:http://dx.doi.org/10.1145/1250662.1250668Google Scholar
Index Terms
- Approximate Communication: Techniques for Reducing Communication Bottlenecks in Large-Scale Parallel Systems
Recommendations
An online quality management framework for approximate communication in network-on-chips
ICS '19: Proceedings of the ACM International Conference on SupercomputingApproximate communication is being seriously considered as an effective technique for reducing power consumption and improving the communication efficiency of network-on-chips (NoCs). A major problem faced by these techniques is quality control: how do ...
Energy efficient 3D network-on-chip based on approximate communication
AbstractTechnology advancement and integration of many cores into a chip lead to high-performance parallel architectures in computing systems. Three-dimensional Network-on-Chips (3D NoCs) have been adopted as a promising architecture in the ...
Distributed Traffic Simulation and the Reduction of Inter-process Communication Using Traffic Flow Characteristics Transfer
UKSIM '08: Proceedings of the Tenth International Conference on Computer Modeling and SimulationIn this paper, an adaptation of the Java Urban Traffic Simulator (JUTS) for distributed computing environment is described. Because one of the main bottlenecks of any distributed application is the communication among its particular processes, the ...
Comments