ABSTRACT
As user demand scales for intelligent personal assistants (IPAs) such as Apple's Siri, Google's Google Now, and Microsoft's Cortana, we are approaching the computational limits of current datacenter architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an open-source IPA workload is an obstacle to addressing this question. In this paper, we present the design of Sirius, an open end-to-end IPA web-service application that accepts queries in the form of voice and images, and responds with natural language. We then use this workload to investigate the implications of four points in the design space of future accelerator-based server architectures: traditional CPUs, GPUs, manycore throughput co-processors, and FPGAs.
To investigate future server designs for Sirius, we decompose it into a suite of 7 benchmarks (Sirius Suite) comprising its computationally intensive bottlenecks. We port Sirius Suite to a spectrum of accelerator platforms and use the performance and power trade-offs across these platforms to perform a total cost of ownership (TCO) analysis of various server design points. In our study, we find that accelerators are critical for the future scalability of IPA services. Our results show that GPU- and FPGA-accelerated servers reduce query latency on average by 10x and 16x, respectively. For a given throughput, GPU- and FPGA-accelerated servers can reduce the TCO of datacenters by 2.6x and 1.4x, respectively.
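The phrase "for a given throughput" can be made concrete with a small sketch. The Python below is not the TCO model used in the paper; it is a deliberately simplified illustration that counts only server purchase price and energy, and it ignores facility, cooling, and maintenance costs. The names `ServerConfig`, `fleet_tco`, and `tco_reduction`, along with every number in the example, are hypothetical placeholders and are not figures from the study.

```python
import math
from dataclasses import dataclass


@dataclass
class ServerConfig:
    name: str
    capex_usd: float        # purchase price per server (hypothetical value)
    power_watts: float      # average power draw per server (hypothetical value)
    throughput_gain: float  # queries/sec relative to the CPU-only baseline


def fleet_tco(cfg: ServerConfig, baseline_servers: int,
              years: float = 3.0, usd_per_kwh: float = 0.07) -> float:
    """Cost of a fleet sized to match the baseline fleet's throughput.

    Simplified assumption: TCO = servers * (purchase price + energy cost).
    """
    # A faster server means fewer servers are needed for the same throughput.
    servers = math.ceil(baseline_servers / cfg.throughput_gain)
    hours = years * 365 * 24
    energy_cost = cfg.power_watts / 1000.0 * hours * usd_per_kwh
    return servers * (cfg.capex_usd + energy_cost)


def tco_reduction(baseline: ServerConfig, accelerated: ServerConfig,
                  baseline_servers: int = 1000) -> float:
    """Ratio of baseline fleet cost to accelerated fleet cost (higher is better)."""
    return (fleet_tco(baseline, baseline_servers)
            / fleet_tco(accelerated, baseline_servers))


# Made-up example: the accelerated server costs twice as much and draws
# twice the power, but serves four times the queries.
cpu = ServerConfig("cpu-only", capex_usd=2000.0, power_watts=200.0, throughput_gain=1.0)
acc = ServerConfig("accelerated", capex_usd=4000.0, power_watts=400.0, throughput_gain=4.0)
print(f"TCO reduction: {tco_reduction(cpu, acc):.2f}x")
```

In this made-up example a 4x throughput gain yields only a 2x TCO reduction, because each accelerated server costs and consumes twice as much. This is one way a platform with the larger speedup can end up with the smaller TCO benefit, although the sketch makes no claim about why the reported GPU and FPGA figures differ.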
Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers. ASPLOS '15.