
Darkroom: compiling high-level image processing code into hardware pipelines

Authors:
James Hegarty, Stanford University
John Brunhaver, Stanford University
Zachary DeVito, Stanford University
Jonathan Ragan-Kelley, MIT CSAIL
Noy Cohen, Stanford University
Steven Bell, Stanford University
Artem Vasilyev, Stanford University
Mark Horowitz, Stanford University
Pat Hanrahan, Stanford University

ACM Transactions on Graphics, Volume 33, Issue 4, Article No. 144, pp. 1–11. https://doi.org/10.1145/2601097.2601174

Published: 27 July 2014

Abstract

Specialized image signal processors (ISPs) exploit the structure of image processing pipelines to minimize memory bandwidth using the architectural pattern of line-buffering, in which all intermediate data between stages is stored in small on-chip buffers. This provides high energy efficiency, allowing long pipelines with tera-op/sec. image processing in battery-powered devices, but traditionally requires painstaking manual hardware design. Based on this pattern, we present Darkroom, a language and compiler for image processing. The semantics of the Darkroom language allow it to compile programs directly into line-buffered pipelines, with all intermediate values in local line-buffer storage, eliminating unnecessary communication with off-chip DRAM. We formulate the problem of optimally scheduling line-buffered pipelines to minimize buffering as an integer linear program. Finally, given an optimally scheduled pipeline, Darkroom synthesizes hardware descriptions for ASIC or FPGA, or fast CPU code. We evaluate Darkroom implementations of a range of applications, including a camera pipeline, low-level feature detection algorithms, and deblurring. For many applications, we demonstrate gigapixel/sec. performance in under 0.5 mm² of ASIC silicon at 250 mW (simulated on a 45 nm foundry process), real-time 1080p/60 video processing using a fraction of the resources of a modern FPGA, and tens of megapixels/sec. of throughput on a quad-core x86 processor.
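
The line-buffering pattern described in the abstract can be illustrated with a small software sketch. The Python below is not Darkroom syntax and not the compiler's output; it is a hypothetical example (the function name `line_buffered_box_sum` and the 3x3 box-sum stencil are assumptions chosen for illustration). It streams an image one row at a time and keeps only the three most recent rows, roughly how a line-buffered stage avoids holding the whole frame in memory.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List


def line_buffered_box_sum(rows: Iterable[List[int]], width: int) -> Iterator[List[int]]:
    """Stream a 3x3 box sum over an image delivered one row at a time.

    Only the three most recent rows are retained (a software stand-in for
    on-chip line buffers); the full image is never stored. Output covers
    the valid interior only, so it is 2 rows and 2 columns smaller.
    """
    window: Deque[List[int]] = deque(maxlen=3)  # oldest row is evicted automatically
    for row in rows:
        if len(row) != width:
            raise ValueError("every row must contain exactly `width` pixels")
        window.append(row)
        if len(window) < 3:
            continue  # buffer is still filling; no output can be produced yet
        # Vertical pass: sum each column across the three buffered rows.
        col = [window[0][x] + window[1][x] + window[2][x] for x in range(width)]
        # Horizontal pass: sum three neighbouring column sums per output pixel.
        yield [col[x - 1] + col[x] + col[x + 1] for x in range(1, width - 1)]


if __name__ == "__main__":
    image = [[1] * 4 for _ in range(4)]
    for out_row in line_buffered_box_sum(iter(image), width=4):
        print(out_row)
```

Storage in this sketch depends on the image width and the stencil height, not on the image height. This is the property that lets intermediate data stay in small local buffers rather than off-chip DRAM.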

Supplemental Material

a144-sidebyside.mp4 (mp4, 25.3 MB)
a144-hegarty.zip (zip, 374.1 MB): supplemental material


Published in

ACM Transactions on Graphics Volume 33, Issue 4
July 2014
1366 pages
ISSN:0730-0301
EISSN:1557-7368
DOI:10.1145/2601097

Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
FPGAs
domain-specific languages
hardware synthesis
image processing
video processing