The essence of the dataflow supercomputing paradigm
At the time the von Neumann computing paradigm was formulated, technology was such that the ratio of the latency of an arithmetic or logic (ALU) operation to the communication (COMM) delay to memory or to another processor, t(ALU)/t(COMM), was extremely large (sometimes argued to approach infinity).
In his famous lecture notes on computation, the Nobel laureate Richard Feynman observed that, in theory, ALU operations can be performed with zero energy, while communication can never reach zero energy, and that the speed and the energy of computing can be traded against each other. In practice, this means that future technologies will be characterized by an extremely large t(COMM)/t(ALU) (in theory, t(COMM)/t(ALU) approaching infinity), which is exactly the opposite of the situation in von Neumann's time. That is why a number of pioneers in dataflow supercomputing adopted the term Feynman paradigm for the approach utilized by Maxeler computing systems. Feynman never worked on dataflow computing, but his observations (along with his involvement in the Connection Machine design) made many believe in a great future for the dataflow computing paradigm.
Obviously, when computing technology is characterized by an extremely large t(COMM)/t(ALU), control flow machines of the multi-core type (like Intel) or the many-core type (like NVIDIA) can never be as fast as dataflow machines like Maxeler's, for one simple reason: the buses of a control flow machine will never be of zero length, while many edges of a dataflow execution graph can easily be made of (almost) zero length.
The main technology-related question now is: “What is the ratio of t(ALU) to t(COMM) today?”
According to the sources mentioned above, the energy needed for one IEEE double-precision floating-point multiplication will be only about 10 pJ around the year 2020, while the energy needed to move the result from one core to another in a multi-core or many-core machine is about 2000 pJ, a factor of 200×. On the other hand, moving the same result over an almost-zero-length edge in a Maxeler dataflow machine may, in some cases, take less than 2 pJ, a factor of 0.2×.
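The two factors follow directly from the quoted energy figures; a quick arithmetic check (using the 10 pJ, 2000 pJ, and 2 pJ ballpark values from the text, not measured data) can be sketched as:

```java
// Sanity check of the energy ratios quoted above.
// All values are in picojoules and are the ballpark figures
// given in the text, not measurements.
public class EnergyRatios {
    public static void main(String[] args) {
        double multiplyPJ = 10.0;     // IEEE double-precision multiply, ~2020
        double coreToCorePJ = 2000.0; // moving the result core-to-core
        double dataflowEdgePJ = 2.0;  // almost-zero-length dataflow edge

        System.out.println("core-to-core  / multiply = " + (coreToCorePJ / multiplyPJ));   // 200.0
        System.out.println("dataflow edge / multiply = " + (dataflowEdgePJ / multiplyPJ)); // 0.2
    }
}
```

In other words, on a control flow machine the communication costs 200 multiplications' worth of energy, whereas a short dataflow edge costs a fifth of one multiplication.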
Therefore, the time has come for technology to make the dataflow approach effective. Here, the term technology refers to the combination of: the hardware technology; the programming paradigm and its support technologies; the compiler technology; and the code analysis, development, and testing technology. More light is shed on these technologies in the next section.
The paradigm support technologies
As far as the hardware technology is concerned, a Maxeler dataflow machine is either a conventional computer extended with a Maxeler PCIe board (e.g., Galava, Isca, Maia, etc., all collectively referred to as DataFlow Engines, or DFEs), or a 1U box that can be stacked into a 40U rack, or into a train of 40U racks.
As far as the programming paradigm and its support technologies are concerned, Maxeler dataflow machines are programmed in space, rather than in time, as is the case with typical control flow machines. In the case of Maxeler (and the Maxeler Application Gallery project), the general framework is referred to as OpenSPL (Open Spatial Programming Language, http://www.OpenSPL.org), and its specific implementation, used for the development of AppGallery.Maxeler.com applications, is called MaxJ (shorthand for Maxeler Java).
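To give a flavour of programming in space: a standard introductory MaxCompiler example is a 3-point moving-average kernel, where the programmer describes the dataflow graph once and the compiler lays it out spatially, so that one result streams out per cycle. Since MaxJ code builds only under the Maxeler toolchain, the sketch below is plain Java that merely emulates, in time, the computation such a kernel describes in space; the structure and names here are illustrative, not the MaxCompiler API (in MaxJ the neighbouring elements would be obtained with stream offsets on a DFEVar stream rather than with array indexing).

```java
import java.util.Arrays;

// Plain-Java emulation of what a spatial "moving average" kernel computes.
// In MaxJ, the body of the loop below would be written once as a dataflow
// graph over a stream and instantiated in hardware; here we simply iterate
// over an array to show the computation itself.
public class MovingAverageSketch {
    // For each interior stream position, combine the previous, current,
    // and next elements; boundary elements are passed through unchanged.
    static double[] movingAverage3(double[] x) {
        double[] y = Arrays.copyOf(x, x.length);
        for (int i = 1; i + 1 < x.length; i++) {
            y[i] = (x[i - 1] + x[i] + x[i + 1]) / 3.0;
        }
        return y;
    }

    public static void main(String[] args) {
        double[] in = {1, 3, 2, 8, 6};
        System.out.println(Arrays.toString(movingAverage3(in)));
    }
}
```

The point of the spatial version is that the three additions and the division exist as physical arithmetic units connected by short edges, so once the pipeline is full, every clock cycle delivers one averaged value.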
As far as compilation is concerned, it consists of two parts: (a) generating the execution graph, and (b) mapping the execution graph onto the underlying reconfigurable infrastructure. The Maxeler compiler (MaxCompiler) is responsible for the first part; the synthesis tools of the FPGA device vendor are responsible for the second. VHDL is used as the interface between these two distinct phases.
As far as the synergy between the compiler and the internal “nano-accelerators” is concerned (small add-ons at the hardware-structure level, invisible at the level of the static dataflow concept and invisible to the system designer, but under the tight control of the compiler), the most illustrative fact is that code compiled for a generic board based on Altera or Xilinx chips is much slower than code compiled for a Maxeler board with the same type and number of Altera or Xilinx chips. This is due to the many nano-accelerators present on the Maxeler board: invisible at the concept level and to the dataflow programmer, yet extremely important for the implementation of the concept, and visible to the Maxeler compiler, which knows how to utilize them.
As far as the tools for analysis, development, and testing are concerned, they have to take care of the following issues: (a) when the paradigm changes, algorithms that were previously the best by some criterion may no longer be, so existing algorithms have to be re-evaluated, or new ones created; (b) when the paradigm changes, the usage of the new paradigm has to be made simple, so that the community accepts it with as short a delay as possible; and (c) when the paradigm changes, the testing tools have to be made compatible with the needs of the new paradigm and with the types of errors that occur most frequently.
The best example of the first issue is bitonic sort, which is considered among the slowest sorting approaches in control flow and among the fastest in dataflow. The best example of the second is WebIDE.Maxeler.com, with all its features to support programmers. The most appropriate example of the third is the ongoing research aimed at including a machine-learning assistant that analyses past errors and offers hints for future testing-oriented work.
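The bitonic sort case illustrates the re-evaluation point well: its sequence of compare-exchange operations is fixed in advance and independent of the data, which makes it do more work than, say, quicksort on a control flow machine, but is exactly what allows the whole network to be laid out in space as a fixed pipeline. A minimal sketch of the algorithm (standard iterative form, for power-of-two input sizes):

```java
import java.util.Arrays;

// Bitonic sort (ascending) for arrays whose length is a power of two.
// The compare-exchange pattern depends only on n, never on the data,
// which is why the network maps naturally onto a spatial implementation.
public class BitonicSort {
    static void sort(int[] a) {
        int n = a.length; // must be a power of two
        for (int k = 2; k <= n; k <<= 1) {          // size of bitonic subsequences
            for (int j = k >> 1; j > 0; j >>= 1) {  // compare-exchange distance
                for (int i = 0; i < n; i++) {
                    int partner = i ^ j;            // data-independent pairing
                    if (partner > i) {
                        boolean ascending = (i & k) == 0;
                        if (ascending == (a[i] > a[partner])) {
                            int tmp = a[i]; a[i] = a[partner]; a[partner] = tmp;
                        }
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] data = {7, 3, 8, 1, 6, 2, 5, 4};
        sort(data);
        System.out.println(Arrays.toString(data)); // [1, 2, 3, 4, 5, 6, 7, 8]
    }
}
```

In a dataflow implementation, each iteration of the inner loop becomes a column of comparators in hardware, so the data-independence of the pairing (i ^ j) is what turns an unremarkable control flow algorithm into one of the fastest spatial ones.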
All the above support technologies have to be aware of the following: if temporal and spatial computing are separated (hybrid computing with only combinational logic in the spatial part), and if latency is traded for better performance (maximal acceleration even for the most sophisticated types of data dependencies between loop iterations), the benefits occur only if the support technologies are able to exploit them.
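The latency-for-performance trade can be quantified with textbook pipeline arithmetic: a pipeline of depth D delivers its first result only after D cycles, but then produces one result per cycle, so for a stream of N items the initial latency is amortized away. A sketch, with D and N as purely illustrative parameters:

```java
// Textbook pipeline arithmetic: pushing N items through a pipeline of
// depth D takes D + N - 1 cycles (fill the pipeline, then one result per
// cycle), versus roughly N * D cycles if each item is processed
// start-to-finish before the next one begins.
public class PipelineMath {
    static long pipelined(long depth, long items)  { return depth + items - 1; }
    static long sequential(long depth, long items) { return depth * items; }

    public static void main(String[] args) {
        long d = 100;       // deep pipeline: high latency per individual item
        long n = 1_000_000; // long stream: latency amortized over many items
        System.out.println("pipelined:  " + pipelined(d, n) + " cycles");  // 1000099
        System.out.println("sequential: " + sequential(d, n) + " cycles"); // 100000000
    }
}
```

This is why the dataflow approach accepts a worse per-item latency: for long enough streams, throughput approaches one result per cycle regardless of pipeline depth, provided the support technologies can actually keep the pipeline full.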
Presentation of the Maxeler Application Gallery project
The next section of this paper contains selected examples from the Maxeler Application Gallery project, available on the web at AppGallery.Maxeler.com. Each example is presented using the same template.
Wherever possible, the presentation template includes the following elements: (a) the whereabouts; (b) the essence of the algorithm implementation, supported with a figure; (c) the coding changes made necessary by the paradigm change, supported by GitHub details; (d) the major highlights, supported with a figure; (e) the advantages and drawbacks; (f) the future trends; and (g) conclusions related to complexity, performance, energy, and risks, where newly created risks may lead to problems in these same domains and recursively open a new round of all the above.