Abstract
Vector computing can be read as equipping a processing unit with replicated ALUs. To benefit from this hardware concurrency, we have to phrase our calculations as sequences of operations over small vectors. We discuss what loops have to look like so that they can exploit vector units, how vectorised loops relate to partial loop unrolling, and how horizontal and vertical vectorisation differ. These insights allow us to study how vectorisation is realised on the chip: Does the hardware use wide vector registers, or does it introduce hardware threads that run in lockstep? How do masking and blending facilitate slight thread divergence? And why does non-contiguous data access require us to gather and scatter register content?