Performance anomalies observed when running Gaussian frequency calculations in parallel on SGI Altix computers with a CC-NUMA memory architecture are analyzed using performance tools that access hardware counters. The bottleneck is that all threads involved in the calculation frequently, and nearly simultaneously, load data allocated on the node where the master thread runs. Code changes that localize these data loads improve performance by a factor close to two. The improvements carry over to other molecular models and other types of calculations. An extension of, or an alternative to, OpenMP’s clause can facilitate the code transformations.
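
The idea of localizing data loads can be illustrated with a minimal sketch in C with OpenMP. It is not the code discussed here (the function name `sum_with_local_copies`, the table layout, and the summation kernel are all hypothetical), and it assumes a first-touch page placement policy, under which memory is placed on the node of the thread that first writes it. Each thread copies the shared read-only table once into a buffer it allocates itself, then performs its repeated reads from that private copy instead of from the master thread's node.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical kernel: the table `shared` (n entries) is read in full
 * `repeats` times, with the repetitions divided among the threads.
 * Returns repeats * (sum of all entries in shared). */
double sum_with_local_copies(const double *shared, size_t n, int repeats)
{
    double total = 0.0;

    #pragma omp parallel reduction(+:total)
    {
        /* Allocated and first written by this thread, so under the
         * assumed first-touch policy the pages are placed near it. */
        double *local = malloc(n * sizeof *local);
        if (local != NULL) {
            /* One remote read of the table per thread. */
            memcpy(local, shared, n * sizeof *local);
        }
        /* Fall back to the shared table if the allocation failed. */
        const double *src = (local != NULL) ? local : shared;

        #pragma omp for
        for (int r = 0; r < repeats; ++r) {
            double s = 0.0;
            for (size_t i = 0; i < n; ++i) {
                s += src[i];   /* repeated loads hit the private copy */
            }
            total += s;
        }

        free(local);   /* free(NULL) is a no-op */
    }
    return total;
}
```

Compiled without OpenMP support, the pragmas are ignored and the function runs serially with the same result. The sketch trades one extra copy per thread for local reads afterwards, which only pays off when the table is read many times; whether a similar effect can be obtained through an OpenMP data-sharing clause instead of a manual copy depends on the compiler and runtime.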