2015 | Original Paper | Book Chapter

Montgomery Modular Multiplication on ARM-NEON Revisited

Authors: Hwajeong Seo, Zhe Liu, Johann Großschädl, Jongseok Choi, Howon Kim

Published in: Information Security and Cryptology - ICISC 2014

Publisher: Springer International Publishing

Abstract

Montgomery modular multiplication constitutes the “arithmetic foundation” of modern public-key cryptography with applications ranging from RSA, DSA and Diffie-Hellman over elliptic curve schemes to pairing-based cryptosystems. The increased prevalence of SIMD-type instructions in commodity processors (e.g. Intel SSE, ARM NEON) has initiated a massive body of research on vector-parallel implementations of Montgomery modular multiplication. In this paper, we introduce the Cascade Operand Scanning (COS) method to speed up multi-precision multiplication on SIMD architectures. We developed the COS technique with the goal of reducing Read-After-Write (RAW) dependencies in the propagation of carries, which also reduces the number of pipeline stalls (i.e. bubbles). The COS method operates on 32-bit words in a row-wise fashion (similar to the operand-scanning method) and does not require a “non-canonical” representation of operands with a reduced radix. We show that two COS computations can be “coarsely” integrated into an efficient vectorized variant of Montgomery multiplication, which we call the Coarsely Integrated Cascade Operand Scanning (CICOS) method. Due to our sophisticated instruction scheduling, the CICOS method reaches record-setting execution times for Montgomery modular multiplication on ARM-NEON platforms. Detailed benchmarking results obtained on ARM Cortex-A9 and Cortex-A15 processors show that the proposed CICOS method outperforms Bos et al.'s implementation from SAC 2013 by up to 57% (A9) and 40% (A15), respectively.
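
For orientation, the following is a minimal sketch of textbook word-serial Montgomery multiplication on 32-bit words, written in Python. It is not the COS or CICOS method from the paper and contains no SIMD code. It only illustrates the row-wise structure (one multiplication row and one reduction row per operand word) that the abstract refers to. The function names, the default of eight words, and the use of Python's arbitrary-precision integers in place of carry propagation are assumptions made for illustration.

```python
def montgomery_multiply(a, b, m, word_bits=32, num_words=8):
    """Word-serial Montgomery product a * b * R^-1 mod m, R = 2^(word_bits*num_words).

    Assumes m is odd, m < R, and 0 <= a, b < m.
    """
    mask = (1 << word_bits) - 1
    # M' = -m^-1 mod 2^w, the per-word Montgomery constant (needs Python 3.8+).
    m_prime = (-pow(m, -1, 1 << word_bits)) & mask
    c = 0
    for i in range(num_words):
        a_i = (a >> (word_bits * i)) & mask   # scan operand A one word at a time
        c += a_i * b                          # multiplication row
        q = ((c & mask) * m_prime) & mask     # quotient word for this row
        c = (c + q * m) >> word_bits          # reduction row; the low word is zero
    return c - m if c >= m else c             # final conditional subtraction


def to_montgomery(x, m, word_bits=32, num_words=8):
    """Map x to its Montgomery representation x * R mod m (hypothetical helper)."""
    return (x << (word_bits * num_words)) % m
```

With operands in Montgomery form, `montgomery_multiply(to_montgomery(a, m), to_montgomery(b, m), m)` equals `to_montgomery(a * b % m, m)`, so a chain of modular multiplications can stay in that representation and avoid trial division.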

Footnotes
1
Note that the timings in the proceedings version of Bos et al.'s paper differ from the version in the IACR eprint archive at https://eprint.iacr.org/2013/519. We used the faster timings from the eprint version for comparison with our work.
 
2
Operands \(A[0 \sim 7]\) and \(B[0 \sim 7]\) are stored in 32-bit registers. Intermediate results \(C[0 \sim 15]\) are stored in 64-bit registers, where each 64-bit register is used as two packed 32-bit registers.
 
3
In the first round, the sum of the higher and lower bits lies within [0, 0x1_ffff_fffd], because the higher and lower bits of the intermediate results \((C[0 \sim 7])\) lie in the ranges [0, 0xffff_fffe] and [0, 0xffff_ffff], respectively. From the second round onward, the sum of the higher and lower bits lies within [0, 0x1_ffff_fffe], because both the higher and lower bits lie in the range [0, 0xffff_ffff].
 
4
In the first round, the intermediate results (\(C[0\sim 7]\)) lie in the range [0, 0x1_ffff_fffd], so the multiply-accumulate results lie in the range [0, 0xffff_ffff_ffff_fffe]. From the second round onward, the intermediate results lie in [0, 0x1_ffff_fffe], so the multiply-accumulate results lie in the range [0, 0xffff_ffff_ffff_ffff].
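
The bounds in footnotes 3 and 4 can be checked with a few lines of integer arithmetic. The sketch below is a Python illustration, not code from the paper. It assumes that the first-round bound 0xffff_fffe on the higher bits comes from the upper half of a single 32x32-bit product, and that each accumulation adds one such product to the previous sum. The constant and function names are hypothetical.

```python
WORD_MAX = 0xFFFF_FFFF                 # largest 32-bit word
PRODUCT_MAX = WORD_MAX * WORD_MAX      # largest 32x32-bit product


def accumulation_bounds():
    """Return (sum_first, acc_first, sum_later, acc_later) upper bounds."""
    # First round: higher half is bounded by the top 32 bits of one product.
    hi_first = PRODUCT_MAX >> 32       # 0xffff_fffe
    sum_first = hi_first + WORD_MAX    # higher + lower bits, footnote 3
    acc_first = PRODUCT_MAX + sum_first  # product plus that sum, footnote 4
    # Later rounds: both halves may reach 0xffff_ffff.
    sum_later = WORD_MAX + WORD_MAX
    acc_later = PRODUCT_MAX + sum_later
    return sum_first, acc_first, sum_later, acc_later


if __name__ == "__main__":
    for bound in accumulation_bounds():
        print(hex(bound))
```

Under these assumptions the largest accumulated value is exactly 2^64 - 1, so a 64-bit register holds every multiply-accumulate result without overflow.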
 
5
The NEON engine provides sixteen 128-bit registers. We assigned four registers to the operands (\(A, B\)), four to the intermediate results (\(C\)) and four to temporary storage.
 
6
Operands \(A[0 \sim 7]\), \(B[0 \sim 7]\), \(M[0 \sim 7]\), \(Q[0 \sim 7]\) and \(M'\) are stored in 32-bit registers. Intermediate results \(C[0 \sim 15]\) are stored in 64-bit registers.
 
7
In the first round, the sum of the higher and lower bits lies within [0, 0x1_ffff_fffd], because the higher and lower bits of the intermediate results \((C[0 \sim 7])\) lie in the ranges [0, 0xffff_fffe] and [0, 0xffff_ffff], respectively. From the second round onward, the sum of the higher and lower bits lies within [0, 0x1_ffff_fffe], because both the higher and lower bits lie in the range [0, 0xffff_ffff].
 
References
1.
Barrett, P.: Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In: Odlyzko, A.M. (ed.) CRYPTO 1986. LNCS, vol. 263, pp. 311–323. Springer, Heidelberg (1987)
2.
Bernstein, D.J., Schwabe, P.: NEON crypto. In: Prouff, E., Schaumont, P. (eds.) CHES 2012. LNCS, vol. 7428, pp. 320–339. Springer, Heidelberg (2012)
4.
Bos, J.W., Kaihara, M.E.: Montgomery multiplication on the Cell. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski, J. (eds.) PPAM 2009, Part I. LNCS, vol. 6067, pp. 477–485. Springer, Heidelberg (2010)
5.
Bos, J.W., Montgomery, P.L., Shumow, D., Zaverucha, G.M.: Montgomery multiplication using vector instructions. In: Lange, T., Lauter, K., Lisoněk, P. (eds.) SAC 2013. LNCS, vol. 8282, pp. 471–490. Springer, Heidelberg (2014)
6.
Câmara, D., Gouvêa, C.P.L., López, J., Dahab, R.: Fast software polynomial multiplication on ARM processors using the NEON engine. In: Cuzzocrea, A., Kittl, C., Simos, D.E., Weippl, E., Xu, L. (eds.) CD-ARES Workshops 2013. LNCS, vol. 8128, pp. 137–154. Springer, Heidelberg (2013)
7.
Faz-Hernández, A., Longa, P., Sánchez, A.H.: Efficient and secure algorithms for GLV-based scalar multiplication and their implementation on GLV-GLS curves. In: Benaloh, J. (ed.) CT-RSA 2014. LNCS, vol. 8366, pp. 1–27. Springer, Heidelberg (2014)
8.
Gueron, S., Krasnov, V.: Software implementation of modular exponentiation, using advanced vector instructions architectures. In: Özbudak, F., Rodríguez-Henríquez, F. (eds.) WAIFI 2012. LNCS, vol. 7369, pp. 119–135. Springer, Heidelberg (2012)
10.
Montgomery, P.L.: Modular multiplication without trial division. Math. Comput. 44(170), 519–521 (1985)
11.
Pabbuleti, K.C., Mane, D.H., Desai, A., Albert, C., Schaumont, P.: SIMD acceleration of modular arithmetic on contemporary embedded platforms. In: 2013 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6. IEEE (2013)
12.
Quisquater, J.-J.: Procédé de codage selon la méthode dite RSA, par un microcontrôleur et dispositifs utilisant ce procédé. Demande de brevet français. (Dépôt numéro: 90 02274), 122 (1990)
13.
Quisquater, J.-J.: Encoding system according to the so-called RSA method, by means of a microcontroller and arrangement implementing this system, 24 November 1992. US Patent 5,166,978
14.
Sánchez, A.H., Rodríguez-Henríquez, F.: NEON implementation of an attribute-based encryption scheme. In: Jacobson, M., Locasto, M., Mohassel, P., Safavi-Naini, R. (eds.) ACNS 2013. LNCS, vol. 7954, pp. 322–338. Springer, Heidelberg (2013)
Metadata
Title
Montgomery Modular Multiplication on ARM-NEON Revisited
Authors
Hwajeong Seo
Zhe Liu
Johann Großschädl
Jongseok Choi
Howon Kim
Copyright Year
2015
DOI
https://doi.org/10.1007/978-3-319-15943-0_20
