1 Introduction
- We present TZmCFI, a CFI implementation for resource-constrained microcontrollers and RTOS-based applications (Sect. 3).
- We propose an improved shadow exception stack technique that supports multi-tasking systems and imposes a lower overhead (Sect. 4).
- We evaluate a prototype system based on TZmCFI from runtime performance and security points of view (Sect. 5).
2 Background
2.1 Control Flow Integrity
2.1.1 Protecting CFI States
2.1.2 System-Level CFI
2.2 Armv8-M
2.3 Previous Works
3 TZmCFI
3.1 Assumptions
3.2 Design
TCCreateThread to initialize the task structure internal to Monitor with an initial program counter and obtain a task ID. However, this is vulnerable to data-oriented attacks. To protect the system against such attacks, initializing task structures is permitted only during system startup. The operating system signals that the startup process is complete through a manually inserted hook, after which code pointers generated by untrusted code are no longer trusted. This state persists until a system reset. After system startup, the operating system must notify Monitor of context switches through another hook, passing the previously obtained task ID as a parameter.
3.2.1 Shadow Stacks
- Shadow Push, inserted into a prologue, pushes a return target onto the current task’s shadow stack.
- Shadow Assert, inserted into an epilogue, pops a trustworthy return target from the shadow stack, superseding the untrustworthy one from the stack. If this type of monitor call is followed by a function return instruction, the two are fused into a Shadow Assert Return monitor call for improved runtime performance.
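The interplay between startup-only task creation, context-switch notification, and the two shadow stack monitor calls can be sketched as a small host-side model. This is only an illustration under stated assumptions, not the TZmCFI implementation: the function names, the fixed table sizes, and the boolean mismatch flag returned by `shadow_assert` are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TASKS    4   /* hypothetical limit for this sketch */
#define SHADOW_DEPTH 16  /* hypothetical per-task shadow stack depth */

typedef struct {
    uint32_t entries[SHADOW_DEPTH]; /* trusted return targets */
    size_t   top;                   /* number of valid entries */
    uint32_t initial_pc;            /* recorded at task creation */
} task_shadow_t;

static task_shadow_t tasks[MAX_TASKS];
static size_t task_count = 0;
static size_t current_task = 0;
static bool startup_complete = false;

/* Task creation: returns a task ID, or -1 after startup has completed
 * (or when the table is full). */
int monitor_create_thread(uint32_t initial_pc) {
    if (startup_complete || task_count >= MAX_TASKS) return -1;
    tasks[task_count].top = 0;
    tasks[task_count].initial_pc = initial_pc;
    return (int)task_count++;
}

/* Hook signalling that startup is over; the state is never cleared. */
void monitor_startup_done(void) { startup_complete = true; }

/* Context-switch hook: selects the shadow stack of the given task ID. */
bool monitor_switch_to(int task_id) {
    if (task_id < 0 || (size_t)task_id >= task_count) return false;
    current_task = (size_t)task_id;
    return true;
}

/* Shadow Push: record a return target on the current task's shadow stack. */
bool shadow_push(uint32_t return_target) {
    if (task_count == 0) return false;
    task_shadow_t *t = &tasks[current_task];
    if (t->top >= SHADOW_DEPTH) return false;
    t->entries[t->top++] = return_target;
    return true;
}

/* Shadow Assert: pop the trusted target into *trusted_out, which the caller
 * uses in place of the untrusted value. The return value reports whether
 * the two matched (an addition of this sketch). */
bool shadow_assert(uint32_t untrusted, uint32_t *trusted_out) {
    if (task_count == 0) return false;
    task_shadow_t *t = &tasks[current_task];
    if (t->top == 0) return false;
    *trusted_out = t->entries[--t->top];
    return *trusted_out == untrusted;
}
```

In this model, a caller would always continue at the value written to `trusted_out`, so a corrupted return address on the ordinary stack cannot redirect control flow, and each task's entries stay isolated from those of other tasks.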
4 Shadow Exception Stack
4.1 Exception Handling in Armv8-M
FAULTMASK and PRIMASK. When set to a non-default value, they raise the execution priority, effectively disabling exceptions.
LR, a general-purpose register commonly used for storing a return address, with a special value called EXC_RETURN. After that, the processor loads a vector address from an exception vector table and transfers control to it. When the program performs an indirect jump to LR (which is exactly the same as a normal function return) and LR contains EXC_RETURN, the processor does not simply update the program counter but instead initiates an exception return sequence, where the original context state is restored from the stack. This process uses information from bit fields in EXC_RETURN, e.g., to determine which stack holds the exception frame.
4.1.1 Naïve Shadow Stacks
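As a concrete illustration of the EXC_RETURN bit fields discussed in Sect. 4.1, the following sketch decodes the fields that identify where the exception frame lives. The bit positions follow the Armv8-M architecture documentation as we understand it; the struct and function names are hypothetical and exist only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* EXC_RETURN bit fields (Armv8-M). Names of the macros are our own. */
#define EXC_RETURN_PREFIX 0xFF000000u /* bits [31:24] are all ones */
#define EXC_RETURN_S      (1u << 6)   /* 1: frame is on a Secure stack */
#define EXC_RETURN_FTYPE  (1u << 4)   /* 1: standard (non-FP) frame */
#define EXC_RETURN_MODE   (1u << 3)   /* 1: return to Thread mode */
#define EXC_RETURN_SPSEL  (1u << 2)   /* 1: frame is on the process stack */

typedef struct {
    bool secure_stack;   /* Secure vs. Non-Secure stack */
    bool process_stack;  /* process stack (PSP) vs. main stack (MSP) */
    bool thread_mode;    /* returning to Thread vs. Handler mode */
    bool standard_frame; /* no floating-point context stacked */
} exc_return_info_t;

/* Simplified check: a value is treated as EXC_RETURN when its top byte
 * is 0xFF. The processor applies further conditions not modeled here. */
bool is_exc_return(uint32_t lr) {
    return (lr & EXC_RETURN_PREFIX) == EXC_RETURN_PREFIX;
}

/* Extract the fields that locate and describe the exception frame. */
exc_return_info_t decode_exc_return(uint32_t lr) {
    exc_return_info_t info;
    info.secure_stack   = (lr & EXC_RETURN_S) != 0;
    info.process_stack  = (lr & EXC_RETURN_SPSEL) != 0;
    info.thread_mode    = (lr & EXC_RETURN_MODE) != 0;
    info.standard_frame = (lr & EXC_RETURN_FTYPE) != 0;
    return info;
}
```

For example, the commonly seen value 0xFFFFFFBC decodes to a standard frame on the Non-Secure process stack with a return to Thread mode.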
4.2 Proposed Solution
PC, LR, EXC_RETURN, R12, and a pointer to the exception frame). Protecting R12 is not essential as far as only exception handling is concerned. R12 is included because the shadow stack instrumentation code uses it to pass a continuation address (the return address for a monitor call, which should not be confused with the return address of the instrumented function) as part of its special calling convention; corrupting it may therefore lead to a control-flow violation.
4.3 Multi-tasking
4.4 Performance Optimization
CONTROL register is updated to transition the calling task into privileged mode, but this greatly increases CPU utilization because all Non-Secure exception handlers are to be instrumented. For this reason, we took an alternative approach in which we replaced this mechanism with a secure function named TCRaisePrivilege. For FreeRTOS, switching to this approach is as easy as pointing the macro portRAISE_PRIVILEGE to the secure function. Assuming CFI is in place, this approach does not weaken security because CFI prevents the function from being called from a disallowed location.
EXC_RETURN in this case) is subject to protection by the shadow stack. The shadow exception stack instrumentation replaces the return address with a constant function pointer to the return trampoline. This means that it is actually unnecessary to preserve or protect the return address, provided that the return instruction is replaced with a direct jump to the return trampoline. We implemented this optimization technique as a new LLVM calling convention, TC_INTR. Fig. 8 is a concrete example that illustrates this technique.
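The accelerated privilege escalation described in this section can be pictured with a small host-side simulation. This is a sketch under assumptions, not the actual secure function: the CONTROL register is modeled as a plain variable, the `sim_` helpers are hypothetical, and on real hardware TCRaisePrivilege would be a Secure entry function rather than ordinary C code. The only architectural fact relied on is that bit 0 of CONTROL (nPRIV) selects unprivileged Thread mode when set.

```c
#include <stdbool.h>
#include <stdint.h>

#define CONTROL_NPRIV (1u << 0) /* CONTROL.nPRIV: 1 = unprivileged Thread mode */

/* Simulated Non-Secure CONTROL register; the task starts unprivileged. */
static uint32_t sim_control = CONTROL_NPRIV;

/* Stand-in for the secure function: clears nPRIV directly instead of
 * taking an exception to reach a privileged handler. */
void TCRaisePrivilege(void) { sim_control &= ~CONTROL_NPRIV; }

/* Hypothetical helpers used only by this simulation. */
void sim_drop_privilege(void) { sim_control |= CONTROL_NPRIV; }
bool sim_is_privileged(void)  { return (sim_control & CONTROL_NPRIV) == 0; }

/* The FreeRTOS port macro is pointed at the secure function. */
#define portRAISE_PRIVILEGE() TCRaisePrivilege()
```

Because the macro expands to a direct function call, no exception entry or return takes place on this path, which is where the saving comes from when every Non-Secure exception handler is instrumented.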
5 Evaluation
5.1 Overhead Analysis
-O3 and -Os, respectively. We modified Zig to produce LLVM bitcode output (which Clang already could do) to enable link-time optimization for all compiled code.
- Ctx: The use of the context management API, including task creation and context switching. This is technically not a CFI mechanism by itself but rather a prerequisite for other mechanisms. Note that even without TZmCFI, multi-tasking applications are still required to use the API (with a slimmer implementation) to correctly preempt the execution of Secure functions. Also, task deletion is never performed because it is not supported by TZmCFI.
- SES: Shadow exception stacks (Sect. 4).
- APE: Accelerated privilege escalation (Sect. 4.4).
- SS: Our multi-task-aware TrustZone-based implementation of shadow stacks (Sect. 3.2.1). The two flavors of the implementation, Aborting and Non-Aborting, are evaluated separately.
- EntInt and LeaInt represent the execution of an exception trampoline and an exception return trampoline. This pair of operations is executed every time an exception is handled to keep track of valid exception return targets on SES.
- ShPush and ShAsrt represent push and pop operations on SS, which are inserted into function prologues and epilogues by the compiler. Some leaf functions do not have them if the return target is not spilled to memory.
- ShAsrtRet is a fused operation of ShAsrt and a function return, used for performance optimization.
5.1.1 Interrupt Latency
Skew | ReleaseFast, uninstrumented | ReleaseFast, instrumented | ReleaseSmall, uninstrumented | ReleaseSmall, instrumented |
---|---|---|---|---|
\(< -4\) | 35 | 132 (+97) | 31 | 131 (+100) |
2 | 30 | 126 (+96) | 25 | 125 (+100) |
7 | 36 | 236 (+200) | 31 | 233 (+202) |
cpsid f instruction that disables interrupts was swapped with the next instruction. An exception entry chain was observed at \(skew = 7\), and the interrupt response time was 140 cycles (ReleaseFast) and 142 cycles (ReleaseSmall). Rewriting the program to use the RAM block inside the FPGA yielded little improvement. Because the algorithm’s most time-consuming part is expected to be the stack walk loop, this result suggested that the overhead was largely caused by inefficiencies in the invocation of the core subroutines, not by the algorithm.
5.1.2 FreeRTOS+MPU System Calls
Build mode | Ctx | SES | APE | Shadow stack | Overhead\(^{\mathrm{a}}\) | NewTask | DelTask | NewTask+disp | DelTask+disp | SemTake | SemGive | SemGive+disp |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ReleaseFast | | | | | 18 | 1026 | 347 | 2803 | 429 | 175 | 210 | 561 |
ReleaseFast | \(\checkmark \) | | | | 18 | 3050 | 347 | 4927 | 529 | 175 | 210 | 661 |
ReleaseFast | \(\checkmark \) | | \(\checkmark \) | | 18 | 3012 | 309 | 4888 | 492 | 137 | 171 | 622 |
ReleaseFast | \(\checkmark \) | \(\checkmark \) | | | 18 | 3228 | 526 | 5284 | 887 | 354 | 389 | 1019 |
ReleaseFast | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | | 18 | 3011 | 309 | 5066 | 671 | 137 | 171 | 801 |
ReleaseFast | \(\checkmark \) | \(\checkmark \) | | Non-aborting | 59 | 3517 | 747 | 5575 | 1051 | 482 | 518 | 1229 |
ReleaseFast | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | Non-aborting | 59 | 3258 | 488 | 5315 | 792 | 222 | 258 | 969 |
ReleaseFast | \(\checkmark \) | \(\checkmark \) | | Aborting | 62 | 3538 | 762 | 5593 | 1063 | 491 | 527 | 1244 |
ReleaseFast | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | Aborting | 62 | 3276 | 500 | 5330 | 801 | 228 | 264 | 981 |
ReleaseSmall | | | | | 19 | 1279 | 359 | 4003 | 429 | 177 | 210 | 561 |
ReleaseSmall | \(\checkmark \) | | | | 19 | 4065 | 359 | 6889 | 529 | 177 | 210 | 661 |
ReleaseSmall | \(\checkmark \) | | \(\checkmark \) | | 19 | 4027 | 321 | 6851 | 491 | 139 | 171 | 622 |
ReleaseSmall | \(\checkmark \) | \(\checkmark \) | | | 19 | 4242 | 537 | 7244 | 885 | 355 | 388 | 1017 |
ReleaseSmall | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | | 19 | 4026 | 321 | 7028 | 669 | 139 | 171 | 800 |
ReleaseSmall | \(\checkmark \) | \(\checkmark \) | | Non-aborting | 59 | 4616 | 844 | 7606 | 1072 | 481 | 517 | 1230 |
ReleaseSmall | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | Non-aborting | 59 | 4357 | 586 | 7347 | 813 | 222 | 258 | 971 |
ReleaseSmall | \(\checkmark \) | \(\checkmark \) | | Aborting | 62 | 4643 | 865 | 7627 | 1087 | 490 | 526 | 1245 |
ReleaseSmall | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | Aborting | 62 | 4381 | 604 | 7365 | 825 | 228 | 264 | 983 |
Build mode | Event | Overhead\(^{\mathrm{a}}\) | NewTask | DelTask | NewTask+disp | DelTask+disp | SemTake | SemGive | SemGive+disp |
---|---|---|---|---|---|---|---|---|---|
ReleaseFast | EntInt | 0 | 1 | 1 | 2 | 2 | 1 | 1 | 2 |
ReleaseFast | LeaInt | 0 | 1 | 1 | 2 | 2 | 1 | 1 | 2 |
ReleaseFast | ShPush | 1 | 8 | 6 | 9 | 5 | 4 | 4 | 6 |
ReleaseFast | ShAsrt | 0 | 1 | 3 | 1 | 1 | 1 | 1 | 1 |
ReleaseFast | ShAsrtRet | 1 | 7 | 3 | 6 | 4 | 3 | 3 | 5 |
ReleaseSmall | EntInt | 0 | 1 | 1 | 2 | 2 | 1 | 1 | 2 |
ReleaseSmall | LeaInt | 0 | 1 | 1 | 2 | 2 | 1 | 1 | 2 |
ReleaseSmall | ShPush | 1 | 10 | 8 | 11 | 5 | 4 | 4 | 6 |
ReleaseSmall | ShAsrt | 0 | 1 | 4 | 1 | 1 | 1 | 1 | 1 |
ReleaseSmall | ShAsrtRet | 1 | 9 | 4 | 7 | 5 | 3 | 3 | 5 |
5.1.3 CoreMark
Build mode | Event | Count | Build mode | Event | Count |
---|---|---|---|---|---|
ReleaseFast | ShPush | 1249 | ReleaseSmall | ShPush | 1671 |
ReleaseFast | ShAsrtRet | 1249 | ReleaseSmall | ShAsrtRet | 1671 |