Power Consumption and Its Limits
Desktop server/workstation
High-performance workstations and servers
Recent areas of focus to decrease TDP
An electrical outlet, commonly known as the “AC power source”
A so-called SMPS unit, commonly known as the “DC power source”
A rechargeable battery
Media Workloads on Consumer Platforms
Still image capture
Still image preview/view finder
Wireless display or Miracast: clone mode or extended mode
Browser-based video streaming
Video recording and dual video recording
Videophone and video chat
Video email and multimedia messaging
Video upload to Internet
Power Management Features
CPU settings (e.g., CPU states enabled, CPU fan throttling), platform settings (e.g., thermal high/low watermarks, chassis fan throttling, etc.)
HLT (halt instruction in x86 for CPU to halt until next external interrupt is fired), Stop clock, Intel SpeedStep (aka dynamic frequency scaling)
Blanking, dimming, power saver mode, efficient energy use as specified in the Energy Star international standard
Power down to intermediate state, power shutoff
Hard drive/CD-ROM
Wake on LAN
Power state transition of devices such as mouse, USB drives, etc.; wake on access (e.g., mouse movement)
ACPI and Power Management
ACPI Power States
S0: The system is fully usable. CPUs are active. Devices may or may not be active, and can possibly enter a lower power state. There is a subset of S0, called “Away mode,” where the monitor is off, but background tasks are running.
S1: The system appears to be off. Power consumption is reduced. All the processor caches are flushed, and the CPU(s) stops executing instructions. The power to the CPU(s) and RAM is maintained. Nonessential devices may be powered off. This state is rarely used.
S2: The system appears to be off. CPU is powered off. Dirty cache is flushed to RAM. Similar to S1, the S2 state is also rarely used.
S3: Commonly known as standby, sleep, or Suspend-to-RAM (STR). The system appears to be off. System context is maintained on the system DRAM. All power is shut to the noncritical circuits, but RAM power is retained. Transition to S0 takes longer than from S2 or S1.
S4: Known as hibernation or Suspend-to-Disk (STD). The system appears to be off. Power consumption is reduced to the lowest level. System context is maintained on the disk, preserving the state of the OS, applications, and open documents. Contents of the main memory are saved to nonvolatile memory such as a hard drive, and the system is powered down, except for the logic to resume.
S5: The system appears to be off. System context is not maintained. Some components may remain powered, so the computer can wake from input from a keyboard, mouse, LAN, or USB device. The working context can be restored if it is stored on nonvolatile memory. All power is shut, except for the logic required to restart. Full boot is required to restart.
G3 (mechanical off): The system is completely off and consumes no power. A full reboot is required for the system to return to the active state.
Subset of Dx
D0: Fully ON and operating.
D1 and D2: Intermediate power states. Definition varies by device.
D3 Hot: Device is off and unresponsive to bus, but the system is ON. Device is still connected to power. D3 Hot has auxiliary power enabling a higher power state. A transition from D0 to D3 implies D3 Hot.
D3 Cold: No power to the device; both the device and system are OFF. It is possible for the device to consume trickle power, but a wake event is needed to move the device and the system back to D0 and/or S0 states.
Power Management by the Operating System
Linux Power Management
The X Window
Intel Embedded Graphics Driver
Windows Power Management
System Power Requirements
1. Maximum battery life should be achieved with minimum energy consumption.
2. The delay for startup and shutdown should be minimal.
3. Power decisions should be made intelligently; for example, a device that is not in the best position to change the system power state should not do so.
4. Capabilities should be available to adjust fans or drive motors on demand for quiet operation.
5. All requirements should be met in a platform independent manner.
Device Power Requirements
6. Devices, especially for portable systems, must be extremely power conscious.
7. Devices should be aggressive in power savings:
a. Should provide just-in-time capabilities.
b. Low transition latency to higher states.
c. When possible, the device logic should be partitioned into separate power buses so that portions of a device can be turned off as needed.
d. Should support connected standby as appropriate for quick connection.
Windows Hardware Certification Requirements
8. The Windows HCK tests require that all devices support S3 and S4 without refusing a system sleep request.
9. Standby and connected standby must last for days.
10. Devices must queue I/O requests, and not lose them, while in the D1-D3 states.
Performance mode: In performance mode, the system attempts to deliver maximum performance without regard to power consumption.
Balanced mode: In this mode, the operating system attempts to reach a balance between performance and power.
Power saver mode: In this mode, the operating system attempts to save maximum power in order to preserve battery life, even sacrificing some performance.
Tune the application behavior based on the user’s current power policy.
Modify the application behavior in response to a change in power policy.
Move to a different power policy as required by the application.
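The first two items above, tuning to the current policy and reacting to a policy change, can be sketched as follows. This is an illustrative model only: the MediaPlayer class, the PLAYBACK_SETTINGS table, and its values are hypothetical, and a real application would receive the policy from the operating system's power notification interface rather than from a direct call.

```python
from enum import Enum


class PowerPolicy(Enum):
    PERFORMANCE = "performance"
    BALANCED = "balanced"
    POWER_SAVER = "power_saver"


# Hypothetical per-policy playback settings; the values are illustrative only.
PLAYBACK_SETTINGS = {
    PowerPolicy.PERFORMANCE: {"max_height": 2160, "post_processing": True},
    PowerPolicy.BALANCED: {"max_height": 1080, "post_processing": True},
    PowerPolicy.POWER_SAVER: {"max_height": 720, "post_processing": False},
}


class MediaPlayer:
    """Toy application that tunes itself to the active power policy."""

    def __init__(self, policy):
        self.settings = {}
        self.on_policy_change(policy)

    def on_policy_change(self, policy):
        # Re-tune behavior whenever the platform reports a new policy.
        self.settings = dict(PLAYBACK_SETTINGS[policy])
```

Routing both startup and later changes through the same handler keeps the application's behavior consistent with whichever policy is currently active.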
The Windows Driver Model
The Windows Driver Framework
Device Power Management in Windows 8
Device D1 and D2
Indicates whether the device supports D1, or D2, or both.
Wake from Dx
Indicates whether the device supports waking from a Dx state.
Defines the Dx state corresponding to each Sx state.
Nominal transition time to D0.
Dealing with Power Requests
IRP_MN_QUERY_POWER: A query to determine the capability of the device to safely enter a new requested Dx or Sx state, or a shutdown or restart of the device. If the device is capable of the transition at a given time, the driver should queue any further request that is contrary to the transition before announcing the capability, as a SET request typically follows a QUERY request.
IRP_MN_SET_POWER: An order to move the device to a new Dx state or respond to a new Sx state. Generally, device drivers carry out a SET request without fail; the exception is bus drivers such as USB drivers, which may return a failure if the device is in the process of being removed. Drivers serve a SET request by requesting appropriate change to the device power state, saving context when moving to a lower power state, and restoring context when transitioning to a higher power state.
IRP_MN_WAIT_WAKE: A request to the device driver to enable the device hardware so that an external wake event can awaken the entire system. One such request may be kept in a pending state at any given time until the external event occurs; upon occurrence of the event, the driver returns a success. If the device can no longer wake the system, the driver returns a failure and the Power Manager cancels the request.
IRP_MN_POWER_SEQUENCE: A query for the D1-D3 counters--that is, the number of times the device has actually been in a lower power state. The difference between the count before and the count after a sleep request would tell the Power Manager whether the device did get a chance to go to a lower power state, or if it was prohibited by a long latency, so that the Power Manager can take appropriate action and possibly not issue a sleep request for the device.
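The before-and-after comparison described for IRP_MN_POWER_SEQUENCE can be modeled in a few lines. The sketch below is a plain Python model, not driver code; the PowerSequence type and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerSequence:
    """Hypothetical snapshot of the per-state entry counters."""
    d1: int
    d2: int
    d3: int


def states_entered(before, after):
    """Return the low-power states whose counters advanced between snapshots."""
    return [
        name
        for name in ("d1", "d2", "d3")
        if getattr(after, name) > getattr(before, name)
    ]
```

An empty result means the device never actually reached a lower power state during the interval, which is the case in which a further sleep request may not be worthwhile.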
Power Management by the Processor
CPU States (C-states)
This is the only state that runs software. All clocks are running and the processor core is active. The processor can service snoops and maintain cache coherency in this state. All power management for interfaces, clock gating, etc., are controlled at the unit level.
The first level of power reduction occurs when the core processor executes an Auto-Halt instruction. This stops the execution of the instruction stream and greatly reduces the core processor’s power consumption. The core processor can service snoops and maintain cache coherency in this state. The processor’s North Complex logic does not explicitly distinguish C1 from C0.
The next level of power reduction occurs when the core processor is placed into the Stop Grant state. The core processor can service snoops and maintain cache coherency in this state. The North Complex only supports receiving a single Stop Grant.
Entry into the C2 state will occur after the core processor requests C2 (or deeper). Upon detection of a break event, the C2 state will be exited, entering the C0 state. The processor must ensure that the PLLs are awake and the memory is out of self-refresh at this point.
In this state, the core processor shuts down its PLL and cannot handle snoop requests. The core processor voltage regulator is also told to reduce the processor’s voltage. During the C4 state, the North Complex continues to handle traffic to memory so long as this traffic does not require a snoop (i.e., no coherent traffic requests are serviced).
The C4 state is entered by receiving a C4 request from the core processor/OS. The exit from C4 occurs when the North Complex detects a snoop-able event or a break event, which would cause it to wake up the core processor and initiate the sequence to return to the C0 state.
Deep Power Down
Prior to entering the C6 state, the core processor flushes its cache and saves its core context to a special on-die SRAM on a different power plane. Once the C6 entry sequence has completed, the core processor’s voltage can be completely shut off.
The key difference for the North Complex logic between the C4 state and the C6 state is that since the core processor’s cache is empty, there is no need to perform snoops on the internal front side bus (FSB). This means that bus master events (which would cause a popup from the C4 state to the C2 state) can be allowed to flow unhindered during the C6 state. However, the core processor must still be returned to the C0 state to service interrupts.
A residency counter is read by the core processor to enable an intelligent promotion/demotion based on energy awareness of transitions and history of residencies/transitions.
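One way to picture residency-based promotion and demotion is a selector that predicts the next idle period from recent residencies and picks the deepest state that is worth entering. This is a minimal sketch under stated assumptions: the break-even thresholds are invented numbers, and the simple mean predictor stands in for whatever heuristic the hardware or operating system actually uses.

```python
# Hypothetical break-even residencies in microseconds, ordered shallow to deep.
# A state is worth entering only if the idle period is at least this long.
BREAK_EVEN_US = [("C1", 0), ("C2", 50), ("C4", 200), ("C6", 1000)]


def choose_c_state(recent_idle_us):
    """Pick the deepest C-state justified by the recent idle history."""
    shallowest = BREAK_EVEN_US[0][0]
    if not recent_idle_us:
        return shallowest  # no history yet: stay shallow
    predicted = sum(recent_idle_us) / len(recent_idle_us)
    chosen = shallowest
    for state, break_even_us in BREAK_EVEN_US:
        if predicted >= break_even_us:
            chosen = state  # promote while the prediction covers the cost
    return chosen
```

Long recent residencies promote the selection toward C6, while a run of short idle periods demotes it toward C1.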
Performance States (P-states)
Thermal States (T-States)
The Voltage-Frequency Curve
Reduce voltage and frequency
Reduce activity and Cdyn
System integration optimization
Application level optimization
Dynamic voltage and frequency scaling
Use of low-level cache
Reducing CPU processing
Optimizing driver code to use the fewest CPU instructions to accomplish a task
Simplifying the device driver interface to match the hardware interface to minimize the command transformation costs
Using special-purpose hardware for some tasks with a balanced approach for task execution
Dynamic Voltage and Frequency Scaling
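As a rough illustration of why scaling voltage and frequency together is effective, the widely used approximation for dynamic power is P = Cdyn * V^2 * f. The sketch below evaluates it at two operating points; the capacitance, voltage, and frequency values are made up for the example and do not describe any particular processor.

```python
def dynamic_power_watts(cdyn_farads, voltage_volts, frequency_hz):
    """Approximate dynamic power as Cdyn * V^2 * f."""
    return cdyn_farads * voltage_volts ** 2 * frequency_hz


# Illustrative operating points: scale both voltage and frequency by 0.8.
nominal = dynamic_power_watts(1e-9, 1.0, 2.0e9)
scaled = dynamic_power_watts(1e-9, 0.8, 1.6e9)
power_ratio = scaled / nominal
```

Because voltage enters squared, lowering voltage and frequency by 20 percent each cuts the estimated dynamic power to roughly half of nominal.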
Use of Low-level Cache
As power consumption is proportional to execution residency, running less code in the CPU translates to less power consumption. So, performing code optimization of key software modules contributes to algorithmic optimization.
Processing tasks can be offloaded to dedicated power-efficient fixed-function media hardware blocks as supported by the platform.
In order to perform various stages in a pipeline of tasks for a given usage, it is generally necessary to expand the data into some intermediate representation within a stage. Storing such data requires a much larger bandwidth to memory and caches. The cost of memory transactions in terms of power consumption can be reduced by minimizing the memory bandwidth. Bandwidth reduction techniques are, therefore, important considerations for algorithmic optimization.
The concurrency available among various stages or substages of the pipeline may be explored and appropriate parallelization approaches may be made to reduce the execution time.
The I/O operations can be optimized by appropriate buffering to enable the packing of larger amounts of data followed by longer idle periods, as frequent short transfers do not give the modules a chance to power down for idle periods. Also, disk access latency and fragmentation in files should be taken into account for I/O optimization, as they may have significant impact in power consumption.
Appropriate scheduling and coalescing of interrupts provide the opportunity to maximize idle time.
All active tasks can be overlapped in all parts of the platform—for example, the CPU, the GPU, the I/O communication, and the storage.
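The I/O buffering idea above, packing many short transfers into fewer large ones so that modules can idle in between, can be sketched as a small coalescing writer. The class name, the sink callback, and the threshold are all hypothetical; a real implementation would also need to bound latency with a timeout.

```python
class CoalescingWriter:
    """Accumulate small writes and hand them to the sink in large batches."""

    def __init__(self, sink, threshold_bytes):
        self._sink = sink                # callable that performs the real transfer
        self._threshold = threshold_bytes
        self._pending = bytearray()
        self.transfers = 0               # number of times the sink was invoked

    def write(self, data):
        self._pending += data
        if len(self._pending) >= self._threshold:
            self.flush()

    def flush(self):
        if self._pending:
            self._sink(bytes(self._pending))
            self._pending.clear()
            self.transfers += 1
```

With a 4096-byte threshold, one hundred 100-byte writes reach the sink as three transfers instead of one hundred, leaving longer gaps in which the device can power down.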
Computational Complexity Reduction
Selecting Efficient Data Types
Code Parallelization and Optimization
Memory Transfer Reduction
System Integration Optimization
Reducing the number of layers.
Improving the understanding of the authors of various layers regarding each other’s capabilities and limitations.
Redefining the boundaries of the layers.
System Operating Point on the P-F Curve
Duty Cycle Reduction
Context Awareness by the Application
Instead of a full system scan as done while on AC power, a virus checker may start a partial scan of the system on battery power.
A media player may decide to trade off video quality to achieve longer playback of a Blu-ray movie.
A gaming application may choose to sacrifice some special effects to accommodate more sections of the game.
Applications Seeking User Intervention
An application can monitor battery capacity, and when the battery charge drops to a certain fraction of its capacity--say, 50 or 25 percent--the application may indicate a warning to the user interface to alert the user of the remaining battery capacity.
An application can respond to a power source change from AC to DC by notifying the user of the change and providing an option to dim the display.
An application can respond to ambient light level and request the user to adjust the display brightness.
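The battery-capacity warning in the first item can be sketched as a check for crossing a threshold while discharging. The 50 and 25 percent levels come from the text; the function name and the way charge readings are obtained are assumptions.

```python
WARNING_LEVELS = (50, 25)  # percent of capacity, as in the example above


def crossed_warning_level(previous_percent, current_percent, levels=WARNING_LEVELS):
    """Return the lowest warning level crossed since the last reading, or None."""
    for level in sorted(levels):
        if previous_percent > level >= current_percent:
            return level
    return None
```

Comparing against the previous reading makes the warning fire once per crossing rather than on every poll below the threshold.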
Understanding the impact of an application on power consumption by the system, and potentially finding optimization opportunities by tuning the application.
Determining the effect of software changes at the user level, at the driver level, or at the kernel level; and understanding whether there is any performance or power regression owing to code changes.
Verifying that debug code was removed from the software.
Determining the amount of power savings from power-management features, and verifying that such features are turned on.
Determining the performance per watt in order to drive performance and power tuning, thereby obtaining the best tradeoff in practical thermally constrained environments.
AC Power Measurement
DC Power Measurement
Considerations in Power Measurement
The TDP of the processor part under measurement.
The accuracy and precision of the data acquisition (DAQ) system; the ability of the DAQ and associated software to convert analog voltage signals to a digital data sequence in real time, and to process and analyze the data afterward.
Ambient temperature, heat dissipation, and cooling variations from one set of measurements to another; to hedge against run-to-run variation from environmental factors, a three-run set of measurements is usually taken and the median measured value is considered.
Separate annotation of appropriate power rails for associated power savings, while recording the power consumption on all power rails at a typical sampling rate of 1 kHz (i.e., one sample every millisecond), with a thermally relevant measurement window of between one and five seconds for the moving average.
Recognition of operating system background tasks and power policy; for example, when no media workload is running and the processor is apparently idle, the CPU may still be busy running background tasks; in addition, the power-saving policy of the operating system may have adjusted the high-frequency limit of the CPU, which needs to be carefully considered.
Consideration of average power over a period of time in order to eliminate the sudden spikes in power transients, and consideration only of steady-state power consumption behavior.
Benchmarks included for both synthetic settings and common usage scenarios; appropriate workloads considered for high-end usages so that various parts of the system get a chance to reach their potential limits.
Consideration of using the latest available graphics driver and media SDK versions, as there may be power optimizations available in driver and middleware level; also, there is a risk of power or performance regression with a new graphics driver such as potential changes to the GPU core, memory, PLL, voltage regulator settings, and over-clock (turbo) settings.
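Two of the considerations above lend themselves to a short sketch: smoothing 1 kHz rail samples with a moving average over a one-to-five-second window, and reporting the median of a three-run set. The function names are hypothetical, and the samples are assumed to be power values in watts that the DAQ software has already converted.

```python
from statistics import median

SAMPLE_RATE_HZ = 1000  # one sample every millisecond


def window_samples(window_seconds):
    """Number of samples covered by a moving-average window."""
    return int(window_seconds * SAMPLE_RATE_HZ)


def moving_average(samples_w, window):
    """Trailing moving average over `window` consecutive samples."""
    if window <= 0 or window > len(samples_w):
        raise ValueError("window must be between 1 and len(samples_w)")
    running = sum(samples_w[:window])
    averages = [running / window]
    for i in range(window, len(samples_w)):
        running += samples_w[i] - samples_w[i - window]  # slide by one sample
        averages.append(running / window)
    return averages


def reported_power(run_averages_w):
    """Median across repeated runs, to damp run-to-run variation."""
    return median(run_averages_w)
```

For a one-second window, `window_samples(1)` gives 1000 samples; the median of the three per-run averages is then taken as the reported value.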
Tools and Applications
An Example DC Power-Measurement System
Software Tools and Applications
CPU/Chipset Power: Such problems are identified by examining the CPU C-state residency to determine whether the CPU and the chipset power are optimally managed, and to get some insight into what is causing any increase in platform power consumption. For example, low residency at deep C-states such as C3 may indicate frequent C-state transitions due to device interrupts or software activity.
CPU Utilization: CPU utilization samples are commonly taken at every timer tick interrupt--i.e., every 15.6 milliseconds for most media applications and some background applications. However, the timer resolution can be shortened from the default 15.6 milliseconds in an attempt to capture activities within shorter periods. For multi-core CPUs, CPU utilization and power consumption depend on the active duration, while each core may only be active for a partial segment of the total duration for which the platform is active. Therefore, CPU core utilization and platform utilization should be counted separately. Logically, when the activities of two cores overlap, the CPU utilization is shown as the sum of two utilizations by most power measurement tools. Only a few tools, the Intel Battery Life Analyzer among them, can actually use fine-grain process information to determine the total active duration of both the platform package and the logical CPU. By investigating the CPU utilization, inefficient software components and their hotspots can be identified, and the impact of the software component and its hotspots can be determined to find optimization opportunities.
CPU Activity Frequency: Power tools can help identify software components causing frequent transitions of CPU states. It is valuable to determine the frequency of the activity of each component and the number of activities that are happening in each tick period. Understanding why the frequent transitions are happening may help point to power-related issues or improvement prospects.
GPU Power: On modern processors, as most media applications run on the GPU, it is also important to understand the impact of GPU C-state transitions and GPU utilization. GPU utilization largely controls the power consumption of media applications. However, there are only a few tools that have the ability to report GPU utilization; the Intel GPA is one such tool.
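The distinction drawn under CPU Utilization between summing per-core activity and measuring the platform's total active duration can be made concrete with interval arithmetic. The sketch below is illustrative only; the interval data is invented, and it does not reproduce how any particular tool computes its figures.

```python
def total_active_ms(intervals):
    """Length of the union of (start_ms, end_ms) activity intervals."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close out the previous merged span.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps the current span: extend it instead of double counting.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Two cores whose activity overlaps between 4 ms and 6 ms.
core0 = [(0, 6)]
core1 = [(4, 10)]
summed_ms = total_active_ms(core0) + total_active_ms(core1)
platform_ms = total_active_ms(core0 + core1)
```

Here the per-core sum is 12 ms while the platform was active for only 10 ms, which is why core utilization and platform utilization are counted separately.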