| CPU frequency and voltage scaling code in the Linux(TM) kernel |
| |
| |
| L i n u x C P U F r e q |
| |
| C P U F r e q G o v e r n o r s |
| |
| - information for users and developers - |
| |
| |
| Dominik Brodowski <linux@brodo.de> |
| some additions and corrections by Nico Golde <nico@ngolde.de> |
| |
| |
| |
| Clock scaling allows you to change the clock speed of the CPUs on the |
| fly. This is a nice method to save battery power, because the lower |
| the clock speed, the less power the CPU consumes. |
| |
| |
| Contents: |
| --------- |
| 1. What is a CPUFreq Governor? |
| |
| 2. Governors In the Linux Kernel |
| 2.1 Performance |
| 2.2 Powersave |
| 2.3 Userspace |
| 2.4 Ondemand |
| 2.5 Conservative |
| 2.6 Interactive |
| |
| 3. The Governor Interface in the CPUfreq Core |
| |
| |
| |
| 1. What Is A CPUFreq Governor? |
| ============================== |
| |
| Most cpufreq drivers (in fact, all except one, longrun) or even most |
| cpu frequency scaling algorithms only offer the CPU to be set to one |
| frequency. In order to offer dynamic frequency scaling, the cpufreq |
| core must be able to tell these drivers of a "target frequency". So |
| these specific drivers will be transformed to offer a "->target/target_index" |
| call instead of the existing "->setpolicy" call. For "longrun", all |
| stays the same, though. |
| |
| How to decide what frequency within the CPUfreq policy should be used? |
| That's done using "cpufreq governors". Two are already in this patch |
| -- they're the already existing "powersave" and "performance" which |
| set the frequency statically to the lowest or highest frequency, |
| respectively. At least two more such governors will be ready for |
| addition in the near future, but likely many more as there are various |
| different theories and models about dynamic frequency scaling |
| around. Using such a generic interface as cpufreq offers to scaling |
| governors, these can be tested extensively, and the best one can be |
| selected for each specific use. |
| |
| Basically, it's the following flow graph: |
| |
| CPU can be set to switch independently | CPU can only be set |
| within specific "limits" | to specific frequencies |
| |
| "CPUfreq policy" |
| consists of frequency limits (policy->{min,max}) |
| and CPUfreq governor to be used |
| / \ |
| / \ |
| / the cpufreq governor decides |
| / (dynamically or statically) |
| / what target_freq to set within |
| / the limits of policy->{min,max} |
| / \ |
| / \ |
| Using the ->setpolicy call, Using the ->target/target_index call, |
| the limits and the the frequency closest |
| "policy" is set. to target_freq is set. |
| It is assured that it |
| is within policy->{min,max} |
| |
| |
| 2. Governors In the Linux Kernel |
| ================================ |
| |
| 2.1 Performance |
| --------------- |
| |
| The CPUfreq governor "performance" sets the CPU statically to the |
| highest frequency within the borders of scaling_min_freq and |
| scaling_max_freq. |
| |
| |
| 2.2 Powersave |
| ------------- |
| |
| The CPUfreq governor "powersave" sets the CPU statically to the |
| lowest frequency within the borders of scaling_min_freq and |
| scaling_max_freq. |
| |
| |
| 2.3 Userspace |
| ------------- |
| |
| The CPUfreq governor "userspace" allows the user, or any userspace |
| program running with UID "root", to set the CPU to a specific frequency |
| by making a sysfs file "scaling_setspeed" available in the CPU-device |
| directory. |
| |
| |
| 2.4 Ondemand |
| ------------ |
| |
| The CPUfreq governor "ondemand" sets the CPU depending on the |
| current usage. To do this the CPU must have the capability to |
| switch the frequency very quickly. There are a number of sysfs file |
| accessible parameters: |
| |
| sampling_rate: measured in uS (10^-6 seconds), this is how often you |
| want the kernel to look at the CPU usage and to make decisions on |
| what to do about the frequency. Typically this is set to values of |
| around '10000' or more. It's default value is (cmp. with users-guide.txt): |
| transition_latency * 1000 |
| Be aware that transition latency is in ns and sampling_rate is in us, so you |
| get the same sysfs value by default. |
| Sampling rate should always get adjusted considering the transition latency |
| To set the sampling rate 750 times as high as the transition latency |
| in the bash (as said, 1000 is default), do: |
| echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) \ |
| >ondemand/sampling_rate |
| |
| sampling_rate_min: |
| The sampling rate is limited by the HW transition latency: |
| transition_latency * 100 |
| Or by kernel restrictions: |
| If CONFIG_NO_HZ_COMMON is set, the limit is 10ms fixed. |
| If CONFIG_NO_HZ_COMMON is not set or nohz=off boot parameter is used, the |
| limits depend on the CONFIG_HZ option: |
| HZ=1000: min=20000us (20ms) |
| HZ=250: min=80000us (80ms) |
| HZ=100: min=200000us (200ms) |
| The highest value of kernel and HW latency restrictions is shown and |
| used as the minimum sampling rate. |
| |
| up_threshold: defines what the average CPU usage between the samplings |
| of 'sampling_rate' needs to be for the kernel to make a decision on |
| whether it should increase the frequency. For example when it is set |
| to its default value of '95' it means that between the checking |
| intervals the CPU needs to be on average more than 95% in use to then |
| decide that the CPU frequency needs to be increased. |
| |
| ignore_nice_load: this parameter takes a value of '0' or '1'. When |
| set to '0' (its default), all processes are counted towards the |
| 'cpu utilisation' value. When set to '1', the processes that are |
| run with a 'nice' value will not count (and thus be ignored) in the |
| overall usage calculation. This is useful if you are running a CPU |
| intensive calculation on your laptop that you do not care how long it |
| takes to complete as you can 'nice' it and prevent it from taking part |
| in the deciding process of whether to increase your CPU frequency. |
| |
| sampling_down_factor: this parameter controls the rate at which the |
| kernel makes a decision on when to decrease the frequency while running |
| at top speed. When set to 1 (the default) decisions to reevaluate load |
| are made at the same interval regardless of current clock speed. But |
| when set to greater than 1 (e.g. 100) it acts as a multiplier for the |
| scheduling interval for reevaluating load when the CPU is at its top |
| speed due to high load. This improves performance by reducing the overhead |
| of load evaluation and helping the CPU stay at its top speed when truly |
| busy, rather than shifting back and forth in speed. This tunable has no |
| effect on behavior at lower speeds/lower CPU loads. |
| |
| powersave_bias: this parameter takes a value between 0 to 1000. It |
| defines the percentage (times 10) value of the target frequency that |
| will be shaved off of the target. For example, when set to 100 -- 10%, |
| when ondemand governor would have targeted 1000 MHz, it will target |
| 1000 MHz - (10% of 1000 MHz) = 900 MHz instead. This is set to 0 |
| (disabled) by default. |
| When AMD frequency sensitivity powersave bias driver -- |
| drivers/cpufreq/amd_freq_sensitivity.c is loaded, this parameter |
| defines the workload frequency sensitivity threshold in which a lower |
| frequency is chosen instead of ondemand governor's original target. |
| The frequency sensitivity is a hardware reported (on AMD Family 16h |
| Processors and above) value between 0 to 100% that tells software how |
| the performance of the workload running on a CPU will change when |
| frequency changes. A workload with sensitivity of 0% (memory/IO-bound) |
| will not perform any better on higher core frequency, whereas a |
| workload with sensitivity of 100% (CPU-bound) will perform better |
| higher the frequency. When the driver is loaded, this is set to 400 |
| by default -- for CPUs running workloads with sensitivity value below |
| 40%, a lower frequency is chosen. Unloading the driver or writing 0 |
| will disable this feature. |
| |
| |
| 2.5 Conservative |
| ---------------- |
| |
| The CPUfreq governor "conservative", much like the "ondemand" |
| governor, sets the CPU depending on the current usage. It differs in |
| behaviour in that it gracefully increases and decreases the CPU speed |
| rather than jumping to max speed the moment there is any load on the |
| CPU. This behaviour more suitable in a battery powered environment. |
| The governor is tweaked in the same manner as the "ondemand" governor |
| through sysfs with the addition of: |
| |
| freq_step: this describes what percentage steps the cpu freq should be |
| increased and decreased smoothly by. By default the cpu frequency will |
| increase in 5% chunks of your maximum cpu frequency. You can change this |
| value to anywhere between 0 and 100 where '0' will effectively lock your |
| CPU at a speed regardless of its load whilst '100' will, in theory, make |
| it behave identically to the "ondemand" governor. |
| |
| down_threshold: same as the 'up_threshold' found for the "ondemand" |
| governor but for the opposite direction. For example when set to its |
| default value of '20' it means that if the CPU usage needs to be below |
| 20% between samples to have the frequency decreased. |
| |
| sampling_down_factor: similar functionality as in "ondemand" governor. |
| But in "conservative", it controls the rate at which the kernel makes |
| a decision on when to decrease the frequency while running in any |
| speed. Load for frequency increase is still evaluated every |
| sampling rate. |
| |
| 2.6 Interactive |
| --------------- |
| |
| The CPUfreq governor "interactive" is designed for latency-sensitive, |
| interactive workloads. This governor sets the CPU speed depending on |
| usage, similar to "ondemand" and "conservative" governors, but with a |
| different set of configurable behaviors. |
| |
| The tuneable values for this governor are: |
| |
| target_loads: CPU load values used to adjust speed to influence the |
| current CPU load toward that value. In general, the lower the target |
| load, the more often the governor will raise CPU speeds to bring load |
| below the target. The format is a single target load, optionally |
| followed by pairs of CPU speeds and CPU loads to target at or above |
| those speeds. Colons can be used between the speeds and associated |
| target loads for readability. For example: |
| |
| 85 1000000:90 1700000:99 |
| |
| targets CPU load 85% below speed 1GHz, 90% at or above 1GHz, until |
| 1.7GHz and above, at which load 99% is targeted. If speeds are |
| specified these must appear in ascending order. Higher target load |
| values are typically specified for higher speeds, that is, target load |
| values also usually appear in an ascending order. The default is |
| target load 90% for all speeds. |
| |
| min_sample_time: The minimum amount of time to spend at the current |
| frequency before ramping down. Default is 80000 uS. |
| |
| hispeed_freq: An intermediate "hi speed" at which to initially ramp |
| when CPU load hits the value specified in go_hispeed_load. If load |
| stays high for the amount of time specified in above_hispeed_delay, |
| then speed may be bumped higher. Default is the maximum speed |
| allowed by the policy at governor initialization time. |
| |
| go_hispeed_load: The CPU load at which to ramp to hispeed_freq. |
| Default is 99%. |
| |
| above_hispeed_delay: When speed is at or above hispeed_freq, wait for |
| this long before raising speed in response to continued high load. |
| The format is a single delay value, optionally followed by pairs of |
| CPU speeds and the delay to use at or above those speeds. Colons can |
| be used between the speeds and associated delays for readability. For |
| example: |
| |
| 80000 1300000:200000 1500000:40000 |
| |
| uses delay 80000 uS until CPU speed 1.3 GHz, at which speed delay |
| 200000 uS is used until speed 1.5 GHz, at which speed (and above) |
| delay 40000 uS is used. If speeds are specified these must appear in |
| ascending order. Default is 20000 uS. |
| |
| timer_rate: Sample rate for reevaluating CPU load when the CPU is not |
| idle. A deferrable timer is used, such that the CPU will not be woken |
| from idle to service this timer until something else needs to run. |
| (The maximum time to allow deferring this timer when not running at |
| minimum speed is configurable via timer_slack.) Default is 20000 uS. |
| |
| timer_slack: Maximum additional time to defer handling the governor |
| sampling timer beyond timer_rate when running at speeds above the |
| minimum. For platforms that consume additional power at idle when |
| CPUs are running at speeds greater than minimum, this places an upper |
| bound on how long the timer will be deferred prior to re-evaluating |
| load and dropping speed. For example, if timer_rate is 20000uS and |
| timer_slack is 10000uS then timers will be deferred for up to 30msec |
| when not at lowest speed. A value of -1 means defer timers |
| indefinitely at all speeds. Default is 80000 uS. |
| |
| boost: If non-zero, immediately boost speed of all CPUs to at least |
| hispeed_freq until zero is written to this attribute. If zero, allow |
| CPU speeds to drop below hispeed_freq according to load as usual. |
| Default is zero. |
| |
| boostpulse: On each write, immediately boost speed of all CPUs to |
| hispeed_freq for at least the period of time specified by |
| boostpulse_duration, after which speeds are allowed to drop below |
| hispeed_freq according to load as usual. |
| |
| boostpulse_duration: Length of time to hold CPU speed at hispeed_freq |
| on a write to boostpulse, before allowing speed to drop according to |
| load as usual. Default is 80000 uS. |
| |
| |
| 3. The Governor Interface in the CPUfreq Core |
| ============================================= |
| |
| A new governor must register itself with the CPUfreq core using |
| "cpufreq_register_governor". The struct cpufreq_governor, which has to |
| be passed to that function, must contain the following values: |
| |
| governor->name - A unique name for this governor |
| governor->governor - The governor callback function |
| governor->owner - .THIS_MODULE for the governor module (if |
| appropriate) |
| |
| The governor->governor callback is called with the current (or to-be-set) |
| cpufreq_policy struct for that CPU, and an unsigned int event. The |
| following events are currently defined: |
| |
| CPUFREQ_GOV_START: This governor shall start its duty for the CPU |
| policy->cpu |
| CPUFREQ_GOV_STOP: This governor shall end its duty for the CPU |
| policy->cpu |
| CPUFREQ_GOV_LIMITS: The limits for CPU policy->cpu have changed to |
| policy->min and policy->max. |
| |
| If you need other "events" externally of your driver, _only_ use the |
| cpufreq_governor_l(unsigned int cpu, unsigned int event) call to the |
| CPUfreq core to ensure proper locking. |
| |
| |
| The CPUfreq governor may call the CPU processor driver using one of |
| these two functions: |
| |
| int cpufreq_driver_target(struct cpufreq_policy *policy, |
| unsigned int target_freq, |
| unsigned int relation); |
| |
| int __cpufreq_driver_target(struct cpufreq_policy *policy, |
| unsigned int target_freq, |
| unsigned int relation); |
| |
| target_freq must be within policy->min and policy->max, of course. |
| What's the difference between these two functions? When your governor |
| still is in a direct code path of a call to governor->governor, the |
| per-CPU cpufreq lock is still held in the cpufreq core, and there's |
| no need to lock it again (in fact, this would cause a deadlock). So |
| use __cpufreq_driver_target only in these cases. In all other cases |
| (for example, when there's a "daemonized" function that wakes up |
| every second), use cpufreq_driver_target to lock the cpufreq per-CPU |
| lock before the command is passed to the cpufreq processor driver. |
| |