| |
| Cgroup unified hierarchy |
| |
| April, 2014 Tejun Heo <tj@kernel.org> |
| |
| This document describes the changes made by unified hierarchy and |
| their rationales. It will eventually be merged into the main cgroup |
| documentation. |
| |
| CONTENTS |
| |
| 1. Background |
| 2. Basic Operation |
| 2-1. Mounting |
| 2-2. cgroup.subtree_control |
| 2-3. cgroup.controllers |
| 3. Structural Constraints |
| 3-1. Top-down |
| 3-2. No internal tasks |
| 4. Other Changes |
| 4-1. [Un]populated Notification |
| 4-2. Other Core Changes |
| 4-3. Per-Controller Changes |
| 4-3-1. blkio |
| 4-3-2. cpuset |
| 4-3-3. memory |
| 5. Planned Changes |
| 5-1. CAP for resource control |
| |
| |
| 1. Background |
| |
| cgroup allows an arbitrary number of hierarchies and each hierarchy |
| can host any number of controllers. While this seems to provide a |
| high level of flexibility, it isn't very useful in practice. |
| |
| For example, as there is only one instance of each controller, utility |
| type controllers such as freezer which can be useful in all |
| hierarchies can only be used in one. The issue is exacerbated by the |
| fact that controllers can't be moved around once hierarchies are |
| populated. Another issue is that all controllers bound to a hierarchy |
| are forced to have exactly the same view of the hierarchy. It isn't |
| possible to vary the granularity depending on the specific controller. |
| |
| In practice, these issues heavily limit which controllers can be put |
| on the same hierarchy and most configurations resort to putting each |
| controller on its own hierarchy. Only closely related ones, such as |
| the cpu and cpuacct controllers, make sense to put on the same |
| hierarchy. This often means that userland ends up managing multiple |
| similar hierarchies repeating the same steps on each hierarchy |
| whenever a hierarchy management operation is necessary. |
| |
| Unfortunately, support for multiple hierarchies comes at a steep cost. |
| The internal implementation in cgroup core proper is dazzlingly |
| complicated but, more importantly, the support for multiple |
| hierarchies restricts how cgroup is used in general and what |
| controllers can do. |
| |
| There's no limit on how many hierarchies there may be, which means |
| that a task's cgroup membership can't be described in finite length. |
| The key identifying it may contain any number of entries and is |
| unlimited in length, which makes it highly awkward to handle and |
| leads to the addition of controllers which exist only to identify |
| membership, which in turn exacerbates the original problem. |
| |
| Also, as a controller can't have any expectation regarding what shape |
| of hierarchies other controllers would be on, each controller has to |
| assume that all other controllers are operating on completely |
| orthogonal hierarchies. This makes it impossible, or at least very |
| cumbersome, for controllers to cooperate with each other. |
| |
| In most use cases, putting controllers on hierarchies which are |
| completely orthogonal to each other isn't necessary. What usually is |
| called for is the ability to have differing levels of granularity |
| depending on the specific controller. In other words, hierarchy may |
| be collapsed from leaf towards root when viewed from specific |
| controllers. For example, a given configuration might not care about |
| how memory is distributed beyond a certain level while still wanting |
| to control how CPU cycles are distributed. |
| |
| Unified hierarchy is the next version of cgroup interface. It aims to |
| address the aforementioned issues by having more structure while |
| retaining enough flexibility for most use cases. Various other |
| general and controller-specific interface issues are also addressed in |
| the process. |
| |
| |
| 2. Basic Operation |
| |
| 2-1. Mounting |
| |
| Currently, unified hierarchy can be mounted with the following mount |
| command. Note that this is still under development and scheduled to |
| change soon. |
| |
|   mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT |
| |
| All controllers which are not bound to other hierarchies are |
| automatically bound to unified hierarchy and show up at the root of |
| it. Controllers which are enabled only in the root of unified |
| hierarchy can be bound to other hierarchies at any time. This allows |
| mixing unified hierarchy with the traditional multiple hierarchies in |
| a fully backward compatible way. |
| |
| |
| 2-2. cgroup.subtree_control |
| |
| All cgroups on unified hierarchy have a "cgroup.subtree_control" file |
| which governs which controllers are enabled on the children of the |
| cgroup. Let's assume a hierarchy like the following. |
| |
|   root - A - B - C |
|                \ D |
| |
| root's "cgroup.subtree_control" file determines which controllers are |
| enabled on A, A's on B, and B's on C and D. This coincides with the |
| fact that controllers on the immediate sub-level are used to |
| distribute the resources of the parent. In fact, it's natural to |
| assume that resource control knobs of a child belong to its parent. |
| Enabling a controller in a "cgroup.subtree_control" file declares that |
| distribution of the respective resources of the cgroup will be |
| controlled. Note that this means that controller enable states are |
| shared among siblings. |
| |
| When read, the file contains a space-separated list of currently |
| enabled controllers. A write to the file should contain a |
| space-separated list of controllers with '+' or '-' prefixed (without |
| the quotes). Controllers prefixed with '+' are enabled and '-' |
| disabled. If a controller is listed multiple times, the last entry |
| wins. The specified operations are executed atomically - either all |
| succeed or all fail. |
| |
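The write format described above can be sketched as a pair of small
shell helpers. This is only an illustration: the function names and the
example paths are hypothetical, and the helpers do nothing beyond
building the space-separated string and reading the file back.

```shell
#!/bin/sh
# Hypothetical helper, not part of any cgroup tooling.
# $1 = cgroup directory; remaining args = controller names that already
# carry their '+' or '-' prefix.  Everything goes out in a single write
# so that the whole set is applied (or rejected) together.
set_subtree_control() {
    cgrp=$1
    shift
    # "$*" joins the remaining arguments with single spaces.
    printf '%s\n' "$*" > "$cgrp/cgroup.subtree_control"
}

# Hypothetical helper: print the controllers currently enabled for the
# children of cgroup directory $1, one per line.
show_subtree_control() {
    tr ' ' '\n' < "$1/cgroup.subtree_control" | sed '/^$/d'
}

# Example (the mount point is an assumption):
#   set_subtree_control /sys/fs/cgroup/A +memory -blkio
#   show_subtree_control /sys/fs/cgroup/A
```

Sending all entries in one write, rather than one write per
controller, is what lets the all-or-nothing behavior apply to the
whole set.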
| |
| 2-3. cgroup.controllers |
| |
| Read-only "cgroup.controllers" file contains a space-separated list of |
| controllers which can be enabled in the cgroup's |
| "cgroup.subtree_control" file. |
| |
| In the root cgroup, this lists controllers which are not bound to |
| other hierarchies and the content changes as controllers are bound to |
| and unbound from other hierarchies. |
| |
| In non-root cgroups, the content of this file equals that of the |
| parent's "cgroup.subtree_control" file as only controllers enabled |
| from the parent can be used in its children. |
| |
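A script may want to check "cgroup.controllers" before attempting to
enable a controller. The following is a minimal sketch; the function
name is hypothetical and it only performs an exact-match lookup in the
space-separated list.

```shell
#!/bin/sh
# Hypothetical helper: succeed if controller $2 is listed in the
# "cgroup.controllers" file of cgroup directory $1.
controller_available() {
    # Split the space-separated list into lines, then look for an exact,
    # fixed-string match of the whole line.
    tr ' ' '\n' < "$1/cgroup.controllers" | grep -qxF -- "$2"
}

# Example (the mount point is an assumption):
#   controller_available /sys/fs/cgroup/A memory && echo "memory usable in A"
```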
| |
| 3. Structural Constraints |
| |
| 3-1. Top-down |
| |
| As it doesn't make sense to nest control of an uncontrolled resource, |
| all non-root "cgroup.subtree_control" files can only contain |
| controllers which are enabled in the parent's "cgroup.subtree_control" |
| file. A controller can be enabled only if the parent has the |
| controller enabled and a controller can't be disabled if one or more |
| children have it enabled. |
| |
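Because of the top-down rule, enabling a controller on a deep cgroup
means enabling it at every level above it first. A sketch of that walk
follows. The function name and arguments are hypothetical, and on a
real hierarchy a write can still fail for other reasons, such as the
"no internal tasks" constraint described in the next section.

```shell
#!/bin/sh
# Hypothetical helper: make controller $3 available on cgroup $2, given
# as a path relative to the unified hierarchy mount point $1, by
# enabling it in the "cgroup.subtree_control" file of every ancestor,
# starting from the root.
enable_top_down() {
    dir=$1
    rest=$2
    ctrl=$3
    while [ -n "$rest" ]; do
        # Peel off the first path component.
        comp=${rest%%/*}
        case $rest in
            */*) rest=${rest#*/} ;;
            *)   rest= ;;
        esac
        [ -n "$comp" ] || continue
        # Enable the controller for the children of the current level
        # before descending into the next one.
        printf '+%s\n' "$ctrl" > "$dir/cgroup.subtree_control" || return 1
        dir=$dir/$comp
    done
}

# Example (the mount point is an assumption):
#   enable_top_down /sys/fs/cgroup A/B memory
```

Disabling would have to proceed in the opposite order, from the
deepest level upwards, since a controller can't be disabled while a
child still has it enabled.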
| |
| 3-2. No internal tasks |
| |
| One long-standing issue that cgroup faces is the competition between |
| tasks belonging to the parent cgroup and its children cgroups. This |
| is inherently nasty as two different types of entities compete and |
| there is no agreed-upon obvious way to handle it. Different |
| controllers are doing different things. |
| |
| The cpu controller considers tasks and cgroups as equivalents and maps |
| nice levels to cgroup weights. This works for some cases but falls |
| flat when children should be allocated specific ratios of CPU cycles |
| and the number of internal tasks fluctuates - the ratios constantly |
| change as the number of competing entities fluctuates. There also are |
| other issues. The mapping from nice level to weight isn't obvious or |
| universal, and there are various other knobs which simply aren't |
| available for tasks. |
| |
| The blkio controller implicitly creates a hidden leaf node for each |
| cgroup to host the tasks. The hidden leaf has its own copies of all |
| the knobs with "leaf_" prefixed. While this allows equivalent control |
| over internal tasks, it has serious drawbacks. It always adds an |
| extra layer of nesting which may not be necessary, makes the interface |
| messy and significantly complicates the implementation. |
| |
| The memory controller currently doesn't have a way to control what |
| happens between internal tasks and child cgroups and the behavior is |
| not clearly defined. There have been attempts to add ad-hoc behaviors |
| and knobs to tailor the behavior to specific workloads. Continuing |
| this direction will lead to problems which will be extremely difficult |
| to resolve in the long term. |
| |
| Multiple controllers struggle with internal tasks and have come up |
| with different ways to deal with them; unfortunately, all the |
| approaches in use now are severely flawed and, furthermore, the |
| widely different behaviors make cgroup as a whole highly |
| inconsistent. |
| |
| It is clear that this is something which needs to be addressed from |
| cgroup core proper in a uniform way so that controllers don't need to |
| worry about it and cgroup as a whole shows a consistent and logical |
| behavior. To achieve that, unified hierarchy enforces the following |
| structural constraint: |
| |
| Except for the root, only cgroups which don't contain any task may |
| have controllers enabled in their "cgroup.subtree_control" files. |
| |
| Combined with other properties, this guarantees that, when a |
| controller is looking at the part of the hierarchy which has it |
| enabled, tasks are always only on the leaves. This rules out |
| situations where child cgroups compete against internal tasks of the |
| parent. |
| |
| There are two things to note. Firstly, the root cgroup is exempt from |
| the restriction. Root contains tasks and anonymous resource |
| consumption which can't be associated with any other cgroup and |
| requires special treatment from most controllers. How resource |
| consumption in the root cgroup is governed is up to each controller. |
| |
| Secondly, the restriction doesn't take effect if there is no enabled |
| controller in the cgroup's "cgroup.subtree_control" file. This is |
| important as otherwise it wouldn't be possible to create children of a |
| populated cgroup. To control resource distribution of a cgroup, the |
| cgroup must create children and transfer all its tasks to the children |
| before enabling controllers in its "cgroup.subtree_control" file. |
| |
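The sequence in the last paragraph (create a child, move every task
into it, then enable controllers) might look like the sketch below.
The function name and the child name are hypothetical. It also assumes
that no process in the cgroup forks while the loop runs; a robust
version would re-read "cgroup.procs" until it is empty.

```shell
#!/bin/sh
# Hypothetical helper.
# $1 = cgroup directory, $2 = name of the child cgroup to create,
# remaining args = controllers to enable for the children of $1.
push_tasks_to_leaf() {
    cgrp=$1
    leaf=$1/$2
    shift 2
    mkdir -p "$leaf" || return 1
    # Migrate one process per write; a pid that has exited in the
    # meantime makes the write fail and is simply skipped.
    for pid in $(cat "$cgrp/cgroup.procs"); do
        echo "$pid" > "$leaf/cgroup.procs" 2>/dev/null || true
    done
    # Build a single "+a +b" string so all controllers go in one write.
    spec=
    for ctrl in "$@"; do
        spec="$spec${spec:+ }+$ctrl"
    done
    printf '%s\n' "$spec" > "$cgrp/cgroup.subtree_control"
}

# Example (the mount point and the child name are assumptions):
#   push_tasks_to_leaf /sys/fs/cgroup/A leaf memory
```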
| |
| 4. Other Changes |
| |
| 4-1. [Un]populated Notification |
| |
| cgroup users often need a way to determine when a cgroup's |
| subhierarchy becomes empty so that it can be cleaned up. cgroup |
| currently provides release_agent for it; unfortunately, this mechanism |
| is riddled with issues. |
| |
| - It delivers events by forking and execing a userland binary |
| specified as the release_agent. This is a long deprecated method of |
| notification delivery. It's extremely heavy, slow and cumbersome to |
| integrate with larger infrastructure. |
| |
| - There is a single monitoring point at the root. There's no way to |
| delegate management of a subtree. |
| |
| - The event isn't recursive. It triggers when a cgroup doesn't have |
| any tasks or child cgroups. Events for internal nodes trigger only |
| after all children are removed. This again makes it impossible to |
| delegate management of a subtree. |
| |
| - Events are filtered from the kernel side. A "notify_on_release" |
| file is used to subscribe to or suppress release events. This is |
| unnecessarily complicated and probably done this way because event |
| delivery itself was expensive. |
| |
| Unified hierarchy implements an interface file "cgroup.populated" |
| which can be used to monitor whether the cgroup's subhierarchy has |
| tasks in it or not. Its value is 0 if there is no task in the cgroup |
| and its descendants; otherwise, 1. poll and [id]notify events are |
| triggered when the value changes. |
| |
| This is significantly lighter and simpler and trivially allows |
| delegating management of a subhierarchy - a subhierarchy monitor can |
| block further propagation simply by putting itself or another process |
| in the subhierarchy and monitoring the events that it's interested in |
| from there without interfering with monitoring higher in the tree. |
| |
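A cleanup script could watch "cgroup.populated" along the following
lines. This is a rough sketch with several assumptions: the function
names are hypothetical, inotifywait comes from the separate
inotify-tools package and may not be installed, and the sketch assumes
that the value change surfaces as a "modify" inotify event. To stay
safe if either assumption is wrong, the loop re-reads the file at
least once per second.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the subhierarchy rooted at cgroup
# directory $1 contains at least one task.
is_populated() {
    [ "$(cat "$1/cgroup.populated")" = "1" ]
}

# Hypothetical helper: block until the subhierarchy rooted at $1 has no
# tasks left.
wait_until_empty() {
    while is_populated "$1"; do
        if command -v inotifywait >/dev/null 2>&1; then
            # Wake up on a notification, or after one second to re-check.
            inotifywait -qq -t 1 -e modify "$1/cgroup.populated" || true
        else
            # No inotify-tools available: fall back to plain polling.
            sleep 1
        fi
    done
}

# Example (the mount point is an assumption):
#   wait_until_empty /sys/fs/cgroup/A && rmdir /sys/fs/cgroup/A
```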
| In unified hierarchy, the release_agent mechanism is no longer |
| supported and the interface files "release_agent" and |
| "notify_on_release" do not exist. |
| |
| |
| 4-2. Other Core Changes |
| |
| - None of the mount options is allowed. |
| |
| - remount is disallowed. |
| |
| - rename(2) is disallowed. |
| |
| - The "tasks" file is removed. Everything should be at process |
| granularity. Use the "cgroup.procs" file instead. |
| |
| - The "cgroup.procs" file is not sorted. pids will be unique unless |
| they got recycled in-between reads. |
| |
| - The "cgroup.clone_children" file is removed. |
| |
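Since "cgroup.procs" is no longer sorted and may repeat a pid that was
recycled between reads, a consumer that wants the old ordering has to
produce it itself. A minimal sketch, with a hypothetical function name:

```shell
#!/bin/sh
# Hypothetical helper: print the pids in cgroup directory $1 in
# ascending numeric order, with duplicates removed.
list_procs() {
    sort -n -u "$1/cgroup.procs"
}

# Example (the mount point is an assumption):
#   list_procs /sys/fs/cgroup/A/leaf
```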
| |
| 4-3. Per-Controller Changes |
| |
| 4-3-1. blkio |
| |
| - blk-throttle becomes properly hierarchical. |
| |
| |
| 4-3-2. cpuset |
| |
| - Tasks are kept in empty cpusets after hotplug and take on the masks |
| of the nearest non-empty ancestor, instead of being moved to it. |
| |
| - A task can be moved into an empty cpuset, and again it takes on the |
| masks of the nearest non-empty ancestor. |
| |
| |
| 4-3-3. memory |
| |
| - use_hierarchy is on by default and the cgroup file for the flag is |
| not created. |
| |
| |
| 5. Planned Changes |
| |
| 5-1. CAP for resource control |
| |
| Unified hierarchy will require one of the capabilities(7), which is |
| yet to be decided, for all resource control related knobs. Process |
| organization operations - creation of sub-cgroups and migration of |
| processes in sub-hierarchies - may be delegated by changing the |
| ownership and/or permissions on the cgroup directory and |
| "cgroup.procs" interface file; however, all operations which affect |
| resource control - writes to a "cgroup.subtree_control" file or any |
| controller-specific knobs - will require an explicit CAP privilege. |
| |
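The ownership-based delegation mentioned above could be done as in the
sketch below. The function name is hypothetical. It covers only the
process organization side; "cgroup.subtree_control" and the
controller-specific knobs are deliberately left with their original
owner.

```shell
#!/bin/sh
# Hypothetical helper: let user $2 create sub-cgroups under cgroup
# directory $1 and migrate processes within it, by handing over the
# directory and its "cgroup.procs" file only.
delegate_subtree() {
    chown "$2" "$1" "$1/cgroup.procs"
}

# Example (the mount point and the user name are assumptions):
#   delegate_subtree /sys/fs/cgroup/A someuser
```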
| This, in part, is to prevent the cgroup interface from being |
| inadvertently promoted to a programmable API used by non-privileged |
| binaries. cgroup exposes various aspects of the system in ways which |
| aren't properly abstracted for direct consumption by regular programs. |
| This is an administration interface much closer to sysctl knobs than |
| system calls. Even the basic access model, being filesystem path |
| based, isn't suitable for direct consumption. There's no way to |
| access "my cgroup" in a race-free way or make multiple operations |
| atomic against migration to another cgroup. |
| |
| Another aspect is that, for better or for worse, the cgroup interface |
| goes through far less scrutiny than regular interfaces for |
| unprivileged userland. The upside is that cgroup is able to expose |
| useful features which may not be suitable for general consumption in a |
| reasonable time frame. It provides a relatively short path between |
| internal details and userland-visible interface. Of course, this |
| shortcut comes with high risk. We go through what we go through for |
| general kernel APIs for good reasons. It may end up leaking internal |
| details in a way which can exert significant pain by locking the |
| kernel into a contract that can't be maintained in a reasonable |
| manner. |
| |
| Also, due to its specific nature, cgroup and its controllers don't |
| tend to attract attention from a wide scope of developers. cgroup's |
| short history is already fraught with severely mis-designed |
| interfaces, unnecessary commitments to and exposure of internal |
| details, and broken and dangerous implementations of various |
| features. |
| |
| Keeping cgroup as an administration interface is both advantageous for |
| its role and imperative given its nature. Some of the cgroup features |
| may make sense for unprivileged access. If deemed justified, those |
| must be further abstracted and implemented as a different interface, |
| be it a system call or process-private filesystem, and survive through |
| the scrutiny that any interface for general consumption is required to |
| go through. |
| |
| Requiring CAP is not a complete solution but should serve as a |
| significant deterrent against spraying cgroup usages in non-privileged |
| programs. |