|  | Scaling in the Linux Networking Stack | 
|  |  | 
|  |  | 
|  | Introduction | 
|  | ============ | 
|  |  | 
|  | This document describes a set of complementary techniques in the Linux | 
|  | networking stack to increase parallelism and improve performance for | 
|  | multi-processor systems. | 
|  |  | 
|  | The following technologies are described: | 
|  |  | 
|  | RSS: Receive Side Scaling | 
|  | RPS: Receive Packet Steering | 
|  | RFS: Receive Flow Steering | 
|  | Accelerated Receive Flow Steering | 
|  | XPS: Transmit Packet Steering | 
|  |  | 
|  |  | 
|  | RSS: Receive Side Scaling | 
|  | ========================= | 
|  |  | 
|  | Contemporary NICs support multiple receive and transmit descriptor queues | 
|  | (multi-queue). On reception, a NIC can send different packets to different | 
|  | queues to distribute processing among CPUs. The NIC distributes packets by | 
|  | applying a filter to each packet that assigns it to one of a small number | 
|  | of logical flows. Packets for each flow are steered to a separate receive | 
|  | queue, which in turn can be processed by separate CPUs. This mechanism is | 
|  | generally known as “Receive-side Scaling” (RSS). The goal of RSS and | 
|  | the other scaling techniques is to increase performance uniformly. | 
|  | Multi-queue distribution can also be used for traffic prioritization, but | 
|  | that is not the focus of these techniques. | 
|  |  | 
|  | The filter used in RSS is typically a hash function over the network | 
|  | and/or transport layer headers-- for example, a 4-tuple hash over | 
|  | IP addresses and TCP ports of a packet. The most common hardware | 
|  | implementation of RSS uses a 128-entry indirection table where each entry | 
|  | stores a queue number. The receive queue for a packet is determined | 
|  | by masking out the low order seven bits of the computed hash for the | 
|  | packet (usually a Toeplitz hash), taking this number as a key into the | 
|  | indirection table and reading the corresponding value. | 
|  |  | 
|  | Some advanced NICs allow steering packets to queues based on | 
|  | programmable filters. For example, webserver bound TCP port 80 packets | 
|  | can be directed to their own receive queue. Such “n-tuple” filters can | 
|  | be configured from ethtool (--config-ntuple). | 
|  |  | 
|  | ==== RSS Configuration | 
|  |  | 
|  | The driver for a multi-queue capable NIC typically provides a kernel | 
|  | module parameter for specifying the number of hardware queues to | 
|  | configure. In the bnx2x driver, for instance, this parameter is called | 
|  | num_queues. A typical RSS configuration would be to have one receive queue | 
|  | for each CPU if the device supports enough queues, or otherwise at least | 
|  | one for each memory domain, where a memory domain is a set of CPUs that | 
|  | share a particular memory level (L1, L2, NUMA node, etc.). | 
|  |  | 
|  | The indirection table of an RSS device, which resolves a queue by masked | 
|  | hash, is usually programmed by the driver at initialization. The | 
|  | default mapping is to distribute the queues evenly in the table, but the | 
|  | indirection table can be retrieved and modified at runtime using ethtool | 
|  | commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the | 
|  | indirection table could be done to give different queues different | 
|  | relative weights. | 
|  |  | 
|  | == RSS IRQ Configuration | 
|  |  | 
|  | Each receive queue has a separate IRQ associated with it. The NIC triggers | 
|  | this to notify a CPU when new packets arrive on the given queue. The | 
|  | signaling path for PCIe devices uses message signaled interrupts (MSI-X), | 
|  | that can route each interrupt to a particular CPU. The active mapping | 
|  | of queues to IRQs can be determined from /proc/interrupts. By default, | 
|  | an IRQ may be handled on any CPU. Because a non-negligible part of packet | 
|  | processing takes place in receive interrupt handling, it is advantageous | 
|  | to spread receive interrupts between CPUs. To manually adjust the IRQ | 
|  | affinity of each interrupt see Documentation/IRQ-affinity.txt. Some systems | 
|  | will be running irqbalance, a daemon that dynamically optimizes IRQ | 
|  | assignments and as a result may override any manual settings. | 
|  |  | 
|  | == Suggested Configuration | 
|  |  | 
|  | RSS should be enabled when latency is a concern or whenever receive | 
|  | interrupt processing forms a bottleneck. Spreading load between CPUs | 
|  | decreases queue length. For low latency networking, the optimal setting | 
|  | is to allocate as many queues as there are CPUs in the system (or the | 
|  | NIC maximum, if lower). The most efficient high-rate configuration | 
|  | is likely the one with the smallest number of receive queues where no | 
|  | receive queue overflows due to a saturated CPU, because in default | 
|  | mode with interrupt coalescing enabled, the aggregate number of | 
|  | interrupts (and thus work) grows with each additional queue. | 
|  |  | 
|  | Per-cpu load can be observed using the mpstat utility, but note that on | 
|  | processors with hyperthreading (HT), each hyperthread is represented as | 
|  | a separate CPU. For interrupt handling, HT has shown no benefit in | 
|  | initial tests, so limit the number of queues to the number of CPU cores | 
|  | in the system. | 
|  |  | 
|  |  | 
|  | RPS: Receive Packet Steering | 
|  | ============================ | 
|  |  | 
|  | Receive Packet Steering (RPS) is logically a software implementation of | 
|  | RSS. Being in software, it is necessarily called later in the datapath. | 
|  | Whereas RSS selects the queue and hence CPU that will run the hardware | 
|  | interrupt handler, RPS selects the CPU to perform protocol processing | 
|  | above the interrupt handler. This is accomplished by placing the packet | 
|  | on the desired CPU’s backlog queue and waking up the CPU for processing. | 
|  | RPS has some advantages over RSS: 1) it can be used with any NIC, | 
|  | 2) software filters can easily be added to hash over new protocols, | 
|  | 3) it does not increase hardware device interrupt rate (although it does | 
|  | introduce inter-processor interrupts (IPIs)). | 
|  |  | 
|  | RPS is called during bottom half of the receive interrupt handler, when | 
|  | a driver sends a packet up the network stack with netif_rx() or | 
|  | netif_receive_skb(). These call the get_rps_cpu() function, which | 
|  | selects the queue that should process a packet. | 
|  |  | 
|  | The first step in determining the target CPU for RPS is to calculate a | 
|  | flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash | 
|  | depending on the protocol). This serves as a consistent hash of the | 
|  | associated flow of the packet. The hash is either provided by hardware | 
|  | or will be computed in the stack. Capable hardware can pass the hash in | 
|  | the receive descriptor for the packet; this would usually be the same | 
|  | hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in | 
|  | skb->rx_hash and can be used elsewhere in the stack as a hash of the | 
|  | packet’s flow. | 
|  |  | 
|  | Each receive hardware queue has an associated list of CPUs to which | 
|  | RPS may enqueue packets for processing. For each received packet, | 
|  | an index into the list is computed from the flow hash modulo the size | 
|  | of the list. The indexed CPU is the target for processing the packet, | 
|  | and the packet is queued to the tail of that CPU’s backlog queue. At | 
|  | the end of the bottom half routine, IPIs are sent to any CPUs for which | 
|  | packets have been queued to their backlog queue. The IPI wakes backlog | 
|  | processing on the remote CPU, and any queued packets are then processed | 
|  | up the networking stack. | 
|  |  | 
|  | ==== RPS Configuration | 
|  |  | 
|  | RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on | 
|  | by default for SMP). Even when compiled in, RPS remains disabled until | 
|  | explicitly configured. The list of CPUs to which RPS may forward traffic | 
|  | can be configured for each receive queue using a sysfs file entry: | 
|  |  | 
|  | /sys/class/net/<dev>/queues/rx-<n>/rps_cpus | 
|  |  | 
|  | This file implements a bitmap of CPUs. RPS is disabled when it is zero | 
|  | (the default), in which case packets are processed on the interrupting | 
|  | CPU. Documentation/IRQ-affinity.txt explains how CPUs are assigned to | 
|  | the bitmap. | 
|  |  | 
|  | == Suggested Configuration | 
|  |  | 
|  | For a single queue device, a typical RPS configuration would be to set | 
|  | the rps_cpus to the CPUs in the same memory domain of the interrupting | 
|  | CPU. If NUMA locality is not an issue, this could also be all CPUs in | 
|  | the system. At high interrupt rate, it might be wise to exclude the | 
|  | interrupting CPU from the map since that already performs much work. | 
|  |  | 
|  | For a multi-queue system, if RSS is configured so that a hardware | 
|  | receive queue is mapped to each CPU, then RPS is probably redundant | 
|  | and unnecessary. If there are fewer hardware queues than CPUs, then | 
|  | RPS might be beneficial if the rps_cpus for each queue are the ones that | 
|  | share the same memory domain as the interrupting CPU for that queue. | 
|  |  | 
|  | ==== RPS Flow Limit | 
|  |  | 
|  | RPS scales kernel receive processing across CPUs without introducing | 
|  | reordering. The trade-off to sending all packets from the same flow | 
|  | to the same CPU is CPU load imbalance if flows vary in packet rate. | 
|  | In the extreme case a single flow dominates traffic. Especially on | 
|  | common server workloads with many concurrent connections, such | 
|  | behavior indicates a problem such as a misconfiguration or spoofed | 
|  | source Denial of Service attack. | 
|  |  | 
|  | Flow Limit is an optional RPS feature that prioritizes small flows | 
|  | during CPU contention by dropping packets from large flows slightly | 
|  | ahead of those from small flows. It is active only when an RPS or RFS | 
|  | destination CPU approaches saturation.  Once a CPU's input packet | 
|  | queue exceeds half the maximum queue length (as set by sysctl | 
|  | net.core.netdev_max_backlog), the kernel starts a per-flow packet | 
|  | count over the last 256 packets. If a flow exceeds a set ratio (by | 
|  | default, half) of these packets when a new packet arrives, then the | 
|  | new packet is dropped. Packets from other flows are still only | 
|  | dropped once the input packet queue reaches netdev_max_backlog. | 
|  | No packets are dropped when the input packet queue length is below | 
|  | the threshold, so flow limit does not sever connections outright: | 
|  | even large flows maintain connectivity. | 
|  |  | 
|  | == Interface | 
|  |  | 
|  | Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not | 
|  | turned on. It is implemented for each CPU independently (to avoid lock | 
|  | and cache contention) and toggled per CPU by setting the relevant bit | 
|  | in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU | 
|  | bitmap interface as rps_cpus (see above) when called from procfs: | 
|  |  | 
|  | /proc/sys/net/core/flow_limit_cpu_bitmap | 
|  |  | 
|  | Per-flow rate is calculated by hashing each packet into a hashtable | 
|  | bucket and incrementing a per-bucket counter. The hash function is | 
|  | the same that selects a CPU in RPS, but as the number of buckets can | 
|  | be much larger than the number of CPUs, flow limit has finer-grained | 
|  | identification of large flows and fewer false positives. The default | 
|  | table has 4096 buckets. This value can be modified through sysctl | 
|  |  | 
|  | net.core.flow_limit_table_len | 
|  |  | 
|  | The value is only consulted when a new table is allocated. Modifying | 
|  | it does not update active tables. | 
|  |  | 
|  | == Suggested Configuration | 
|  |  | 
|  | Flow limit is useful on systems with many concurrent connections, | 
|  | where a single connection taking up 50% of a CPU indicates a problem. | 
|  | In such environments, enable the feature on all CPUs that handle | 
|  | network rx interrupts (as set in /proc/irq/N/smp_affinity). | 
|  |  | 
|  | The feature depends on the input packet queue length to exceed | 
|  | the flow limit threshold (50%) + the flow history length (256). | 
|  | Setting net.core.netdev_max_backlog to either 1000 or 10000 | 
|  | performed well in experiments. | 
|  |  | 
|  |  | 
|  | RFS: Receive Flow Steering | 
|  | ========================== | 
|  |  | 
|  | While RPS steers packets solely based on hash, and thus generally | 
|  | provides good load distribution, it does not take into account | 
|  | application locality. This is accomplished by Receive Flow Steering | 
|  | (RFS). The goal of RFS is to increase datacache hitrate by steering | 
|  | kernel processing of packets to the CPU where the application thread | 
|  | consuming the packet is running. RFS relies on the same RPS mechanisms | 
|  | to enqueue packets onto the backlog of another CPU and to wake up that | 
|  | CPU. | 
|  |  | 
|  | In RFS, packets are not forwarded directly by the value of their hash, | 
|  | but the hash is used as index into a flow lookup table. This table maps | 
|  | flows to the CPUs where those flows are being processed. The flow hash | 
|  | (see RPS section above) is used to calculate the index into this table. | 
|  | The CPU recorded in each entry is the one which last processed the flow. | 
|  | If an entry does not hold a valid CPU, then packets mapped to that entry | 
|  | are steered using plain RPS. Multiple table entries may point to the | 
|  | same CPU. Indeed, with many flows and few CPUs, it is very likely that | 
|  | a single application thread handles flows with many different flow hashes. | 
|  |  | 
|  | rps_sock_flow_table is a global flow table that contains the *desired* CPU | 
|  | for flows: the CPU that is currently processing the flow in userspace. | 
|  | Each table value is a CPU index that is updated during calls to recvmsg | 
|  | and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage() | 
|  | and tcp_splice_read()). | 
|  |  | 
|  | When the scheduler moves a thread to a new CPU while it has outstanding | 
|  | receive packets on the old CPU, packets may arrive out of order. To | 
|  | avoid this, RFS uses a second flow table to track outstanding packets | 
|  | for each flow: rps_dev_flow_table is a table specific to each hardware | 
|  | receive queue of each device. Each table value stores a CPU index and a | 
|  | counter. The CPU index represents the *current* CPU onto which packets | 
|  | for this flow are enqueued for further kernel processing. Ideally, kernel | 
|  | and userspace processing occur on the same CPU, and hence the CPU index | 
|  | in both tables is identical. This is likely false if the scheduler has | 
|  | recently migrated a userspace thread while the kernel still has packets | 
|  | enqueued for kernel processing on the old CPU. | 
|  |  | 
|  | The counter in rps_dev_flow_table values records the length of the current | 
|  | CPU's backlog when a packet in this flow was last enqueued. Each backlog | 
|  | queue has a head counter that is incremented on dequeue. A tail counter | 
|  | is computed as head counter + queue length. In other words, the counter | 
|  | in rps_dev_flow[i] records the last element in flow i that has | 
|  | been enqueued onto the currently designated CPU for flow i (of course, | 
|  | entry i is actually selected by hash and multiple flows may hash to the | 
|  | same entry i). | 
|  |  | 
|  | And now the trick for avoiding out of order packets: when selecting the | 
|  | CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table | 
|  | and the rps_dev_flow table of the queue that the packet was received on | 
|  | are compared. If the desired CPU for the flow (found in the | 
|  | rps_sock_flow table) matches the current CPU (found in the rps_dev_flow | 
|  | table), the packet is enqueued onto that CPU’s backlog. If they differ, | 
|  | the current CPU is updated to match the desired CPU if one of the | 
|  | following is true: | 
|  |  | 
|  | - The current CPU's queue head counter >= the recorded tail counter | 
|  | value in rps_dev_flow[i] | 
|  | - The current CPU is unset (>= nr_cpu_ids) | 
|  | - The current CPU is offline | 
|  |  | 
|  | After this check, the packet is sent to the (possibly updated) current | 
|  | CPU. These rules aim to ensure that a flow only moves to a new CPU when | 
|  | there are no packets outstanding on the old CPU, as the outstanding | 
|  | packets could arrive later than those about to be processed on the new | 
|  | CPU. | 
|  |  | 
|  | ==== RFS Configuration | 
|  |  | 
|  | RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on | 
|  | by default for SMP). The functionality remains disabled until explicitly | 
|  | configured. The number of entries in the global flow table is set through: | 
|  |  | 
|  | /proc/sys/net/core/rps_sock_flow_entries | 
|  |  | 
|  | The number of entries in the per-queue flow table are set through: | 
|  |  | 
|  | /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt | 
|  |  | 
|  | == Suggested Configuration | 
|  |  | 
|  | Both of these need to be set before RFS is enabled for a receive queue. | 
|  | Values for both are rounded up to the nearest power of two. The | 
|  | suggested flow count depends on the expected number of active connections | 
|  | at any given time, which may be significantly less than the number of open | 
|  | connections. We have found that a value of 32768 for rps_sock_flow_entries | 
|  | works fairly well on a moderately loaded server. | 
|  |  | 
|  | For a single queue device, the rps_flow_cnt value for the single queue | 
|  | would normally be configured to the same value as rps_sock_flow_entries. | 
|  | For a multi-queue device, the rps_flow_cnt for each queue might be | 
|  | configured as rps_sock_flow_entries / N, where N is the number of | 
|  | queues. So for instance, if rps_sock_flow_entries is set to 32768 and there | 
|  | are 16 configured receive queues, rps_flow_cnt for each queue might be | 
|  | configured as 2048. | 
|  |  | 
|  |  | 
|  | Accelerated RFS | 
|  | =============== | 
|  |  | 
|  | Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load | 
|  | balancing mechanism that uses soft state to steer flows based on where | 
|  | the application thread consuming the packets of each flow is running. | 
|  | Accelerated RFS should perform better than RFS since packets are sent | 
|  | directly to a CPU local to the thread consuming the data. The target CPU | 
|  | will either be the same CPU where the application runs, or at least a CPU | 
|  | which is local to the application thread’s CPU in the cache hierarchy. | 
|  |  | 
|  | To enable accelerated RFS, the networking stack calls the | 
|  | ndo_rx_flow_steer driver function to communicate the desired hardware | 
|  | queue for packets matching a particular flow. The network stack | 
|  | automatically calls this function every time a flow entry in | 
|  | rps_dev_flow_table is updated. The driver in turn uses a device specific | 
|  | method to program the NIC to steer the packets. | 
|  |  | 
|  | The hardware queue for a flow is derived from the CPU recorded in | 
|  | rps_dev_flow_table. The stack consults a CPU to hardware queue map which | 
|  | is maintained by the NIC driver. This is an auto-generated reverse map of | 
|  | the IRQ affinity table shown by /proc/interrupts. Drivers can use | 
|  | functions in the cpu_rmap (“CPU affinity reverse map”) kernel library | 
|  | to populate the map. For each CPU, the corresponding queue in the map is | 
|  | set to be one whose processing CPU is closest in cache locality. | 
|  |  | 
|  | ==== Accelerated RFS Configuration | 
|  |  | 
|  | Accelerated RFS is only available if the kernel is compiled with | 
|  | CONFIG_RFS_ACCEL and support is provided by the NIC device and driver. | 
|  | It also requires that ntuple filtering is enabled via ethtool. The map | 
|  | of CPU to queues is automatically deduced from the IRQ affinities | 
|  | configured for each receive queue by the driver, so no additional | 
|  | configuration should be necessary. | 
|  |  | 
|  | == Suggested Configuration | 
|  |  | 
|  | This technique should be enabled whenever one wants to use RFS and the | 
|  | NIC supports hardware acceleration. | 
|  |  | 
|  | XPS: Transmit Packet Steering | 
|  | ============================= | 
|  |  | 
|  | Transmit Packet Steering is a mechanism for intelligently selecting | 
|  | which transmit queue to use when transmitting a packet on a multi-queue | 
|  | device. To accomplish this, a mapping from CPU to hardware queue(s) is | 
|  | recorded. The goal of this mapping is usually to assign queues | 
|  | exclusively to a subset of CPUs, where the transmit completions for | 
|  | these queues are processed on a CPU within this set. This choice | 
|  | provides two benefits. First, contention on the device queue lock is | 
|  | significantly reduced since fewer CPUs contend for the same queue | 
|  | (contention can be eliminated completely if each CPU has its own | 
|  | transmit queue). Secondly, cache miss rate on transmit completion is | 
|  | reduced, in particular for data cache lines that hold the sk_buff | 
|  | structures. | 
|  |  | 
|  | XPS is configured per transmit queue by setting a bitmap of CPUs that | 
|  | may use that queue to transmit. The reverse mapping, from CPUs to | 
|  | transmit queues, is computed and maintained for each network device. | 
|  | When transmitting the first packet in a flow, the function | 
|  | get_xps_queue() is called to select a queue. This function uses the ID | 
|  | of the running CPU as a key into the CPU-to-queue lookup table. If the | 
|  | ID matches a single queue, that is used for transmission. If multiple | 
|  | queues match, one is selected by using the flow hash to compute an index | 
|  | into the set. | 
|  |  | 
|  | The queue chosen for transmitting a particular flow is saved in the | 
|  | corresponding socket structure for the flow (e.g. a TCP connection). | 
|  | This transmit queue is used for subsequent packets sent on the flow to | 
|  | prevent out of order (ooo) packets. The choice also amortizes the cost | 
|  | of calling get_xps_queues() over all packets in the flow. To avoid | 
|  | ooo packets, the queue for a flow can subsequently only be changed if | 
|  | skb->ooo_okay is set for a packet in the flow. This flag indicates that | 
|  | there are no outstanding packets in the flow, so the transmit queue can | 
|  | change without the risk of generating out of order packets. The | 
|  | transport layer is responsible for setting ooo_okay appropriately. TCP, | 
|  | for instance, sets the flag when all data for a connection has been | 
|  | acknowledged. | 
|  |  | 
|  | ==== XPS Configuration | 
|  |  | 
|  | XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by | 
|  | default for SMP). The functionality remains disabled until explicitly | 
|  | configured. To enable XPS, the bitmap of CPUs that may use a transmit | 
|  | queue is configured using the sysfs file entry: | 
|  |  | 
|  | /sys/class/net/<dev>/queues/tx-<n>/xps_cpus | 
|  |  | 
|  | == Suggested Configuration | 
|  |  | 
|  | For a network device with a single transmission queue, XPS configuration | 
|  | has no effect, since there is no choice in this case. In a multi-queue | 
|  | system, XPS is preferably configured so that each CPU maps onto one queue. | 
|  | If there are as many queues as there are CPUs in the system, then each | 
|  | queue can also map onto one CPU, resulting in exclusive pairings that | 
|  | experience no contention. If there are fewer queues than CPUs, then the | 
|  | best CPUs to share a given queue are probably those that share the cache | 
|  | with the CPU that processes transmit completions for that queue | 
|  | (transmit interrupts). | 
|  |  | 
|  | Per TX Queue rate limitation: | 
|  | ============================= | 
|  |  | 
|  | These are rate-limitation mechanisms implemented by HW, where currently | 
|  | a max-rate attribute is supported, by setting a Mbps value to | 
|  |  | 
|  | /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate | 
|  |  | 
|  | A value of zero means disabled, and this is the default. | 
|  |  | 
|  | Further Information | 
|  | =================== | 
|  | RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into | 
|  | 2.6.38. Original patches were submitted by Tom Herbert | 
|  | (therbert@google.com) | 
|  |  | 
|  | Accelerated RFS was introduced in 2.6.35. Original patches were | 
|  | submitted by Ben Hutchings (bwh@kernel.org) | 
|  |  | 
|  | Authors: | 
|  | Tom Herbert (therbert@google.com) | 
|  | Willem de Bruijn (willemb@google.com) |