{
  "commit": "67f9a619e7460b7d07284a9d0745727a77d3ade6",
  "tree": "d76bcca7ad5e9430150ebcdb391180bf2e0878e8",
  "parents": [
    "d79fc0fc6645b0cf5cd980da76942ca6d6300fa4"
  ],
  "author": {
    "name": "Ingo Molnar",
    "email": "mingo@elte.hu",
    "time": "Sat Sep 10 00:26:16 2005 -0700"
  },
  "committer": {
    "name": "Linus Torvalds",
    "email": "torvalds@g5.osdl.org",
    "time": "Sat Sep 10 10:06:23 2005 -0700"
  },
  "message": "[PATCH] sched: fix SMT scheduler latency bug\n\nWilliam Weston reported unusually high scheduling latencies on his x86 HT\nbox, on the -RT kernel.  I managed to reproduce it on my HT box and the\nlatency tracer shows the incident in action:\n\n                 _------\u003d\u003e CPU#\n                / _-----\u003d\u003e irqs-off\n               | / _----\u003d\u003e need-resched\n               || / _---\u003d\u003e hardirq/softirq\n               ||| / _--\u003d\u003e preempt-depth\n               |||| /\n               |||||     delay\n   cmd     pid ||||| time  |   caller\n      \\   /    |||||   \\   |   /\n      du-2803  3Dnh2    0us : __trace_start_sched_wakeup (try_to_wake_up)\n        ..............................................................\n        ... we are running on CPU#3, PID 2778 gets woken to CPU#1: ...\n        ..............................................................\n      du-2803  3Dnh2    0us : __trace_start_sched_wakeup \u003c\u003c...\u003e-2778\u003e (73 1)\n      du-2803  3Dnh2    0us : _raw_spin_unlock (try_to_wake_up)\n        ................................................\n        ... still on CPU#3, we send an IPI to CPU#1: ...\n        ................................................\n      du-2803  3Dnh1    0us : resched_task (try_to_wake_up)\n      du-2803  3Dnh1    1us : smp_send_reschedule (try_to_wake_up)\n      du-2803  3Dnh1    1us : send_IPI_mask_bitmask (smp_send_reschedule)\n      du-2803  3Dnh1    2us : _raw_spin_unlock_irqrestore (try_to_wake_up)\n        ...............................................\n        ... 1 usec later, the IPI arrives on CPU#1: ...\n        ...............................................\n  \u003cidle\u003e-0     1Dnh.    2us : smp_reschedule_interrupt (c0100c5a 0 0)\n\nSo far so good, this is the normal wakeup/preemption mechanism.  But here\ncomes the scheduler anomaly on CPU#1:\n\n  \u003cidle\u003e-0     1Dnh.    
2us : preempt_schedule_irq (need_resched)\n  \u003cidle\u003e-0     1Dnh.    2us : preempt_schedule_irq (need_resched)\n  \u003cidle\u003e-0     1Dnh.    3us : __schedule (preempt_schedule_irq)\n  \u003cidle\u003e-0     1Dnh.    3us : profile_hit (__schedule)\n  \u003cidle\u003e-0     1Dnh1    3us : sched_clock (__schedule)\n  \u003cidle\u003e-0     1Dnh1    4us : _raw_spin_lock_irq (__schedule)\n  \u003cidle\u003e-0     1Dnh1    4us : _raw_spin_lock_irqsave (__schedule)\n  \u003cidle\u003e-0     1Dnh2    5us : _raw_spin_unlock (__schedule)\n  \u003cidle\u003e-0     1Dnh1    5us : preempt_schedule (__schedule)\n  \u003cidle\u003e-0     1Dnh1    6us : _raw_spin_lock (__schedule)\n  \u003cidle\u003e-0     1Dnh2    6us : find_next_bit (__schedule)\n  \u003cidle\u003e-0     1Dnh2    6us : _raw_spin_lock (__schedule)\n  \u003cidle\u003e-0     1Dnh3    7us : find_next_bit (__schedule)\n  \u003cidle\u003e-0     1Dnh3    7us : find_next_bit (__schedule)\n  \u003cidle\u003e-0     1Dnh3    8us : _raw_spin_unlock (__schedule)\n  \u003cidle\u003e-0     1Dnh2    8us : preempt_schedule (__schedule)\n  \u003cidle\u003e-0     1Dnh2    8us : find_next_bit (__schedule)\n  \u003cidle\u003e-0     1Dnh2    9us : trace_stop_sched_switched (__schedule)\n  \u003cidle\u003e-0     1Dnh2    9us : _raw_spin_lock (trace_stop_sched_switched)\n  \u003cidle\u003e-0     1Dnh3   10us : trace_stop_sched_switched \u003c\u003c...\u003e-2778\u003e (73 8c)\n  \u003cidle\u003e-0     1Dnh3   10us : _raw_spin_unlock (trace_stop_sched_switched)\n  \u003cidle\u003e-0     1Dnh1   10us : _raw_spin_unlock (__schedule)\n  \u003cidle\u003e-0     1Dnh.   11us : local_irq_enable_noresched (preempt_schedule_irq)\n  \u003cidle\u003e-0     1Dnh.   11us \u003c (0)\n\nwe didn't pick up pid 2778! 
It only gets scheduled much later:\n\n   \u003c...\u003e-2778  1Dnh2  412us : __switch_to (__schedule)\n   \u003c...\u003e-2778  1Dnh2  413us : __schedule \u003c\u003cidle\u003e-0\u003e (8c 73)\n   \u003c...\u003e-2778  1Dnh2  413us : _raw_spin_unlock (__schedule)\n   \u003c...\u003e-2778  1Dnh1  413us : trace_stop_sched_switched (__schedule)\n   \u003c...\u003e-2778  1Dnh1  414us : _raw_spin_lock (trace_stop_sched_switched)\n   \u003c...\u003e-2778  1Dnh2  414us : trace_stop_sched_switched \u003c\u003c...\u003e-2778\u003e (73 1)\n   \u003c...\u003e-2778  1Dnh2  414us : _raw_spin_unlock (trace_stop_sched_switched)\n   \u003c...\u003e-2778  1Dnh1  415us : trace_stop_sched_switched (__schedule)\n\nthe reason for this anomaly is the following code in dependent_sleeper():\n\n                /*\n                 * If a user task with lower static priority than the\n                 * running task on the SMT sibling is trying to schedule,\n                 * delay it till there is proportionately less timeslice\n                 * left of the sibling task to prevent a lower priority\n                 * task from using an unfair proportion of the\n                 * physical cpu\u0027s resources. -ck\n                 */\n[...]\n                        if (((smt_curr-\u003etime_slice * (100 - sd-\u003eper_cpu_gain) /\n                                100) \u003e task_timeslice(p)))\n                                        ret \u003d 1;\n\nNote that in contrast to the comment above, we don't actually do the check\nbased on static priority, we do the check based on timeslices.  But\ntimeslices go up and down, and even high-prio tasks can randomly have very\nlow timeslices (just before their next refill) and can thus be judged as\n\u0027lowprio\u0027 by the above piece of code.  This condition is clearly buggy.\nThe correct test is to check for static_prio _and_ to check for the\npreemption priority.  
Even on different static priority levels, a\nhigher-prio interactive task should not be delayed due to a\nhigher-static-prio CPU hog.\n\nThere is a symmetric bug in the \u0027kick SMT sibling\u0027 code of this function as\nwell, which can be solved in a similar way.\n\nThe patch below (against the current scheduler queue in -mm) fixes both\nbugs.  I have built and boot-tested this on x86 SMT, and nice +20 tasks\nstill get properly throttled - so the dependent-sleeper logic is still in\naction.\n\nbtw., these bugs pessimised the SMT scheduler because the \u0027delay wakeup\u0027\nproperty was applied too liberally, so this fix is likely a throughput\nimprovement as well.\n\nI separated out a smt_slice() function to make the code easier to read.\n\nSigned-off-by: Ingo Molnar \u003cmingo@elte.hu\u003e\nSigned-off-by: Andrew Morton \u003cakpm@osdl.org\u003e\nSigned-off-by: Linus Torvalds \u003ctorvalds@osdl.org\u003e\n",
  "tree_diff": [
    {
      "type": "modify",
      "old_id": "6da13bba3e23c18add93ec275e904106dfdda2af",
      "old_mode": 33188,
      "old_path": "kernel/sched.c",
      "new_id": "c61ee3451a04443ca04cd4007c6297f3993cdfd0",
      "new_mode": 33188,
      "new_path": "kernel/sched.c"
    }
  ]
}
