)]}'
{
  "commit": "f2dd80ecca5f06b46134f2bd811f046c503c756c",
  "tree": "8723efa12a71458cc76b4d1bc44d439802dab2ff",
  "parents": [
    "fdf880a60835cd1dec2563463ac63ae3084e0ddc"
  ],
  "author": {
    "name": "Daniel Axtens",
    "email": "dja@axtens.net",
    "time": "Wed Sep 23 16:41:48 2015 +1000"
  },
  "committer": {
    "name": "Michael Ellerman",
    "email": "mpe@ellerman.id.au",
    "time": "Fri Oct 09 08:07:19 2015 +1100"
  },
  "message": "powerpc/powernv: Panic on unhandled Machine Check\n\nAll unrecovered machine check errors on PowerNV should cause an\nimmediate panic. There are 2 reasons that this is the right policy:\nit\u0027s not safe to continue, and we\u0027re already trying to reboot.\n\nFirstly, if we go through the recovery process and do not successfully\nrecover, we can\u0027t be sure about the state of the machine, and it is\nnot safe to recover and proceed.\n\nLinux knows about the following sources of Machine Check Errors:\n- Uncorrectable Errors (UE)\n- Effective - Real Address Translation (ERAT)\n- Segment Lookaside Buffer (SLB)\n- Translation Lookaside Buffer (TLB)\n- Unknown/Unrecognised\n\nIn the SLB, TLB and ERAT cases, we can further categorise these as\nparity errors, multihit errors or unknown/unrecognised.\n\nWe can handle SLB errors by flushing and reloading the SLB. We can\nhandle TLB and ERAT multihit errors by flushing the TLB. (It appears\nwe may not handle TLB and ERAT parity errors: I will investigate\nfurther and send a followup patch if appropriate.)\n\nThis leaves us with uncorrectable errors. Uncorrectable errors are\nusually the result of ECC memory detecting an error that it cannot\ncorrect, but they also crop up in the context of PCI cards failing\nduring DMA writes, and during CAPI error events.\n\nThere are several types of UE, and there are 3 places a UE can occur:\nSkiboot, the kernel, and userspace. For Skiboot errors, we have the\nfacility to make some recoverable. For userspace, we can simply kill\n(SIGBUS) the affected process. We have no meaningful way to deal with\nUEs in kernel space or in unrecoverable sections of Skiboot.\n\nCurrently, these unrecovered UEs fall through to\nmachine_check_exception() in traps.c, which calls die(), which OOPSes\nand sends SIGBUS to the process. This sometimes allows us to stumble\nonwards. For example we\u0027ve seen UEs kill the kernel eehd and\nkhugepaged. However, the process killed could have held a lock, or it\ncould have been a more important process, etc: we can no longer make\nany assertions about the state of the machine. Similarly if we see a\nUE in skiboot (and again we\u0027ve seen this happen), we\u0027re not in a\nposition where we can make any assertions about the state of the\nmachine.\n\nLikewise, for unknown or unrecognised errors, we\u0027re not able to say\nanything about the state of the machine.\n\nTherefore, if we have an unrecovered MCE, the most appropriate thing\nto do is to panic.\n\nThe second reason is that since e784b6499d9c (\"powerpc/powernv: Invoke\nopal_cec_reboot2() on unrecoverable machine check errors.\"), we\nattempt a special OPAL reboot on an unhandled MCE. This is so the\nhardware can record error data for later debugging.\n\nThe comments in that commit assert that we are heading down the panic\npath anyway. At the moment this is not always true. With UEs in kernel\nspace, for instance, they are marked as recoverable by the hardware,\nso if the attempt to reboot failed (e.g. old Skiboot), we wouldn\u0027t\npanic() but would simply die() and OOPS. It doesn\u0027t make sense to be\nstaggering on if we\u0027ve just tried to reboot: we should panic().\n\nExplicitly panic() on unrecovered MCEs on PowerNV.\nUpdate the comments appropriately.\n\nThis fixes some hangs following EEH events on cxlflash setups.\n\nSigned-off-by: Daniel Axtens \u003cdja@axtens.net\u003e\nReviewed-by: Andrew Donnellan \u003candrew.donnellan@au1.ibm.com\u003e\nReviewed-by: Ian Munsie \u003cimunsie@au1.ibm.com\u003e\nSigned-off-by: Michael Ellerman \u003cmpe@ellerman.id.au\u003e\n",
  "tree_diff": [
    {
      "type": "modify",
      "old_id": "230f3a7cdea45f8d160797fe55eb7b154c9c9dba",
      "old_mode": 33188,
      "old_path": "arch/powerpc/platforms/powernv/opal.c",
      "new_id": "4296d55e88f30afa7cb91fd54d06e6b2a532d577",
      "new_mode": 33188,
      "new_path": "arch/powerpc/platforms/powernv/opal.c"
    }
  ]
}
