website: Update TME post

Change-Id: I523769be0bb1128618d225902448bbd2a7a5c4cc
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5-website/+/44665
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: Jason Lowe-Power <power.jg@gmail.com>
diff --git a/_posts/2020-10-27-tme.md b/_posts/2020-10-27-tme.md
index b6d0486..ce49f11 100644
--- a/_posts/2020-10-27-tme.md
+++ b/_posts/2020-10-27-tme.md
@@ -10,6 +10,8 @@
 **This post was originally posted on the Arm Research Blog: [here](
 https://community.arm.com/developer/research/b/articles/posts/arms-transactional-memory-extension-support-)**
 
+April 16, 2021: The code example in this article has been updated to reflect the [Armv9-A architecture](https://www.arm.com/company/news/2021/03/arms-answer-to-the-future-of-ai-armv9-architecture?_ga=2.247379204.1872122303.1618821498-1812760823.1604088481) release and to work correctly with the gem5/ruby model.
+
 ## A shift to concurrency
 
 In 2005, Herb Sutter published his seminal article “The Free Lunch is Over” (Sutter, 2005). He outlined that the sequential performance of microprocessors would soon plateau, and the industry would respond by offering more performant processors by way of increased core counts. The consequence of this paradigm shift has been a move away from a purely sequential programming model for writing software to that of a concurrent one with multiple threads of execution. When applications inherently exhibit parallelism, dividing work between multiple threads can yield performance gains when the threads execute on different cores.
@@ -24,7 +26,7 @@
 
 ## Arm's TME
 
-The [Transactional Memory Extension (TME)](https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/new-technologies-for-the-arm-a-profile-architecture) is part of Arm’s A-profile Future Architecture Technologies program, which provides advanced information on unreleased versions of the architecture. TME is a best effort HTM architecture which does not guarantee completion of transactions. The programmer must provide a fallback path to guarantee progress, such as a mutex-guarded critical section. It provides strong isolation, meaning transactions are isolated from both other transactions, and concurrent non-transactional memory accesses. It uses flattened nesting of transactions, in which nested transactions are subsumed by the outer transaction. The effects of a nested transaction do not become visible to other observers until the outer transaction commits. When a nested transaction aborts, it causes the outer transaction (and all its nested transactions within) to abort.
+The [Transactional Memory Extension (TME)](https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/new-technologies-for-the-arm-a-profile-architecture) is an optional feature of [Armv9](https://developer.arm.com/architectures/cpu-architecture/a-profile/exploration-tools/armv9-a-a64-instruction-set-architecture-release-notes?_ga=2.155180600.1872122303.1618821498-1812760823.1604088481) (previously a part of Arm’s A-profile Future Architecture Technologies program). TME is a best effort HTM architecture which does not guarantee completion of transactions. The programmer must provide a fallback path to guarantee progress, such as a mutex-guarded critical section. It provides strong isolation, meaning transactions are isolated from both other transactions, and concurrent non-transactional memory accesses. It uses flattened nesting of transactions, in which nested transactions are subsumed by the outer transaction. The effects of a nested transaction do not become visible to other observers until the outer transaction commits. When a nested transaction aborts, it causes the outer transaction (and all its nested transactions within) to abort.
 
 TME comprises four instructions:
 
@@ -70,7 +72,12 @@
 
 To test the new functionality in gem5, we outline a simple program written in C that uses TME transactions to update a histogram in parallel. This program uses manual lock elision—a lock is used to protect a shared data structure but is bypassed, that is, elided, whenever possible in favor of transactions. This satisfies the requirements of a fallback path if a transaction cannot make progress.
 
-We first define a very simple spinlock that works with [AArch64](https://developer.arm.com/architectures/learn-the-architecture/aarch64-instruction-set-architecture?_ga=2.17759802.282459154.1604342475-1664555334.1603995267)’s weak memory model.
+We first define a very simple spinlock that works with both [AArch64](https://developer.arm.com/architectures/learn-the-architecture/aarch64-instruction-set-architecture?_ga=2.17759802.282459154.1604342475-1664555334.1603995267)’s weak memory model and TME. Arm's recommended lock acquisition sequences using Load-Exclusive/Store-Exclusive typically rely on a load-acquire exclusive instruction (e.g. LDAXR) for correct memory ordering. When this form of lock acquisition is used in conjunction with TME lock elision, mutual exclusion cannot be guaranteed. To circumvent this issue, two solutions are possible:
+
+1. Construct the lock acquisition sequence using Armv8.1 Large System Extensions (LSE) atomics. This is the idiomatic and performant Armv8-A and Armv9-A locking sequence, and it can be found in the Linux kernel.
+2. Add a full memory barrier (DMB SY) before the first memory operation of the critical section.
+
+Since LSE atomics aren't currently implemented in gem5/ruby, we use the second option.
 
 ```cpp
 #include <stdatomic.h>
@@ -82,8 +89,12 @@
 }
 
 inline void lock_acquire(lock_t *lock) {
-    while (atomic_exchange_explicit(lock, 1, memory_order_acquire))
+    // The following atomic exchange can use relaxed memory
+    // ordering since it is followed by a full barrier.
+    while (atomic_exchange_explicit(lock, 1, memory_order_relaxed))
         ; // spin until acquired
+    // This will generate a full memory barrier, e.g. DMB SY
+    atomic_thread_fence(memory_order_seq_cst);
 }
 
 inline int lock_is_acquired(lock_t *lock) {
@@ -233,9 +244,11 @@
 TME is supported in GCC as of [version 10](https://gcc.gnu.org/gcc-10/changes.html)—this includes [ACLE intrinsics](https://developer.arm.com/documentation/101028/0010/Transactional-Memory-Extension--TME--intrinsics?_ga=2.211085974.282459154.1604342475-1664555334.1603995267). To compile source files with TME instructions, an AArch64 compiler must be used with the feature enabled via the march flag, for example, `-march=armv8-a+tme`.
 
 ```
-aarch64-linux-gnu-gcc -std=c11 -O2 -static -march=armv8-a+tme -pthread -o histogram.exe ./histogram.c
+aarch64-linux-gnu-gcc -std=c11 -O2 -static -march=armv8-a+tme+nolse -pthread -o histogram.exe ./histogram.c
 ```
 
+As of this writing, gem5 implements Arm’s Large System Extensions (LSE) in the Classic memory system but not in Ruby. To use system emulation mode without extra effort, this feature should be disabled in the source tree. To do this in gem5 v20.1, modify **./src/arch/arm/isa.cc:l106** and change `haveLSE` from `true` to `false`.
+
 gem5 must then be compiled with the new Ruby `MESI_Three_Level_HTM` protocol.
 
 ```