website: Re-add gem5 101 content

The gem5 101 page referenced a wiki which has been taken down. The
content has been re-added by translating a text-dump of the wiki to
Markdown.

Important Note: While efforts were made to be faithful to the original
tutorials, some dead links persist in Homeworks 3, 4, 5, and 6. The
tutorials may also be incompatible with the latest versions of gem5. This
content should be treated as an archive rather than a reference for
learning gem5.

Change-Id: I42ca5bf86d6c2b82e51ad93224740f8398abd434
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5-website/+/55686
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Bobby Bruce <bbruce@ucdavis.edu>
diff --git a/_data/documentation.yml b/_data/documentation.yml
index c4ad7b2..4d2d66e 100755
--- a/_data/documentation.yml
+++ b/_data/documentation.yml
@@ -295,7 +295,21 @@
           url: /documentation/learning_gem5/part3/simple-MI_example
     - title: gem5 101
       id: gem5_101
-      url: /documentation/learning_gem5/gem5_101/
+      subitems:
+        - page: gem5 101
+          url: /documentation/learning_gem5/gem5_101
+        - page: Homework 1
+          url: /documentation/learning_gem5/gem5_101/homework-1
+        - page: Homework 2
+          url: /documentation/learning_gem5/gem5_101/homework-2
+        - page: Homework 3
+          url: /documentation/learning_gem5/gem5_101/homework-3
+        - page: Homework 4
+          url: /documentation/learning_gem5/gem5_101/homework-4
+        - page: Homework 5
+          url: /documentation/learning_gem5/gem5_101/homework-5
+        - page: Homework 6
+          url: /documentation/learning_gem5/gem5_101/homework-6
 
   - title: gem5 Doxygen
     items:
diff --git a/_pages/documentation/learning_gem5/part4_gem5_101.md b/_pages/documentation/learning_gem5/gem5_101/0_gem5_101.md
similarity index 81%
rename from _pages/documentation/learning_gem5/part4_gem5_101.md
rename to _pages/documentation/learning_gem5/gem5_101/0_gem5_101.md
index a841e33..5836c78 100644
--- a/_pages/documentation/learning_gem5/part4_gem5_101.md
+++ b/_pages/documentation/learning_gem5/gem5_101/0_gem5_101.md
@@ -2,7 +2,7 @@
 layout: documentation
 title: gem5 101
 doc: Learning gem5
-parent: learning_gem5
+parent: gem5_101
 permalink: /documentation/learning_gem5/gem5_101/
 authors: Swapnil Haria
 ---
@@ -14,33 +14,44 @@
 particular offering of architecture courses, CS 752 and CS 757, taught at the
 University of Wisconsin-Madison.
 
+**IMPORTANT NOTE:** The homework parts linked here were translated to
+Markdown from a wiki that no longer exists. Best efforts have been made to
+preserve the content in its original state, but these homework assignments
+may still:
+
+1. Be out of date and incompatible with the latest versions of gem5.
+2. Contain dead links or references to out-of-date resources.
+
+We do not guarantee that these homework assignments can be completed easily
+in their current state.
+
 ## First steps with gem5, and Hello World!
-[Part I](http://pages.cs.wisc.edu/~david/courses/cs752/Fall2015/wiki/index.php?n=Main.Homework1)
+[Part I](/documentation/learning_gem5/gem5_101/homework-1)
 
 In part I, you will first learn to download and build gem5 correctly, create a simple configuration script for a simple system, write a simple C program and run a gem5 simulation. You will then introduce a two-level cache hierarchy in your system (fun stuff). Finally, you get to view the effect of changing system parameters such as memory types, processor frequency and complexity on the performance of your simple program.
 
 ## Getting down and dirty
-[Part II](http://pages.cs.wisc.edu/~david/courses/cs752/Fall2015/wiki/index.php?n=Main.Homework2)
+[Part II](/documentation/learning_gem5/gem5_101/homework-2)
 
 In part I, we used gem5's capabilities straight out of the box. Now, we will witness the flexibility and usefulness of gem5 by extending the simulator's functionality. We walk you through the implementation of an x86 instruction (FSUBR), which is currently missing from gem5. This will introduce you to gem5's language for describing instruction sets, and illustrate how instructions are decoded and broken down into micro-ops which are ultimately executed by the processor.
 
 ## Pipelining solves everything
-[Part III](http://pages.cs.wisc.edu/~david/courses/cs752/Fall2015/wiki/index.php?n=Main.Homework3)
+[Part III](/documentation/learning_gem5/gem5_101/homework-3)
 
 From the ISA, we now move on to the processor micro-architecture. Part III introduces the various CPU models implemented in gem5, and analyzes the performance of a pipelined implementation. Specifically, you will learn how the latency and bandwidth of different pipeline stages affect overall performance. A sample usage of gem5 pseudo-instructions is also included at no additional cost.
 
 ## Always be experimenting
-[Part IV](http://pages.cs.wisc.edu/~david/courses/cs752/Fall2015/wiki/index.php?n=Main.Homework4)
+[Part IV](/documentation/learning_gem5/gem5_101/homework-4)
 
 Exploiting instruction-level parallelism (ILP) is a useful way of improving single-threaded performance. Branch prediction and predication are two common techniques of exploiting ILP. In this part, we use gem5 to verify the hypothesis that graph algorithms that avoid branches perform better than algorithms that use branches. This is a useful exercise in understanding how to incorporate gem5 into your research process.
 
 ## Cold, hard, cache
-[Part V](http://pages.cs.wisc.edu/~david/courses/cs752/Fall2015/wiki/index.php?n=Main.Homework5)
+[Part V](/documentation/learning_gem5/gem5_101/homework-5)
 
 After looking at the processor core, we now turn our attention to the cache hierarchy. We continue our focus on experimentation, and consider tradeoffs in cache design such as replacement policies and set-associativity. Furthermore, we also learn more about the gem5 simulator, and create our first simObject!
 
 ## Single-core is so two-thousand and late
-[Part VI](http://pages.cs.wisc.edu/~markhill/cs757/Spring2016/wiki/index.php?n=Main.Homework3)
+[Part VI](/documentation/learning_gem5/gem5_101/homework-6)
 
 For this last part, we go both multi-core and full system at the same time! We analyze the performance of a simple application as it is given more computational resources (cores). We also boot a full-fledged, unmodified operating system (Linux) on the target system simulated by gem5. Most importantly, we teach you how to create your own, simpler version of the dreaded fs.py configuration script, one that you can feel comfortable modifying.
 
diff --git a/_pages/documentation/learning_gem5/gem5_101/1_gem5_101_homework1.md b/_pages/documentation/learning_gem5/gem5_101/1_gem5_101_homework1.md
new file mode 100644
index 0000000..8282d4b
--- /dev/null
+++ b/_pages/documentation/learning_gem5/gem5_101/1_gem5_101_homework1.md
@@ -0,0 +1,54 @@
+---
+layout: documentation
+title: Homework 1 for CS 752
+doc: Learning gem5
+parent: gem5_101
+permalink: /documentation/learning_gem5/gem5_101/homework-1
+authors:
+---
+
+# Homework 1 for CS 752: Advanced Computer Architecture I (Fall 2015 Section 1 of 1)
+
+**Due Monday, 9/14**
+
+**You should do this assignment on your own. No late assignments.**
+
+Person of contact for this assignment: Nilay Vaish  <nilay@cs.wisc.edu>
+
+For this assignment, you will go through the first few parts of the gem5 tutorial we are currently constructing. This tutorial is a work in progress and may have typos and bugs in it. Feedback about errors, big or small, is appreciated. Please email <powerjg@cs.wisc.edu> with subject "gem5-tutorial comments" with any comments or errors you find.
+
+## Step 1: complete Part I of the gem5 tutorial
+
+There are currently four (three complete) chapters of this tutorial. The first chapter covers downloading and building gem5. The second chapter walks you through creating a simple configuration script and how to run gem5. The third chapter adds some complexity to your first script by adding a two-level cache hierarchy. And the fourth section (incomplete as of this writing) goes through the gem5 output and how to understand the statistics.
+
+The tutorial does include links to the final scripts at the end of each section. However, it's in your best interest to walk through the tutorial step-by-step and create the scripts yourself.
+
+## Step 2: Write an interesting application
+
+Write a program that implements the Sieve of Eratosthenes and outputs a single integer at the end: the number of prime numbers <= 100,000,000. Compile your program as a static binary.  The output should be: 5761455.
+
+## Step 3: Use gem5!
+
+Here, you will run your application in gem5 and change the CPU model, CPU frequency, and memory configuration and describe the changes in performance.
+
+* Run your sieve program in gem5 instead of the 'hello' example. **Choose an appropriate input size.** You should use something large enough that the application is interesting, but not so large that gem5 takes more than 10 minutes to execute a simulation. I found that 1,000,000 on my machine takes about 5 minutes. *Note: The MinorCPU (next step) takes about 10x longer than TimingSimpleCPU takes.*
+* Change the CPU model from TimingSimpleCPU to MinorCPU. Hint: you may want to add a command line parameter to control the CPU model.
+* Vary the CPU clock from 1 GHz to 3 GHz (in steps of 500 MHz) with both CPU models. Hint: again, you may want to add a command line parameter for the frequency.
+* Change the memory configuration from DDR3_1600_x64 to DDR3_2133_x64 (DDR3 with a faster clock) and LPDDR2_S4_1066_x32 (low-power DRAM often found in mobile devices).
+
+## What to Hand In
+Turn in your assignment by sending an email message to Nilay Vaish <nilay@cs.wisc.edu> and Prof. David Wood <david@cs.wisc.edu> with the subject line:
+"[CS752 Homework1]"
+
+* The email should contain the name and ID numbers of the student submitting the assignment. The files below should be attached as a zip file to the email.
+* A file named sieve.c with the implementation of sieve of Eratosthenes.
+* A file named sieve-config.py (and any other necessary files) that was used to run gem5. This file should be set up to use TimingSimpleCPU at 1 GHz and DDR3_1600_x64 by default. **Also bring a printout of this to class.**
+* A file named report.pdf containing a short report with your observations and conclusions from the experiment. This report should contain answers to the following questions:
+    * Which CPU model is more sensitive to changing the CPU frequency? Why do you think this is?
+    * Which CPU model is more sensitive to the memory technology? Why?
+    * Is the sieve application more sensitive to the CPU frequency or the memory technology? Why?
+    * If you were to use a different application, do you think your conclusions would change? Why?
+
+**Bring a paper copy of your report to class on Monday!**
+
+
diff --git a/_pages/documentation/learning_gem5/gem5_101/2_gem5_101_homework2.md b/_pages/documentation/learning_gem5/gem5_101/2_gem5_101_homework2.md
new file mode 100644
index 0000000..759d8db
--- /dev/null
+++ b/_pages/documentation/learning_gem5/gem5_101/2_gem5_101_homework2.md
@@ -0,0 +1,92 @@
+---
+layout: documentation
+title: Homework 2 for CS 752
+doc: Learning gem5
+parent: gem5_101
+permalink: /documentation/learning_gem5/gem5_101/homework-2
+authors:
+---
+
+# Homework 2 for CS 752: Advanced Computer Architecture I (Fall 2015 Section 1 of 1)
+
+**Due 1pm, Monday, 9/21**
+
+**You should do this assignment alone. No late assignments.**
+
+
+## Purpose
+The purpose of this assignment is to help you become familiar with gem5's language for describing instruction sets. You will go through the ISA files in `src/arch/x86/isa` and understand how instructions are decoded and broken down into micro-ops which are ultimately executed.  To get a better understanding, you will implement a missing x87 instruction (FSUBR). Note that x87 is a subset of the x86 ISA. This subset was originally added to provide floating-point support, but is not used much now. To test your implementation of the instruction, you will write a small program that uses this particular instruction through the inline assembly feature of GCC.  The program will then be simulated using gem5.
+
+As you might already know, the x86 instructions typically do a lot of work. While one can implement the functionality of each instruction individually, since a lot of work is common across many instructions, typically, each instruction is implemented as a combination of several smaller parts.  The entire instruction is typically referred to as a macro-op, while the smaller parts are referred to as micro-ops.  To implement an instruction in gem5, we first provide the ISA decoder with the information on the macro-op, then we provide an implementation of the macro-op in terms of micro-ops.  Finally, we implement the micro-ops that are not already implemented.  We will carry out these steps for the FSUBR instruction.  Our implementation of FSUBR will mirror that of FSUB, whose implementation is already available in gem5.
+
+
+1. There are many ways in which instructions are encoded in the x86 ISA. We will focus on the x87 subset.  You can read more about instruction encoding in a [manual](http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2008/10/24594_APM_v3.pdf) provided by AMD. Let's go through the file `src/arch/x86/isa/decoder/one_byte_opcodes.isa` to understand how gem5 decodes instructions from the x86 ISA. The file is written in a language designed specifically for expressing instruction sets. The contents of the file are ultimately converted to a C++ switch statement. We first decode the top 5 bits of the opcode byte. There are 32 possible values of these 5 bits, and the switch statement lists all the possible cases.
+
+All x87 instructions begin with an opcode byte in the range 0xD8 to 0xDF. Therefore the topmost 5 bits are always 0x1B.  For this case, we include the file `src/arch/x86/isa/decoder/x87.isa`. Let's jump to that file.  In this file, we start with decoding the bottom 3 bits.  You can take a look at Table A-15 (page 443) in the manual mentioned above for the instructions represented by the different cases of the bottom three bits. For example, FSUB and FSUBR are represented by opcodes 0xD8 and 0xDC, i.e. the cases 0x0 and 0x4.  To distinguish between the functionality provided by these different opcodes for the same instruction, you will have to understand the meaning of the ModRM field of the instruction.  Read about it in the manual linked above.  In the file x87.isa, you can check that FSUB appears in the case statements for 0x0 and 0x4.  You can also observe that FSUBR's implementation is missing.
+
+As a first step, understand the difference between the two implementations of the FSUBR instruction: one with opcode byte D8h and the other with opcode byte DCh.  For this, you should read the description of FSUBR provided in the
+[manual](http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/26569_APM_v51.pdf) on x87 instructions.
+
+
+2. Now, find the three places in which fsubr appears in the file x87.isa. Replace the currently appearing statements with ones similar to those specified for FSUB in the same file.  By writing something like Inst::FSUBR, you are asking for that instruction to be used instead of the default one, which simply prints a warning that the instruction is not implemented.
+
+3. Now, we need to provide an implementation of the FSUBR macro-op in terms of some micro-ops.  Again we will mirror the implementation of the FSUB instruction.  Go to the directory `src/arch/x86/isa/insts/x87/arithmetic/`. This directory holds the definitions of the different x87 arithmetic instructions in terms of micro-ops.  Take a look at how the FSUB instruction has been implemented using micro-ops.  FSUB1 and FSUB2 correspond to the two different opcodes that we mentioned before.  For each type, we have to provide three different implementations: one that only uses registers, one that reads one of the operands from memory using the address provided in the instruction, and one that uses the address of the instruction pointer to read the operand. The micro-ops used for the three implementations should be straightforward to understand.
+
+The way gem5's instruction parser works requires us to define all three implementations for the FSUBR instruction.  In all, you should have six separate code blocks for FSUBR, like those specified for FSUB.
+
+
+4. Lastly, we need to provide an implementation of the micro-op subfp.  You can check that the implementation is already available in the file `src/arch/x86/isa/microops/fpop.isa`, so you do not need to do anything for this step.
+
+
+5. Compile gem5 for x86 ISA to test that you did not make any mistakes in the implementation.
+
+There are many aspects of gem5's ISA language that we have not discussed.  Most of these aspects are undocumented, and one needs to figure them out by going over the code in the relevant files.
+
+
+6. Now, we will test the implementation of the FSUBR instruction.  For this purpose, we will first write a short C program that reads a file with two floating point numbers, subtracts them and prints the output.  To make sure that FSUBR is used for subtraction, we will explicitly use it using the inline assembly feature of GCC.
+
+Assembly instructions are written inline with the rest of the code using the 'asm' code block.  This code block contains two portions: the instruction portion and the constraint portion.  The instruction portion is a string containing the assembly instructions.  The GNU C compiler does not check this string for correctness, so anything is allowed. The constraint portion specifies what GCC can or cannot do with the input and output operands, and what registers or memory are affected by the instruction portion. There is documentation available from GCC and other sources.  I recommend reading:
+
+* http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
+* http://locklessinc.com/articles/gcc_asm/
+
+
+It is likely that most of you have never used this feature of GCC. Since it is somewhat hard to understand how to correctly specify all the constraints related to assembly instructions, we are providing an implementation with some explanation.
+
+```cpp
+  float subtract(float in1, float in2)
+  {
+    float ret = 0.0;
+    asm ("fsubr %2, %0" : "=&t" (ret) : "%0" (in1), "u" (in2));
+    return ret;
+  }
+```
+
+The aim of the function is to subtract the two floating-point input values and return the result.  To do so, we use FSUBR in the 'asm' code block. In our case the instruction string is "fsubr %2, %0".  After that, we specify constraints on the output operand, which we ask to be the variable ret. We then specify the two input operands: in1 and in2.  The letters 't' and 'u' specify the top and the second-to-top registers of the x87 stack.
+
+
+Ok, getting back to our main purpose.  Use the function provided above in a C
+program that takes as input a file name, reads two floating point numbers from
+the file, uses FSUBR to subtract the numbers and prints the result to stdout.
+Compile the program statically to generate a binary.  You may also look at
+the assembly code generated by the compiler using the -S flag.
+
+
+7. Now simulate this program using gem5.  You will need to figure out how to supply command line arguments to programs in gem5. You can either use the example script supplied with gem5 in configs/examples/se.py or the script you created in Homework 1. Take a look at the file configs/examples/se.py and the file configs/common/Options.py to figure out how input arguments are to be supplied.
+
+
+## What to Hand In
+
+Turn in your assignment by sending an email message to Prof. David Wood <david@cs.wisc.edu> and Nilay Vaish <nilay@cs.wisc.edu>  with the subject line: "[CS752 Homework2]".
+
+1. The email should contain the name and ID numbers of the student submitting
+the assignment. The files below should be attached as a zip file to the email.
+
+2. A file named fsubr.c which is used for testing the implementation of FSUBR.
+
+3. A patch file containing the changes made to `src/arch/x86/isa/insts/x87/arithmetic/` and `src/arch/x86/isa/decoder/x87.isa`.
+You can generate the patch using the mercurial command `hg diff src/arch/x86/isa > /tmp/changes.patch`.
+
+4. stats.txt file for two different simulations of the program: one using the TimingSimpleCPU and the other using the MinorCPU.
+
+5. A short report (200 words) on your experience with the language / method used by gem5 for implementing different ISAs.
diff --git a/_pages/documentation/learning_gem5/gem5_101/3_gem5_101_homework3.md b/_pages/documentation/learning_gem5/gem5_101/3_gem5_101_homework3.md
new file mode 100644
index 0000000..50b9bef
--- /dev/null
+++ b/_pages/documentation/learning_gem5/gem5_101/3_gem5_101_homework3.md
@@ -0,0 +1,106 @@
+---
+layout: documentation
+title: Homework 3 for CS 752
+doc: Learning gem5
+parent: gem5_101
+permalink: /documentation/learning_gem5/gem5_101/homework-3
+authors:
+---
+
+# Homework 3 for CS 752: Advanced Computer Architecture I (Fall 2015 Section 1 of 1)
+
+**Due 1pm, Tuesday, 9/29**
+
+**You should do this assignment alone. No late assignments.**
+
+The purpose of this assignment is to give you experience with pipelined CPUs.  You will simulate a given program with the timing simple CPU to understand the instruction mix of the program.  Then, you will simulate the same program with a pipelined in-order CPU to understand how the latency and bandwidth of different parts of the pipeline affect performance.  You will also be exposed to pseudo-instructions that are used for carrying out functions required by the underlying experiment.  This homework is based on exercise 3.6 of CA:AQA, 3rd edition.
+
+----
+
+1. The DAXPY loop (double precision aX + Y) is an oft-used operation in programs that work with matrices and vectors.  The following code implements DAXPY in C++11.
+
+```cpp
+  #include <cstdio>
+  #include <random>
+
+  int main()
+  {
+    const int N = 1000;
+    double X[N], Y[N], alpha = 0.5;
+    std::random_device rd; std::mt19937 gen(rd());
+    std::uniform_real_distribution<> dis(1, 2);
+    for (int i = 0; i < N; ++i)
+    {
+      X[i] = dis(gen);
+      Y[i] = dis(gen);
+    }
+
+    // Start of daxpy loop
+    for (int i = 0; i < N; ++i)
+    {
+      Y[i] = alpha * X[i] + Y[i];
+    }
+    // End of daxpy loop
+
+    double sum = 0;
+    for (int i = 0; i < N; ++i)
+    {
+      sum += Y[i];
+    }
+    printf("%lf\n", sum);
+    return 0;
+  }
+```
+
+Your first task is to compile this code statically and simulate it with gem5 using the timing simple CPU.  Compile the program with the `-O2` flag to avoid running into unimplemented x87 instructions while simulating with gem5.  Report the breakup of instructions across the different op classes.  For this, grep for op_class in the file stats.txt.
+
+
+2. Generate the assembly code for the daxpy program above by using the `-S` and `-O2` options when compiling with GCC.  As you can see from the assembly code, instructions that are not central to the actual task of the program (computing `aX + Y`) will also be simulated.  This includes the instructions for generating the vectors `X` and `Y`, summing elements in `Y` and printing the sum.  When I compiled the code with `-S`, I got about 350 lines of assembly code, with only about 10-15 lines for the actual daxpy loop.
+
+Usually while carrying out experiments for evaluating a design, one would like to look only at statistics for the portion of the code that is most important.  To do so, programs are typically annotated so that the simulator, on reaching an annotated portion of the code, carries out functions like creating a checkpoint or outputting and resetting statistical variables.
+
+You will edit the C++ code from the first part to output and reset stats just before the start of the DAXPY loop and just after it.  For this, include the file `util/m5/m5op.h` in the program.  You will find this file in `util/m5` directory of the gem5 repository.  Use the function `m5_dumpreset_stats()` from this file in your program. This function outputs the statistical variables and then resets them. You can provide 0 as the value for the delay and the period arguments.
+
+To provide the definition of the `m5_dumpreset_stats()`, go to the directory `util/m5` and edit the Makefile.x86 in the following way:
+
+```
+  diff --git a/util/m5/Makefile.x86 b/util/m5/Makefile.x86
+  --- a/util/m5/Makefile.x86
+  +++ b/util/m5/Makefile.x86
+  [=@@=] -31,7 +31,7 @@
+   AS=as
+   LD=ld
+
+  -CFLAGS=-O2 -DM5OP_ADDR=0xFFFF0000
+  +CFLAGS=-O2
+   OBJS=m5.o m5op_x86.o
+
+   all: m5
+```
+
+Execute the command `make -f Makefile.x86` in the directory `util/m5`.  This will create an object file named `m5op_x86.o`.  Link this file with the program for DAXPY.  Now again simulate the program with the timing simple CPU.  This time you should see three sets of statistics in the file stats.txt.  Report the breakup of instructions among different op classes for the three parts of the program.  Provide the fragment of the generated assembly code that starts with the first call to `m5_dumpreset_stats()`, ends with the second call to `m5_dumpreset_stats()`, and has the main daxpy loop in between.
+
+
+3. There are several different types of CPUs that gem5 supports: atomic, timing, out-of-order, in-order and KVM.  Let's talk about the timing and the in-order CPUs.  The timing CPU (also known as TimingSimpleCPU) executes each arithmetic instruction in a single cycle, but requires multiple cycles for memory accesses.  Also, it is not pipelined, so only a single instruction is being worked on at any time.  The in-order CPU (also known as Minor) executes instructions in a pipelined fashion.  As I understand it, it has the following pipe stages: fetch1, fetch2, decode and execute.
+
+Take a look at the file `src/cpu/minor/MinorCPU.py`.  In the definition of `MinorFU`, the class for functional units, we define two quantities `opLat` and `issueLat`.  From the comments provided in the file, understand how these two parameters are to be used.  Also note the different functional units that are instantiated as defined in class `MinorDefaultFUPool`.
+
+
+Assume that the issueLat and the opLat of the FloatSimdFU can vary from 1 to 6 cycles and that they always sum to 7 cycles.  For each decrease in the opLat, we need to pay with a unit increase in issueLat.  Which design of the FloatSimd functional unit would you prefer?  Provide statistical evidence obtained through simulations of the annotated portion of the code.
+
+You can find a skeleton file that extends the minor CPU here <$urlbase}html/cpu.py>. If you use this file, you will have to modify your config scripts to work with it. Also, you'll have to modify this file to support the next part.
+
+4. The Minor CPU has by default two integer functional units as defined in the file MinorCPU.py (ignore the Multiplication and the Division units).  Assume our original Minor CPU design requires 2 cycles for integer functions and 4 cycles for floating point functions.  In our upcoming Minor CPU, we can halve either of these latencies.  Which one should we go for?  Provide statistical evidence obtained through simulations.
+
+
+## What to Hand In
+Turn in your assignment by sending an email message to Prof. David Wood <david@cs.wisc.edu> and Nilay Vaish <nilay@cs.wisc.edu>  with the subject line: "CS752 Homework3".
+
+1. The email should contain the name and ID numbers of the student submitting
+the assignment. The files below should be attached as a zip file to the email.
+
+2. A file named daxpy.cpp which is used for testing.  This file should also include the pseudo-instructions (`m5_dumpreset_stats()`) as asked in part 2.  Also provide a file daxpy.s with the fragment of the generated assembly code as asked for in part 2.
+
+3. stats.txt and config.ini files for all the simulations.
+
+4. A short report (200 words) on questions asked.
diff --git a/_pages/documentation/learning_gem5/gem5_101/4_gem5_101_homework4.md b/_pages/documentation/learning_gem5/gem5_101/4_gem5_101_homework4.md
new file mode 100644
index 0000000..f4cb77a
--- /dev/null
+++ b/_pages/documentation/learning_gem5/gem5_101/4_gem5_101_homework4.md
@@ -0,0 +1,105 @@
+---
+layout: documentation
+title: Homework 4 for CS 752
+doc: Learning gem5
+parent: gem5_101
+permalink: /documentation/learning_gem5/gem5_101/homework-4
+authors:
+---
+
+# Homework 4 for CS 752: Advanced Computer Architecture I (Fall 2015 Section 1 of 1)
+
+
+**Due Monday, 10/7**
+
+**You should do this assignment on your own. No late assignments.**
+
+Person of contact for this assignment: **Nilay Vaish** <nilay@cs.wisc.edu>.
+
+
+This homework is experimental in nature since I thought of this only
+yesterday (28 September, 2015).  It deals with two different methods of
+exploiting instruction level parallelism: "branch prediction" and "predication".
+
+Consider the following piece of code:
+```cpp
+  if (x < y)
+     x = y;
+```
+
+There are at least two ways in which we can generate the assembly code for this.
+
+1. using branches:
+
+```
+    compare x, y
+    jump if not less L
+    move x, y
+  L:
+```
+
+2. using conditional move:
+
+```
+  compare x, y
+  conditionally move y into x.
+```
+
+Which version should one prefer?  We will try to get some understanding of
+this question in this homework.
+
+
+1. Here are some [posts](http://yarchive.net/comp/linux/cmov.html) on cmov from
+Linus Torvalds, the creator and maintainer of the Linux operating system.
+Linus has provided a short piece of C code for measuring the performance
+of branches and conditional moves.  Run the code on your favorite x86
+processor and report the timing numbers for the two versions of the `choose()`
+function.  You should run each version at least 10 times.  Report both the average
+execution time and the standard deviation in the run times.
+If you see too much variation in the run times,  run for more iterations.  This
+should typically stabilize the performance.
+
+
+2. Now simulate the same two versions with gem5 using the out-of-order
+(default configuration) processor.  Lower the number of iterations to
+1,000,000 since 100,000,000 is a lot of iterations for gem5.  Again report
+which option performs the best.  Also report the total number of
+branches predicted and the number of branches predicted incorrectly.
+
+----
+
+3. A paper on [branch avoiding algorithms](http://dl.acm.org/citation.cfm?id=2755580)
+was published at SPAA 2015.  The authors suggest that graph algorithms that avoid branches
+may perform better than algorithms that use branches.  Let's try to verify this claim.
+
+The paper provides two versions of an algorithm for computing the
+connected components in an undirected graph.  The first version uses
+branching and the second one uses conditional moves.  I implemented
+both the versions, but there is a slight problem.  The first version can
+be implemented in C++ directly, but the second one requires use of CMOV
+instruction.  I was not able to get this instruction working with inline
+assembly, but with raw assembly things work.  So along with the [C++11 source code](http://pages.cs.wisc.edu/~david/courses/cs752/Fall2015/html/hw4/connected-components.cpp), I am providing you the GCC-generated [assembly code](http://pages.cs.wisc.edu/~david/courses/cs752/Fall2015/html/hw4/connected-components.s) and the [statically compiled executable](http://pages.cs.wisc.edu/~david/courses/cs752/Fall2015/html/hw4/connected-components).  Note that
+you would not be able to generate exactly the same assembly code and the executable
+by compiling the C++11 source.  This is because I modified the generated assembly
+code to get cmov working.  I am also providing three graphs [small](http://pages.cs.wisc.edu/~david/courses/cs752/Fall2015/html/hw4/small.graph), [medium](http://pages.cs.wisc.edu/~david/courses/cs752/Fall2015/html/hw4/medium.graph) and [large](http://pages.cs.wisc.edu/~david/courses/cs752/Fall2015/html/hw4/large.graph.gz) that you will use for your experiments.  Read the C++ source to understand how to supply
+options to the executable.
+
+a. Run both versions (with branches and with cmov) on an x86 processor and report
+the run-time performance for the provided input files.  Do this exercise only for the large graph.
+Provide data as asked in part 1.
+
+b. Run both versions with gem5 and report the performance of the two
+versions for the annotated portion of the code, the number of predicted
+branches, and the percentage of incorrectly predicted branches.  You need to do this only for the small and medium graphs, not for the large one.
+Provide the data asked for in part 2 again.
+
+## What to Hand In
+Turn in your assignment by sending an email message to Nilay Vaish <nilay@cs.wisc.edu>
+and Prof. David Wood <david@cs.wisc.edu> with the subject line:
+"[CS752 Homework4]"
+
+**Please turn in your homework in the form of a PDF file.**
+
+* Answers for questions in step 1
+* Answers for questions in step 2
+* Answers for questions in step 3
diff --git a/_pages/documentation/learning_gem5/gem5_101/5_gem5_101_homework5.md b/_pages/documentation/learning_gem5/gem5_101/5_gem5_101_homework5.md
new file mode 100644
index 0000000..479e2b2
--- /dev/null
+++ b/_pages/documentation/learning_gem5/gem5_101/5_gem5_101_homework5.md
@@ -0,0 +1,63 @@
+---
+layout: documentation
+title: Homework 5 for CS 752
+doc: Learning gem5
+parent: gem5_101
+permalink: /documentation/learning_gem5/gem5_101/homework-5
+authors:
+---
+
+# Homework 5 for CS 752: Advanced Computer Architecture I (Fall 2015 Section 1 of 1)
+
+**Due Wednesday, 10/28**
+
+**You should do this assignment on your own. No late assignments.**
+
+Person of contact for this assignment: Nilay Vaish <nilay@cs.wisc.edu>
+
+The goal of this assignment is two-fold: first, for you to experience creating a new SimObject in gem5, and second, for you to consider tradeoffs in cache design.
+
+An updated caches.py for configuration can be downloaded here: <{$urlbase}html/hw5/caches.py>. You can replace the caches.py found in the previous homework's configs: <{$urlbase}html/hw4-configs.tar.gz>.
+
+## Step 1: Implement NMRU replacement policy
+
+You can follow the tutorial here: <http://pages.cs.wisc.edu/~david/courses/cs752/Spring2015/gem5-tutorial/index.html>
+Part 2 of the tutorial will walk you through how to create the NMRU policy.
+
+## Step 2: Implement PLRU replacement policy
+
+Follow similar steps as you did to implement NMRU, but implement pseudo-LRU instead.
+Pseudo-LRU uses a binary tree to encode which blocks are less recently used than other blocks in the set. These slides from Mikko Lipasti do a good job of explaining the PLRU algorithm: <https://ece752.ece.wisc.edu/lect11-cache-replacement.pdf>.
+
+## Step 3: Architectural exploration
+
+This time, the Entil CEO has tasked you with designing the L1 data cache of their new processor based on the out-of-order O3CPU. For this task, the marketing director of Entil claims that most of their customers' workload is in the matrix multiply kernel. Due to its memory intensity, Entil believes a better cache design could make their processor outperform the competition (AMM, Advanced Micro Machines, if you're keeping track).
+
+A blocked matrix multiply implementation can be downloaded here: <{$urlbase}html/hw5/mm.cpp>. Use an input of 128x128 matrix (./mm 128).
+
+You can choose from three replacement policies for the L1D cache: 'Random', 'NMRU', and 'PLRU'. As the associativity increases, the costs for NMRU and PLRU rise, whereas the cost for Random stays the same. Therefore, Random can be used with higher associativities than the other replacement policies. Additionally, because NMRU and PLRU must update the recently used bits in the tags they access, these policies limit the clock rate of the CPU. Note that the max clock of the O3 CPU is 2.3 GHz in this generation.
+
+The constraints for these policies are summarized below.
+
+|            |Random |NMRU   |PLRU    |
+|------------|-------|-------|--------|
+|Max assoc.  |16     |8      |8       |
+|Lookup time |100 ps |500 ps | 666 ps |
+
+Clearly describe in a one page memo to the CEO of Entil, all of the configurations you simulated, the results of your simulations, and your overall conclusion of how to architect the L1 data cache.
+Additionally, answer the following specific questions:
+* Why does the 16-way set-associative cache perform better/worse/similar to the 8-way set-associative cache?
+* Why does Random/NMRU/PLRU/None perform better than the other replacement policies?
+* Is the cache replacement/associativity important for this workload, or are you only getting benefits from clock cycle? Explain why the cache architecture is important/unimportant.
+
+
+## What to Hand In
+
+Turn in your assignment by sending an email message to Nilay Vaish <nilay@cs.wisc.edu> and Prof. David Wood <david@cs.wisc.edu> with the subject line:
+"[CS752 Homework5]"
+
+1. The email should contain the name and ID numbers of the student submitting
+the assignment. The files below should be attached as a zip file to the email.
+2. A patch file containing all the changes you made to gem5.
+3. stats.txt and config.ini files for all the simulations.
+4. A short report on the questions asked. The report should be in PDF.
diff --git a/_pages/documentation/learning_gem5/gem5_101/6_gem5_101_homework6.md b/_pages/documentation/learning_gem5/gem5_101/6_gem5_101_homework6.md
new file mode 100644
index 0000000..dc56b3d
--- /dev/null
+++ b/_pages/documentation/learning_gem5/gem5_101/6_gem5_101_homework6.md
@@ -0,0 +1,99 @@
+---
+layout: documentation
+title: Homework 6 - Programming multi-core
+doc: Learning gem5
+parent: gem5_101
+permalink: /documentation/learning_gem5/gem5_101/homework-6
+authors:
+---
+
+
+# CS 758: Programming Multicore Processors (Fall 2013 Section 1 of 1)
+
+
+**Due: 10/30**
+
+**You should do this assignment alone. No late assignments.**
+
+Filelist for the assignment:
+* [Template files]({$urlbase}/handouts/homeworks/hw6-dist.tgz)
+* [Intro to using Euler cluster](http://wacc.wisc.edu/documentation/EulerWalkthrough.pdf)
+
+The purpose of this assignment is for you to familiarize yourself with GPGPU computing platforms (CUDA) and to gain experience with GPGPU-specific optimizations. For this assignment you will be given a basic implementation of an algorithm which runs on the GPU, and you will progressively improve it by applying GPGPU optimization principles.
+
+**Important**:
+CUDA can be tricky, especially if you make a mistake. Error messages are often cryptic and uninformative. Start this assignment early! If you run into any problems post on the email list.
+
+## The problem
+For this assignment you will again be implementing the Ocean algorithm. You will be comparing the performance of your GPU-optimized algorithm to your solution from Homework 1. A simple solution to Homework 1 is also included in the template files; feel free to use it if you want.
+
+## The hardware
+You will be using the Euler cluster. You should have received, or will soon receive, an email with a username and temporary password. (MAKE SURE YOU RESET YOUR PASSWORD!) Read the tutorial above, which describes the hardware configuration.
+
+## Job submission
+This assignment was originally set up to submit jobs to the Torque queue.
+For this assignment, please just run jobs directly on  `euler01`.
+
+To get started:
+
+```sh
+local $ ssh user@euler.wacc.wisc.edu
+euler $ ssh euler01
+euler01 $ scp <username>@ale-01.cs.wisc.edu:/p/course/cs758-david/public/html/Fall2013/handouts/homeworks/hw6-dist.tgz .
+euler01 $ tar -x -f hw6-dist.tgz
+euler01 $ mv hw6-dist hw6
+euler01 $ cd hw6
+euler01 $ make
+euler01 $ ./serial_ocean.sh
+```
+
+You shouldn't have any problems as long as your code finishes quickly and you don't leave cuda-gdb open for long periods of time (they have come across a few bugs where cuda-gdb sometimes blocks access to all other GPUs).
+
+Information on the hardware provided in the Euler cluster is available [here](http://wacc.wisc.edu/documentation/EulerWalkthrough.pdf). You will use one of the Fermi cards (Tesla 2070/2050 or GTX 480), each of which has 448 CUDA cores (14 SMs).
+
+Distributed with CUDA 5.5 is an application called `computeprof`, which does a good job of concisely presenting the performance counters available on the NVidia GPUs. To use this program, you will need to use `ssh -X` to log in to the Euler cluster in order to forward the X server. You can then run it using `/usr/local/cuda/5.5.22/cuda/bin/computeprof`. I recommend sitting on campus while doing this, since there is much higher bandwidth. You can use `computeprof` to diagnose the bottlenecks in each implementation of the algorithm.
+
+##  Additional Information
+Dan Negrut is currently teaching a GPU Computing course (ME964). If you need additional info for your homework, you may find what you need at his course web page: <http://sbel.wisc.edu/Courses/ME964/2013/>
+There is also a forum where students in the class post questions/answers. It is here:
+<http://sbel.wisc.edu/Forum/viewforum.php?f=15>
+
+## Step 1: Porting the CPU algorithm
+I have included this implementation of the `ocean_kernel` in the [template files]({$urlbase}/handouts/homeworks/hw4-dist.tgz). You can find it in `cuda_ocean_kernels.cu` after `#ifdef VERSION1`. Although considerably more verbose, this is a mostly literal translation of the algorithm in `omp_ocean.c` with OpenMP static partitioning. Each thread gets a chunk of locations within the red/black ocean grid and updates those locations. Study this code and be sure to understand how it works.
+
+* Question a) Describe memory divergence and why it leads to poorly performing code in the SIMT model.
+* Question b) Describe the memory divergence behavior of `VERSION1` of `ocean_kernel`.
+* Question c) Vary the block size / grid size. What is the optimal block / grid size for this implementation of ocean? What is the speedup over 1 block and 1 thread ("single threaded")? Run with an input of `4098 4098 100`.
+* Question d) What is the speedup over the single-threaded CPU version? Run with an input of `4098 4098 100`.
+
+## Step 2: Reduce memory divergence (Convert algorithm to "SIMD")
+Implement `VERSION2` of `ocean_kernel`. This version of the kernel will take a step towards reducing the memory divergence. Instead of giving each thread a chunk of the array to work on, re-write the algorithm so that the threads in each block work on adjacent elements. (I.e., for a red iteration, thread 0 will work on element 0, thread 1 will work on element 2, thread 2 will work on element 4, etc.)
+
+* Question a) Describe where memory divergence still exists in this implementation of ocean.
+* Question b) Vary the block size / grid size. What is the optimal block / grid size for this implementation of ocean?
+* Question c) How does this version compare to VERSION1? Run with the optimal block sizes for each, respectively, and an input of `4098 4098 100`.
+
+## Step 3: Further reduce memory divergence (Modify data structure to be GPU-centric).
+Implement `VERSION3` of `ocean_kernel`. Instead of using one flat array to represent the ocean grid, split it into two arrays, one for the red cells and one for the black cells. You should start by writing two other kernels: one which splits the grid object into red_grid and black_grid, and one which takes red/black_grid and puts them back into the grid object.
+
+If you're feeling adventurous, feel free to add any other optimizations to this implementation. Just describe them in your write-up.
+
+* Question a) How does the performance of this version compare to VERSION2? Is this what you expected?
+* Question b) Time each kernel and the memory copies separately (ocean_kernel, and (un)split_array_kernel). Which actions are taking the most execution time? How does this affect the overall execution time of the algorithm? (`computeprof` does a good job summarizing this data)
+* Question c) Vary the block size / grid size. What is the optimal block / grid size for this implementation of ocean? Does it change when you change the problem size?
+* Question d) Describe "branch" divergence and why it leads to poorly performing code in the SIMT model. Does your code exhibit any branch divergence? If so, where?
+* Question e) Given that each node in the Euler cluster has 2 Intel Xeon E5520 processors and the GPUs have 448 CUDA cores (GTX480/C2050/C2070), how do you think the performance of your GPU version will compare to the CPU version?
+* Question f) Run either your OpenMP version of ocean or the one in the template files. How does the performance of the CPU version of Ocean compare to the GPU version, better or worse? Why do you think this is? Use omp_ocean.sh to submit the OpenMP version. Run with problem sizes 1026, 2050, 4098, and 8194 with 100 timesteps.
+* Question g) What do you think of CUDA? SIMT programming in general?
+
+
+
+## Tips and Tricks
+* Start early.
+* Be mindful that Professor Dan Negrut has been gracious to allow us to use his computing resources for this assignment.
+
+## What to Hand In
+Please turn this homework in on **paper** at the beginning of lecture. You must include:
+* A printout of your GPU kernels
+* Answers to all of the questions and supporting graphs.
+**Important:** Include your name on EVERY page.
\ No newline at end of file