# Copyright (c) 2014 ARM Limited
# All rights reserved
#
# The license below extends only to copyright in the software and shall
# not be construed as granting a license to any other intellectual
# property including but not limited to intellectual property relating
# to a hardware implementation of the functionality of the software
# licensed hereunder. You may use the software subject to the license
# terms below provided that you ensure that this notice is replicated
# unmodified and in its entirety in all distributions of the software,
# modified or unmodified, in source code or in binary form.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met: redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer;
# redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution;
# neither the name of the copyright holders nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# Authors: Andrew Bardsley

namespace Minor
{

/*!

\page minor Inside the Minor CPU model

\tableofcontents

This document contains a description of the structure and function of the
Minor gem5 in-order processor model. It is recommended reading for anyone who
wants to understand Minor's internal organisation, design decisions, C++
implementation and Python configuration. A familiarity with gem5 and some of
its internal structures is assumed. This document is meant to be read
alongside the Minor source code and to explain its general structure without
being too slavish about naming every function and data type.

\section whatis What is Minor?

Minor is an in-order processor model with a fixed pipeline but configurable
data structures and execute behaviour. It is intended to be used to model
processors with strict in-order execution behaviour and allows visualisation
of an instruction's position in the pipeline through the
MinorTrace/minorview.py format/tool. The intention is to provide a framework
for micro-architecturally correlating the model with a particular, chosen
processor with similar capabilities.

\section philo Design philosophy

\subsection mt Multithreading

The model isn't currently capable of multithreading but there are THREAD
comments in key places where stage data needs to be arrayed to support
multithreading.

\subsection structs Data structures

Decorating data structures with large amounts of life-cycle information is
avoided. Only instructions (MinorDynInst) contain a significant proportion of
their data content whose values are not set at construction.

All internal structures have fixed sizes on construction. Data held in queues
and FIFOs (MinorBuffer, FUPipeline) should have a BubbleIF interface to
allow a distinct 'bubble'/no data value option for each type.

Inter-stage 'struct' data is packaged in structures which are passed by value.
Only MinorDynInst, the line data in ForwardLineData and the memory-interfacing
objects Fetch1::FetchRequest and LSQ::LSQRequest are '::new' allocated while
running the model.

\section model Model structure

Objects of class MinorCPU are provided by the model to gem5. MinorCPU
implements the interfaces of BaseCPU (cpu.hh) and can provide data and
instruction interfaces for connection to a cache system. The model is
configured in a similar way to other gem5 models through Python. That
configuration is passed on to MinorCPU::pipeline (of class Pipeline) which
actually implements the processor pipeline.
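
As a sketch of the Python side of that arrangement, the fragment below (not a
complete script: the memory system and workload setup are omitted and the
clock value is only an example) shows MinorCPU being instantiated like any
other gem5 CPU model:

\verbatim
# Minimal sketch (not a complete gem5 configuration): selecting the Minor
# model.  The clock value is an arbitrary example; memory system and workload
# setup are omitted.
import m5
from m5.objects import System, SrcClockDomain, VoltageDomain, MinorCPU

system = System()
system.clk_domain = SrcClockDomain(clock='1GHz',
                                   voltage_domain=VoltageDomain())

# The Python parameters given here are handed to MinorCPU and, from there,
# to MinorCPU::pipeline (class Pipeline) which builds the four stages.
system.cpu = MinorCPU()
system.cpu.createThreads()
\endverbatim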

The hierarchy of major unit ownership from MinorCPU down looks like this:

<ul>
<li>MinorCPU</li>
<ul>
<li>Pipeline - container for the pipeline, owns the cyclic 'tick'
event mechanism and the idling (cycle skipping) mechanism.</li>
<ul>
<li>Fetch1 - instruction fetch unit responsible for fetching cache
lines (or parts of lines from the I-cache interface)</li>
<ul>
<li>Fetch1::IcachePort - interface to the I-cache from
Fetch1</li>
</ul>
<li>Fetch2 - line to instruction decomposition</li>
<li>Decode - instruction to micro-op decomposition</li>
<li>Execute - instruction execution and data memory
interface</li>
<ul>
<li>LSQ - load store queue for memory ref. instructions</li>
<li>LSQ::DcachePort - interface to the D-cache from
Execute</li>
</ul>
</ul>
</ul>
</ul>

\section keystruct Key data structures

\subsection ids Instruction and line identity: InstId (dyn_inst.hh)

An InstId contains the sequence numbers and thread numbers that describe the
life cycle and instruction stream affiliations of individual fetched cache
lines and instructions.

An InstId is printed in one of the following forms:

- T/S.P/L - for fetched cache lines
- T/S.P/L/F - for instructions before Decode
- T/S.P/L/F.E - for instructions from Decode onwards

for example:

- 0/10.12/5/6.7

InstId's fields are:

<table>
<tr>
<td><b>Field</b></td>
<td><b>Symbol</b></td>
<td><b>Generated by</b></td>
<td><b>Checked by</b></td>
<td><b>Function</b></td>
</tr>

<tr>
<td>InstId::threadId</td>
<td>T</td>
<td>Fetch1</td>
<td>Everywhere the thread number is needed</td>
<td>Thread number (currently always 0).</td>
</tr>

<tr>
<td>InstId::streamSeqNum</td>
<td>S</td>
<td>Execute</td>
<td>Fetch1, Fetch2, Execute (to discard lines/insts)</td>
<td>Stream sequence number as chosen by Execute. Stream
sequence numbers change after changes of PC (branches, exceptions) in
Execute and are used to separate pre and post branch instruction
streams.</td>
</tr>

<tr>
<td>InstId::predictionSeqNum</td>
<td>P</td>
<td>Fetch2</td>
<td>Fetch2 (while discarding lines after prediction)</td>
<td>Prediction sequence numbers represent branch prediction decisions.
This is used by Fetch2 to mark lines/instructions according to the last
followed branch prediction made by Fetch2. Fetch2 can signal to Fetch1
that it should change its fetch address and mark lines with a new
prediction sequence number (which it will only do if the stream sequence
number Fetch1 expects matches that of the request).</td>
</tr>

<tr>
<td>InstId::lineSeqNum</td>
<td>L</td>
<td>Fetch1</td>
<td>(Just for debugging)</td>
<td>Line fetch sequence number of this cache line or the line
this instruction was extracted from.
</td>
</tr>

<tr>
<td>InstId::fetchSeqNum</td>
<td>F</td>
<td>Fetch2</td>
<td>Fetch2 (as the inst. sequence number for branches)</td>
<td>Instruction fetch order assigned by Fetch2 when lines
are decomposed into instructions.
</td>
</tr>

<tr>
<td>InstId::execSeqNum</td>
<td>E</td>
<td>Decode</td>
<td>Execute (to check instruction identity in queues/FUs/LSQ)</td>
<td>Instruction order after micro-op decomposition.</td>
</tr>

</table>

The sequence number fields are all independent of each other and although, for
instance, InstId::execSeqNum for an instruction will always be >=
InstId::fetchSeqNum, the comparison is not useful.

The originating stage of each sequence number field keeps a counter for that
field which can be incremented in order to generate new, unique numbers.
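
To make the relationship between the fields and the printed forms concrete,
the following Python sketch (purely illustrative, not part of the model)
reproduces the T/S.P/L, T/S.P/L/F and T/S.P/L/F.E renderings described above:

\verbatim
# Illustrative-only Python mirror of InstId's fields and its printed forms.
from dataclasses import dataclass
from typing import Optional

@dataclass
class InstId:
    thread_id: int           # T, assigned by Fetch1
    stream_seq_num: int      # S, assigned by Execute
    prediction_seq_num: int  # P, assigned by Fetch2
    line_seq_num: int        # L, assigned by Fetch1
    fetch_seq_num: Optional[int] = None  # F, assigned by Fetch2
    exec_seq_num: Optional[int] = None   # E, assigned by Decode

    def __str__(self):
        # T/S.P/L for lines, then /F before Decode, then .E from Decode on
        s = '%d/%d.%d/%d' % (self.thread_id, self.stream_seq_num,
                             self.prediction_seq_num, self.line_seq_num)
        if self.fetch_seq_num is not None:
            s += '/%d' % self.fetch_seq_num
            if self.exec_seq_num is not None:
                s += '.%d' % self.exec_seq_num
        return s

print(InstId(0, 10, 12, 5, 6, 7))   # prints: 0/10.12/5/6.7
\endverbatim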

\subsection insts Instructions: MinorDynInst (dyn_inst.hh)

MinorDynInst represents an instruction's progression through the pipeline. An
instruction can be three things:

<table>
<tr>
<td><b>Thing</b></td>
<td><b>Predicate</b></td>
<td><b>Explanation</b></td>
</tr>
<tr>
<td>A bubble</td>
<td>MinorDynInst::isBubble()</td>
<td>no instruction at all, just a space-filler</td>
</tr>
<tr>
<td>A fault</td>
<td>MinorDynInst::isFault()</td>
<td>a fault to pass down the pipeline in an instruction's clothing</td>
</tr>
<tr>
<td>A decoded instruction</td>
<td>MinorDynInst::isInst()</td>
<td>instructions are actually passed to the gem5 decoder in Fetch2 and so
are created fully decoded. MinorDynInst::staticInst is the decoded
instruction form.</td>
</tr>
</table>

Instructions are reference counted using the gem5 RefCountingPtr
(base/refcnt.hh) wrapper. They therefore usually appear as MinorDynInstPtr in
code. Note that as RefCountingPtr initialises as nullptr rather than an
object that supports BubbleIF::isBubble, passing raw MinorDynInstPtrs to
Queue%s and other similar structures from stage.hh without boxing is
dangerous.

\subsection fld ForwardLineData (pipe_data.hh)

ForwardLineData is used to pass cache lines from Fetch1 to Fetch2. Like
MinorDynInst%s, they can be bubbles (ForwardLineData::isBubble()),
fault-carrying or can contain a line (partial line) fetched by Fetch1. The
data carried by ForwardLineData is owned by a Packet object returned from
memory, is explicitly memory managed and so must be deleted once processed
(by Fetch2 deleting the Packet).

\subsection fid ForwardInstData (pipe_data.hh)

ForwardInstData can contain up to ForwardInstData::width() instructions in its
ForwardInstData::insts vector. This structure is used to carry instructions
between Fetch2, Decode and Execute and to store input buffer vectors in Decode
and Execute.

\subsection fr Fetch1::FetchRequest (fetch1.hh)

FetchRequests represent I-cache line fetch requests. They are used in the
memory queues of Fetch1 and are pushed into/popped from Packet::senderState
while traversing the memory system.

FetchRequests contain a memory system Request (mem/request.hh) for that fetch
access, a packet (Packet, mem/packet.hh), if the request gets to memory, and a
fault field that can be populated with a TLB-sourced prefetch fault (if any).

\subsection lsqr LSQ::LSQRequest (execute.hh)

LSQRequests are similar to FetchRequests but for D-cache accesses. They carry
the instruction associated with a memory access.

\section pipeline The pipeline

\verbatim
------------------------------------------------------------------------------
Key:

[] : inter-stage buffer (MinorBuffer)
,--.
|  | : pipeline stage
`--'
---> : forward communication
<--- : backward communication

rv : reservation information for input buffers

              ,------.     ,------.     ,------.     ,-------.
(from --[]-v->|Fetch1|-[]->|Fetch2|-[]->|Decode|-[]->|Execute|--> (to Fetch1
 Execute)  |  |      |<-[]-|      |<-rv-|      |<-rv-|       |     & Fetch2)
           |  `------'<-rv-|      |     |      |     |       |
           `-------------->|      |     |      |     |       |
                           `------'     `------'     `-------'
------------------------------------------------------------------------------
\endverbatim

The four pipeline stages are connected together by MinorBuffer FIFO
(stage.hh, derived ultimately from TimeBuffer) structures which allow
inter-stage delays to be modelled. There is a MinorBuffer between adjacent
stages in the forward direction (for example: passing lines from Fetch1 to
Fetch2) and, between Fetch2 and Fetch1, a buffer in the backwards direction
carrying branch predictions.

Stages Fetch2, Decode and Execute have input buffers which, each cycle, can
accept input data from the previous stage and can hold that data if the stage
is not ready to process it. Input buffers store data in the same form as it
is received and so Decode and Execute's input buffers contain the output
instruction vector (ForwardInstData (pipe_data.hh)) from their previous stages
with the instructions and bubbles in the same positions as a single buffer
entry.

Stage input buffers provide a Reservable (stage.hh) interface to their
previous stages, to allow slots to be reserved in their input buffers, and
communicate their input buffer occupancy backwards to allow the previous stage
to plan whether it should make an output in a given cycle.
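
A small Python sketch of that reserve-then-push discipline (the class and its
names are illustrative; the real interface is Reservable and the input buffers
in stage.hh):

\verbatim
# Illustrative sketch of the reserve/push discipline used for stage input
# buffers.  Only the protocol is demonstrated, not the real implementation.
class InputBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []
        self.num_reserved = 0

    def can_reserve(self):
        # The previous stage asks this before deciding to produce output.
        return len(self.entries) + self.num_reserved < self.capacity

    def reserve(self):
        assert self.can_reserve()
        self.num_reserved += 1

    def push(self, data):
        # Data arrives in a slot reserved in an earlier cycle.
        assert self.num_reserved > 0
        self.num_reserved -= 1
        self.entries.append(data)

# A producing stage only generates output when space is guaranteed:
buf = InputBuffer(capacity=3)
if buf.can_reserve():
    buf.reserve()        # this cycle: claim a slot
    buf.push('line 0')   # later cycle: deliver the data into that slot
\endverbatim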

\subsection events Event handling: MinorActivityRecorder (activity.hh,
pipeline.hh)

Minor is essentially a cycle-callable model with some ability to skip cycles
based on pipeline activity. External events are mostly received by callbacks
(e.g. Fetch1::IcachePort::recvTimingResp) and cause the pipeline to be woken
up to service advancing request queues.

Ticked (sim/ticked.hh) is a base class bringing together an evaluate
member function and a provided SimObject. It provides a Ticked::start/stop
interface to start and pause clock events from being periodically issued.
Pipeline is a derived class of Ticked.

During evaluate calls, stages can signal that they still have work to do in
the next cycle by calling either MinorCPU::activityRecorder->activity() (for
non-callable related activity) or MinorCPU::wakeupOnEvent(<stageId>) (for
stage callback-related 'wakeup' activity).

Pipeline::evaluate contains calls to evaluate for each unit and a test for
pipeline idling which can turn off the clock tick if no unit has signalled
that it may become active next cycle.

Within Pipeline (pipeline.hh), the stages are evaluated in reverse order (and
so will ::evaluate in reverse order) and their backwards data can be
read immediately after being written in each cycle allowing output decisions
to be 'perfect' (allowing synchronous stalling of the whole pipeline). Branch
predictions from Fetch2 to Fetch1 can also be transported in 0 cycles making
fetch1ToFetch2BackwardDelay the only configurable delay which can be set as
low as 0 cycles.
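
The overall shape of that per-cycle loop, with reverse-order evaluation and
activity-based idling, can be sketched as follows (Python pseudocode with
illustrative names, not the C++ implementation):

\verbatim
# Python sketch of the cyclic evaluation and idling scheme.  Names are
# illustrative, not the real gem5 C++ API.
class ActivitySketch:
    """Stand-in for MinorActivityRecorder: remembers whether anything
    signalled work for the next cycle."""
    def __init__(self):
        self.signalled = False

    def activity(self):
        self.signalled = True

    def advance(self):
        # The real recorder ages per-stage activity here; this sketch just
        # clears the flag after it has been read.
        was_active = self.signalled
        self.signalled = False
        return was_active

class PipelineSketch:
    def __init__(self, fetch1, fetch2, decode, execute):
        # Reverse order: Execute is evaluated first so its backward data
        # (branches) can be read by Fetch1/Fetch2 within the same cycle.
        self.stages_in_reverse = [execute, decode, fetch2, fetch1]
        self.activity = ActivitySketch()

    def evaluate(self):
        for stage in self.stages_in_reverse:
            stage.evaluate(self.activity)   # stages call activity.activity()
        # Idling: only keep the cyclic 'tick' event scheduled while some
        # unit reports that it may be active next cycle.
        return self.activity.advance()
\endverbatim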

The MinorCPU::activateContext and MinorCPU::suspendContext interface can be
called to start and pause threads (threads in the MT sense) and to start and
pause the pipeline. Executing instructions can call this interface
(indirectly through the ThreadContext) to idle the CPU/their threads.

\subsection stages Each pipeline stage

In general, the behaviour of a stage (each cycle) is:

\verbatim
evaluate:
    push input to inputBuffer
    setup references to input/output data slots

    do 'every cycle' 'step' tasks

    if there is input and there is space in the next stage:
        process and generate a new output
        maybe re-activate the stage

    send backwards data

    if the stage generated output to the following FIFO:
        signal pipe activity

    if the stage has more processable input and space in the next stage:
        re-activate the stage for the next cycle

    commit the push to the inputBuffer if that data hasn't all been used
\endverbatim

The Execute stage differs from this model as its forward output (branch) data
is unconditionally sent to Fetch1 and Fetch2. To allow this behaviour, Fetch1
and Fetch2 must be unconditionally receptive to that data.

\subsection fetch1 Fetch1 stage

Fetch1 is responsible for fetching cache lines or partial cache lines from the
I-cache and passing them on to Fetch2 to be decomposed into instructions. It
can receive 'change of stream' indications from both Execute and Fetch2 to
signal that it should change its internal fetch address and tag newly fetched
lines with new stream or prediction sequence numbers. When both Execute and
Fetch2 signal changes of stream at the same time, Fetch1 takes Execute's
change.

Every line issued by Fetch1 will bear a unique line sequence number which can
be used for debugging stream changes.

When fetching from the I-cache, Fetch1 will ask for data from the current
fetch address (Fetch1::pc) up to the end of the 'data snap' size set in the
parameter fetch1LineSnapWidth. Subsequent autonomous line fetches will fetch
whole lines at a snap boundary and of size fetch1LineWidth.
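
The snapping amounts to aligning the end of the first request from a new PC to
a fetch1LineSnapWidth boundary and then fetching whole aligned lines; a sketch
of the address arithmetic (illustrative only, a 64-byte snap/line width is
assumed for the example values):

\verbatim
# Illustrative address arithmetic for Fetch1's line snapping.  The first
# fetch from a new PC runs up to the next snap boundary; later autonomous
# fetches are whole aligned lines of fetch1LineWidth bytes.
def first_fetch_size(pc, fetch1_line_snap_width):
    next_boundary = (pc // fetch1_line_snap_width + 1) * fetch1_line_snap_width
    return next_boundary - pc

def autonomous_fetch(addr, fetch1_line_width):
    aligned = (addr // fetch1_line_width) * fetch1_line_width
    return aligned, fetch1_line_width

# 36 bytes up to the 0x80 boundary (cf. the MinorLine example later in this
# document: size=36 vaddr=0x5c)
print(first_fetch_size(0x5c, 64))
# (128, 64): one whole aligned line starting at 0x80
print(autonomous_fetch(0x80, 64))
\endverbatim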

Fetch1 will only initiate a memory fetch if it can reserve space in Fetch2's
input buffer. That input buffer serves as the fetch queue/LFL for the system.

Fetch1 contains two queues: requests and transfers to handle the stages of
translating the address of a line fetch (via the TLB) and accommodating the
request/response of fetches to/from memory.

Fetch requests from Fetch1 are pushed into the requests queue as newly
allocated FetchRequest objects once they have been sent to the ITLB with a
call to itb->translateTiming.

A response from the TLB moves the request from the requests queue to the
transfers queue. If there is more than one entry in each queue, it is
possible to get a TLB response for a request which is not at the head of the
requests queue. In that case, the TLB response is marked up as a state change
to Translated in the request object, and advancing the request to transfers
(and the memory system) is left to calls to Fetch1::stepQueues which is called
in the cycle following the receipt of any event.

Fetch1::tryToSendToTransfers is responsible for moving requests between the
two queues and issuing requests to memory. Failed TLB lookups (prefetch
aborts) continue to occupy space in the queues until they are recovered at the
head of transfers.

Responses from memory change the request object state to Complete and
Fetch1::evaluate can pick up response data, package it in the ForwardLineData
object, and forward it to Fetch2%'s input buffer.

As space is always reserved in Fetch2::inputBuffer, setting the input buffer's
size to 1 results in non-prefetching behaviour.

When a change of stream occurs, translated requests queue members and
completed transfers queue members can be unconditionally discarded to make way
for new transfers.

\subsection fetch2 Fetch2 stage

Fetch2 receives a line from Fetch1 into its input buffer. The data in the
head line in that buffer is iterated over and separated into individual
instructions which are packed into a vector of instructions which can be
passed to Decode. Packing instructions can be aborted early if a fault is
found in either the input line as a whole or a decomposed instruction.

\subsubsection bp Branch prediction

Fetch2 contains the branch prediction mechanism. This is a wrapper around the
branch predictor interface provided by gem5 (cpu/pred/...).

Branches are predicted for any control instructions found. If prediction is
attempted for an instruction, the MinorDynInst::triedToPredict flag is set on
that instruction.

When a branch is predicted to be taken, the MinorDynInst::predictedTaken flag
is set and MinorDynInst::predictedTarget is set to the predicted target PC
value. The predicted branch instruction is then packed into Fetch2%'s output
vector, the prediction sequence number is incremented, and the branch is
communicated to Fetch1.

After signalling a prediction, Fetch2 will discard its input buffer contents
and will reject any new lines which have the same stream sequence number as
that branch but have a different prediction sequence number. This allows
following sequentially fetched lines to be rejected without ignoring new lines
generated by a change of stream indicated from a 'real' branch from Execute
(which will have a new stream sequence number).

The program counter value provided to Fetch2 by Fetch1 packets is only updated
when there is a change of stream. Fetch2::havePC indicates whether the PC
will be picked up from the next processed input line. Fetch2::havePC is
necessary to allow line-wrapping instructions to be tracked through decode.

Branches (and instructions predicted to branch) which are processed by Execute
will generate BranchData (pipe_data.hh) data explaining the outcome of the
branch which is sent forwards to Fetch1 and Fetch2. Fetch1 uses this data to
change stream (and update its stream sequence number and address for new
lines). Fetch2 uses it to update the branch predictor. Minor does not
communicate branch data to the branch predictor for instructions which are
discarded on the way to commit.

BranchData::BranchReason (pipe_data.hh) encodes the possible branch scenarios:

<table>
<tr>
<td>Branch enum val.</td>
<td>In Execute</td>
<td>Fetch1 reaction</td>
<td>Fetch2 reaction</td>
</tr>
<tr>
<td>NoBranch</td>
<td>(output bubble data)</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>CorrectlyPredictedBranch</td>
<td>Predicted, taken</td>
<td>-</td>
<td>Update BP as taken branch</td>
</tr>
<tr>
<td>UnpredictedBranch</td>
<td>Not predicted, but was taken</td>
<td>New stream</td>
<td>Update BP as taken branch</td>
</tr>
<tr>
<td>BadlyPredictedBranch</td>
<td>Predicted, but not taken</td>
<td>New stream to restore to old inst. source</td>
<td>Update BP as not taken branch</td>
</tr>
<tr>
<td>BadlyPredictedBranchTarget</td>
<td>Predicted, taken, but to a different target than the predicted one</td>
<td>New stream</td>
<td>Update BTB to new target</td>
</tr>
<tr>
<td>SuspendThread</td>
<td>Hint to suspend fetching</td>
<td>Suspend fetch for this thread (branch to next inst. as wakeup
fetch addr)</td>
<td>-</td>
</tr>
<tr>
<td>Interrupt</td>
<td>Interrupt detected</td>
<td>New stream</td>
<td>-</td>
</tr>
</table>

The parameter decodeInputWidth sets the number of instructions which can be
packed into the output per cycle. If the parameter fetch2CycleInput is true,
Fetch2 can try to take instructions from more than one entry in its input
buffer per cycle.

\subsection decode Decode stage

Decode takes a vector of instructions from Fetch2 (via its input buffer) and
decomposes those instructions into micro-ops (if necessary) and packs them
into its output instruction vector.

The parameter executeInputWidth sets the number of instructions which can be
packed into the output per cycle. If the parameter decodeCycleInput is true,
Decode can try to take instructions from more than one entry in its input
buffer per cycle.

\subsection execute Execute stage

Execute provides all the instruction execution and memory access mechanisms.
An instruction's passage through Execute can take multiple cycles with its
precise timing modelled by a functional unit pipeline FIFO.

A vector of instructions (possibly including fault 'instructions') is provided
to Execute by Decode and can be queued in the Execute input buffer before
being issued. Setting the parameter executeCycleInput allows execute to
examine more than one input buffer entry (more than one instruction vector).
The number of instructions in the input vector can be set with
executeInputWidth and the depth of the input buffer can be set with parameter
executeInputBufferSize.
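
These per-stage width, cycle-input and buffer-size knobs are ordinary MinorCPU
Python parameters; a configuration sketch (the values are examples only, not
recommendations):

\verbatim
# Example-only values for the stage width/buffering parameters discussed
# above.
from m5.objects import MinorCPU

cpu = MinorCPU()

# Fetch2 -> Decode and Decode -> Execute instruction vector widths.
cpu.decodeInputWidth = 2
cpu.executeInputWidth = 2

# Allow stages to pull from more than one input buffer entry per cycle.
cpu.fetch2CycleInput = True
cpu.decodeCycleInput = True
cpu.executeCycleInput = True

# Depth of Execute's input buffer (entries of ForwardInstData).
cpu.executeInputBufferSize = 7
\endverbatim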

\subsubsection fus Functional units

The Execute stage contains pipelines for each functional unit comprising the
computational core of the CPU. Functional units are configured via the
executeFuncUnits parameter. Each functional unit has a number of instruction
classes it supports, a stated delay between instruction issues, a delay
from instruction issue to (possible) commit, and an optional timing annotation
capable of expressing more complicated timing.
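
Functional units are described on the Python side (MinorCPU.py). The sketch
below builds a small custom pool; the class and field names (MinorFUPool,
MinorFU, MinorOpClassSet, opLat, issueLat, ...) follow MinorCPU.py as the
author recalls it and should be checked against the gem5 version in use:

\verbatim
# Sketch of a custom functional unit pool for Execute.  Verify class and
# parameter names against MinorCPU.py in your gem5 tree.
from m5.objects import (MinorCPU, MinorFUPool, MinorFU,
                        MinorOpClassSet, MinorOpClass)

class MyIntFU(MinorFU):
    # Instruction classes this unit accepts.
    opClasses = MinorOpClassSet(opClasses=[MinorOpClass(opClass='IntAlu')])
    opLat = 3      # cycles from issue to (possible) commit
    issueLat = 1   # cycles between successive issues to this unit

class MyMemFU(MinorFU):
    opClasses = MinorOpClassSet(opClasses=[MinorOpClass(opClass='MemRead'),
                                           MinorOpClass(opClass='MemWrite')])
    opLat = 1
    issueLat = 1

class MyFUPool(MinorFUPool):
    funcUnits = [MyIntFU(), MyIntFU(), MyMemFU()]

cpu = MinorCPU()
cpu.executeFuncUnits = MyFUPool()
\endverbatim

The default pool supplied with the model is constructed in the same way, so it
can be used as a starting point for experiments.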

Each active cycle, Execute::evaluate performs this action:

\verbatim
Execute::evaluate:
    push input to inputBuffer
    setup references to input/output data slots and branch output slot

    step D-cache interface queues (similar to Fetch1)

    if interrupt posted:
        take interrupt (signalling branch to Fetch1/Fetch2)
    else
        commit instructions
        issue new instructions

    advance functional unit pipelines

    reactivate Execute if the unit is still active

    commit the push to the inputBuffer if that data hasn't all been used
\endverbatim

\subsubsection fifos Functional unit FIFOs

Functional units are implemented as SelfStallingPipelines (stage.hh). These
are TimeBuffer FIFOs with two distinct 'push' and 'pop' wires. They respond
to SelfStallingPipeline::advance in the same way as TimeBuffers <b>unless</b>
there is data at the far, 'pop', end of the FIFO. A 'stalled' flag is
provided for signalling stalling and to allow a stall to be cleared. The
intention is to provide a pipeline for each functional unit which will never
advance an instruction out of that pipeline until it has been processed and
the pipeline is explicitly unstalled.
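
A Python model of that self-stalling behaviour (illustrative only; the real
implementation is SelfStallingPipeline in stage.hh and tracks occupancy rather
than using a plain list):

\verbatim
# Illustrative model of a self-stalling FU pipeline: advance() moves contents
# forward unless something is waiting, unserviced, at the 'pop' end.
class SelfStallingPipelineSketch:
    def __init__(self, depth):
        self.slots = [None] * depth   # index 0 = push end, -1 = pop end
        self.stalled = False

    def push(self, inst):
        assert self.slots[0] is None, "push slot already occupied"
        self.slots[0] = inst

    def front(self):
        return self.slots[-1]         # instruction ready to be committed

    def advance(self):
        # Stall (and stay stalled) while there is data at the pop end.
        self.stalled = self.slots[-1] is not None
        if not self.stalled:
            self.slots = [None] + self.slots[:-1]

# Execute pops the head and clears the stall once the instruction is handled:
fu = SelfStallingPipelineSketch(depth=3)
fu.push('inst 1')
for _ in range(3):
    fu.advance()
print(fu.front(), fu.stalled)   # 'inst 1' has reached the pop end; stalled
\endverbatim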

The actions 'issue', 'commit', and 'advance' act on the functional units.

\subsubsection issue Issue

Issuing instructions involves iterating over both the input buffer
instructions and the heads of the functional units to try and issue
instructions in order. The number of instructions which can be issued each
cycle is limited by the parameter executeIssueLimit, how executeCycleInput is
set, the availability of pipeline space and the policy used to choose a
pipeline in which the instruction can be issued.

At present, the only issue policy is strict round-robin visiting of each
pipeline with the given instructions in sequence. For greater flexibility,
better (and more specific) policies will need to be possible.
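
The round-robin visit can be pictured with the following Python sketch
(illustrative; the real logic, including fault handling and the
executeCycleInput interaction, lives in Execute::issue):

\verbatim
# Illustrative round-robin issue order: visit the FUs in a fixed cyclic order
# and place each instruction into the first unit that can currently accept
# it.  In-order issue means giving up as soon as one instruction cannot be
# placed.  'can_accept' stands in for the class/space/stall checks.
def try_issue(insts, fus, can_accept, issue_limit):
    issued = 0
    fu_index = 0
    for inst in insts:
        placed = False
        for offset in range(len(fus)):
            fu = fus[(fu_index + offset) % len(fus)]
            if can_accept(fu, inst):
                fu.push(inst)
                fu_index = (fu_index + offset + 1) % len(fus)
                issued += 1
                placed = True
                break
        if not placed or issued == issue_limit:
            break
    return issued
\endverbatim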

Memory operation instructions traverse their functional units to perform their
EA calculations. On 'commit', the ExecContext::initiateAcc execution phase is
performed and any memory access is issued (via ExecContext::{read,write}Mem
calling LSQ::pushRequest) to the LSQ.

Note that faults are issued as if they are instructions and can (currently) be
issued to *any* functional unit.

Every issued instruction is also pushed into the Execute::inFlightInsts queue.
Memory ref. instructions are also pushed into the Execute::inFUMemInsts queue.

\subsubsection commit Commit

Instructions are committed by examining the head of the Execute::inFlightInsts
queue (which is decorated with the functional unit number to which the
instruction was issued). Instructions which can then be found in their
functional units are executed and popped from Execute::inFlightInsts.

Memory operation instructions are committed into the memory queues (as
described above) and exit their functional unit pipeline but are not popped
from the Execute::inFlightInsts queue. The Execute::inFUMemInsts queue
provides ordering to memory operations as they pass through the functional
units (maintaining issue order). On entering the LSQ, instructions are popped
from Execute::inFUMemInsts.

If the parameter executeAllowEarlyMemoryIssue is set, memory operations can be
sent from their FU to the LSQ before reaching the head of
Execute::inFlightInsts but after their dependencies are met.
MinorDynInst::instToWaitFor is marked up with the latest dependent instruction
execSeqNum required to be committed for a memory operation to progress to the
LSQ.

Once a memory response is available (by testing the head of
Execute::inFlightInsts against LSQ::findResponse), commit will process that
response (ExecContext::completeAcc) and pop the instruction from
Execute::inFlightInsts.

Any branch, fault or interrupt will cause a stream sequence number change and
signal a branch to Fetch1/Fetch2. Only instructions with the current stream
sequence number will be issued and/or committed.

\subsubsection advance Advance

All non-stalled pipelines are advanced and may, thereafter, become stalled.
Potential activity in the next cycle is signalled if there are any
instructions remaining in any pipeline.

\subsubsection sb Scoreboard

The scoreboard (Scoreboard) is used to control instruction issue. It contains
a count of the number of in flight instructions which will write each general
purpose CPU integer or float register. Instructions will only be issued when
the scoreboard count is 0 for every one of the instruction's source
registers.

Once an instruction is issued, the scoreboard count for each of the
instruction's destination registers is incremented.

The estimated delivery time of the instruction's result is marked up in the
scoreboard by adding the length of the issued-to FU to the current time. The
timings parameter on each FU provides a list of additional rules for
calculating the delivery time. These are documented in the parameter comments
in MinorCPU.py.

On commit (for memory operations, at memory response commit), the scoreboard
counters for an instruction's destination registers are decremented.
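
Put together, the scoreboard bookkeeping looks roughly like the following
Python sketch (register indexing and the delivery-time refinement are
simplified away):

\verbatim
# Simplified sketch of the issue scoreboard: one in-flight-writer count per
# register; issue is blocked while any source register has a non-zero count.
class ScoreboardSketch:
    def __init__(self, num_regs):
        self.counts = [0] * num_regs

    def can_issue(self, src_regs):
        return all(self.counts[r] == 0 for r in src_regs)

    def mark_issue(self, dest_regs):
        for r in dest_regs:
            self.counts[r] += 1

    def mark_commit(self, dest_regs):
        # For memory operations this happens at memory response commit.
        for r in dest_regs:
            self.counts[r] -= 1

sb = ScoreboardSketch(num_regs=32)
sb.mark_issue(dest_regs=[3])          # e.g. 'add r3, r1, r2' issued
print(sb.can_issue(src_regs=[3]))     # False: a dependent inst must wait
sb.mark_commit(dest_regs=[3])
print(sb.can_issue(src_regs=[3]))     # True once the writer has committed
\endverbatim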

\subsubsection ifi Execute::inFlightInsts

The Execute::inFlightInsts queue will always contain all instructions in
flight in Execute in the correct issue order. Execute::issue is the only
process which will push an instruction into the queue. Execute::commit is the
only process that can pop an instruction.

\subsubsection lsq LSQ

The LSQ can support multiple outstanding transactions to memory in a number of
conservative cases.

There are three queues to contain requests: requests, transfers and the store
buffer. The requests and transfers queue operate in a similar manner to the
queues in Fetch1. The store buffer is used to decouple the delay of
completing store operations from following loads.

Requests are issued to the DTLB as their instructions leave their functional
unit. At the head of requests, cacheable load requests can be sent to memory
and on to the transfers queue. Cacheable stores will be passed to transfers
unprocessed and progress that queue maintaining order with other transactions.

The conditions in LSQ::tryToSendToTransfers dictate when requests can
be sent to memory.

All uncacheable transactions, split transactions and locked transactions are
processed in order at the head of requests. Additionally, store results
residing in the store buffer can have their data forwarded to cacheable loads
(removing the need to perform a read from memory) but no cacheable load can be
issued to the transfers queue until that queue's stores have drained into the
store buffer.
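
As an illustration of the forwarding condition (not the LSQ source, which also
has to deal with split and uncacheable accesses), a load can take its data
from the newest store-buffer entry that wholly covers it:

\verbatim
# Illustrative store-to-load forwarding check over a store buffer held as a
# list with the newest store last.
def forward_store_data(store_buffer, load_addr, load_size):
    for store in reversed(store_buffer):          # newest store first
        store_addr, data = store['addr'], store['data']
        if store_addr <= load_addr and \
                load_addr + load_size <= store_addr + len(data):
            offset = load_addr - store_addr
            return data[offset:offset + load_size]
    return None   # no cover: the load must go to memory

stores = [{'addr': 0x1000, 'data': bytes(range(8))}]
print(forward_store_data(stores, 0x1002, 4))   # b'\x02\x03\x04\x05'
print(forward_store_data(stores, 0x2000, 4))   # None -> read from memory
\endverbatim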

At the end of transfers, requests which are LSQ::LSQRequest::Complete (are
faulting, are cacheable stores, or have been sent to memory and received a
response) can be picked off by Execute and committed
(ExecContext::completeAcc) and, for stores, sent to the store buffer.

Barrier instructions do not prevent cacheable loads from progressing to memory
but do cause a stream change which will discard that load. Stores will not be
committed to the store buffer if they are in the shadow of the barrier but
before the new instruction stream has arrived at Execute. As all other memory
transactions are delayed at the end of the requests queue until they are at
the head of Execute::inFlightInsts, they will be discarded by any barrier
stream change.

After commit, LSQ::BarrierDataRequest requests are inserted into the
store buffer to track each barrier until all preceding memory transactions
have drained from the store buffer. No further memory transactions will be
issued from the ends of FUs until after the barrier has drained.

\subsubsection drain Draining

Draining is mostly handled by the Execute stage. When initiated by calling
MinorCPU::drain, Pipeline::evaluate checks the draining status of each unit
each cycle and keeps the pipeline active until draining is complete. It is
Pipeline that signals the completion of draining. Execute is triggered by
MinorCPU::drain and starts stepping through its Execute::DrainState state
machine, starting from state Execute::NotDraining, in this order:

<table>
<tr>
<td><b>State</b></td>
<td><b>Meaning</b></td>
</tr>
<tr>
<td>Execute::NotDraining</td>
<td>Not trying to drain, normal execution</td>
</tr>
<tr>
<td>Execute::DrainCurrentInst</td>
<td>Draining micro-ops to complete inst.</td>
</tr>
<tr>
<td>Execute::DrainHaltFetch</td>
<td>Halt fetching instructions</td>
</tr>
<tr>
<td>Execute::DrainAllInsts</td>
<td>Discarding all instructions presented</td>
</tr>
</table>

When complete, a drained Execute unit will be in the Execute::DrainAllInsts
state where it will continue to discard instructions but has no knowledge of
the drained state of the rest of the model.

\section debug Debug options

The model provides a number of debug flags which can be passed to gem5 with
the --debug-flags option.

The available flags are:

<table>
<tr>
<td><b>Debug flag</b></td>
<td><b>Unit which will generate debugging output</b></td>
</tr>
<tr>
<td>Activity</td>
<td>Debug ActivityMonitor actions</td>
</tr>
<tr>
<td>Branch</td>
<td>Fetch2 and Execute branch prediction decisions</td>
</tr>
<tr>
<td>MinorCPU</td>
<td>CPU global actions such as wakeup/thread suspension</td>
</tr>
<tr>
<td>Decode</td>
<td>Decode</td>
</tr>
<tr>
<td>MinorExec</td>
<td>Execute behaviour</td>
</tr>
<tr>
<td>Fetch</td>
<td>Fetch1 and Fetch2</td>
</tr>
<tr>
<td>MinorInterrupt</td>
<td>Execute interrupt handling</td>
</tr>
<tr>
<td>MinorMem</td>
<td>Execute memory interactions</td>
</tr>
<tr>
<td>MinorScoreboard</td>
<td>Execute scoreboard activity</td>
</tr>
<tr>
<td>MinorTrace</td>
<td>Generate MinorTrace cyclic state trace output (see below)</td>
</tr>
<tr>
<td>MinorTiming</td>
<td>MinorTiming instruction timing modification operations</td>
</tr>
</table>

The group flag Minor enables all of the flags beginning with Minor.
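
For example, a run combining several of these flags might be invoked as
below; the build path, the CPU type name and the workload are illustrative
and depend on the gem5 version and build in use:

\verbatim
./build/ARM/gem5.opt --debug-flags=Fetch,MinorExec,MinorMem \
    --debug-file=minor-debug.txt \
    configs/example/se.py --cpu-type=minor --caches \
    -c tests/test-progs/hello/bin/arm/linux/hello
\endverbatim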

\section trace MinorTrace and minorview.py

The debug flag MinorTrace causes cycle-by-cycle state data to be printed which
can then be processed and viewed by the minorview.py tool. This output is
very verbose and so it is recommended it only be used for small examples.

\subsection traceformat MinorTrace format

There are three types of line outputted by MinorTrace:

\subsubsection state MinorTrace - Ticked unit cycle state

For example:

\verbatim
110000: system.cpu.dcachePort: MinorTrace: state=MemoryRunning in_tlb_mem=0/0
\endverbatim

For each time step, the MinorTrace flag will cause one MinorTrace line to be
printed for every named element in the model.

\subsubsection traceunit MinorInst - summaries of instructions issued by \
Decode

For example:

\verbatim
140000: system.cpu.execute: MinorInst: id=0/1.1/1/1.1 addr=0x5c \
inst=" mov r0, #0" class=IntAlu
\endverbatim

MinorInst lines are currently only generated for instructions which are
committed.

\subsubsection tracefetch1 MinorLine - summaries of line fetches issued by \
Fetch1

For example:

\verbatim
92000: system.cpu.icachePort: MinorLine: id=0/1.1/1 size=36 \
vaddr=0x5c paddr=0x5c
\endverbatim

\subsection minorview minorview.py

Minorview (util/minorview.py) can be used to visualise the data created by
MinorTrace.

\verbatim
usage: minorview.py [-h] [--picture picture-file] [--prefix name]
                    [--start-time time] [--end-time time] [--mini-views]
                    event-file

Minor visualiser

positional arguments:
  event-file

optional arguments:
  -h, --help            show this help message and exit
  --picture picture-file
                        markup file containing blob information (default:
                        <minorview-path>/minor.pic)
  --prefix name         name prefix in trace for CPU to be visualised
                        (default: system.cpu)
  --start-time time     time of first event to load from file
  --end-time time       time of last event to load from file
  --mini-views          show tiny views of the next 10 time steps
\endverbatim

Raw debugging output can be passed to minorview.py as the event-file. It will
pick out the MinorTrace lines; other lines which name units in the simulation
(such as system.cpu.dcachePort in the above example) will appear as 'comments'
when those units are clicked on in the visualiser.

Clicking on a unit which contains instructions or lines will bring up a speech
bubble giving extra information derived from the MinorInst/MinorLine lines.

--start-time and --end-time allow only sections of debug files to be loaded.

--prefix allows the name prefix of the CPU to be inspected to be supplied.
This defaults to 'system.cpu'.

In the visualiser, the buttons Start, End, Back, Forward, Play and Stop can be
used to control the displayed simulation time.

The diagonally striped coloured blocks are showing the InstId of the
instruction or line they represent. Note that lines in Fetch1 and f1ToF2.F
only show the id fields of a line and that instructions in Fetch2, f2ToD, and
decode.inputBuffer do not yet have execute sequence numbers. The T/S.P/L/F.E
buttons can be used to toggle parts of InstId on and off to make it easier to
understand the display. Useful combinations are:

<table>
<tr>
<td><b>Combination</b></td>
<td><b>Reason</b></td>
</tr>
<tr>
<td>E</td>
<td>just show the final execute sequence number</td>
</tr>
<tr>
<td>F/E</td>
<td>show the instruction-related numbers</td>
</tr>
<tr>
<td>S/P</td>
<td>show just the stream-related numbers (watch the stream sequence
change with branches and not change with predicted branches)</td>
</tr>
<tr>
<td>S/E</td>
<td>show instructions and their stream</td>
</tr>
</table>

The key to the right shows all the displayable colours (some of the colour
choices are quite bad!):

<table>
<tr>
<td><b>Symbol</b></td>
<td><b>Meaning</b></td>
</tr>
<tr>
<td>U</td>
<td>Unknown data</td>
</tr>
<tr>
<td>B</td>
<td>Blocked stage</td>
</tr>
<tr>
<td>-</td>
<td>Bubble</td>
</tr>
<tr>
<td>E</td>
<td>Empty queue slot</td>
</tr>
<tr>
<td>R</td>
<td>Reserved queue slot</td>
</tr>
<tr>
<td>F</td>
<td>Fault</td>
</tr>
<tr>
<td>r</td>
<td>Read (used as the leftmost stripe on data in the dcachePort)</td>
</tr>
<tr>
<td>w</td>
<td>Write (likewise)</td>
</tr>
<tr>
<td>0 to 9</td>
<td>last decimal digit of the corresponding data</td>
</tr>
</table>

\verbatim

,---------------. .--------------. *U
| |=|->|=|->|=| | ||=|||->||->|| | *- <- Fetch queues/LSQ
`---------------' `--------------' *R
=== ====== *w <- Activity/Stage activity
,--------------. *1
,--. ,. ,. | ============ | *3 <- Scoreboard
| |-\[]-\||-\[]-\||-\[]-\| ============ | *5 <- Execute::inFlightInsts
| | :[] :||-/[]-/||-/[]-/| -. -------- | *7
| |-/[]-/|| ^ || | | --------- | *9
| | || | || | | ------ |
[]->| | ->|| | || | | ---- |
| |<-[]<-||<-+-<-||<-[]<-| | ------ |->[] <- Execute to Fetch1,
'--` `' ^ `' | -' ------ | Fetch2 branch data
---. | ---. `--------------'
---' | ---' ^ ^
| ^ | `------------ Execute
MinorBuffer ----' input `-------------------- Execute input buffer
buffer
\endverbatim

Stages show the colours of the instructions currently being
generated/processed.

Forward FIFOs between stages show the data being pushed into them at the
current tick (to the left), the data in transit, and the data available at
their outputs (to the right).

The backwards FIFO between Fetch2 and Fetch1 shows branch prediction data.

In general, all displayed data is correct at the end of a cycle's activity at
the time indicated but before the inter-stage FIFOs are ticked. Each FIFO
has, therefore, an extra slot to show the asserted new input data, and all the
data currently within the FIFO.

Input buffers for each stage are shown below the corresponding stage and show
the contents of those buffers as horizontal strips. Strips marked as reserved
(cyan by default) are reserved to be filled by the previous stage. An input
buffer with all reserved or occupied slots will, therefore, block the previous
stage from generating output.

Fetch queues and LSQ show the lines/instructions in the queues of each
interface and show the number of lines/instructions in TLB and memory in the
two striped colours of the top of their frames.

Inside Execute, the horizontal bars represent the individual FU pipelines.
The vertical bar to the left is the input buffer and the bar to the right, the
instructions committed this cycle. The background of Execute shows
instructions which are being committed this cycle in their original FU
pipeline positions.

The strip at the top of the Execute block shows the current streamSeqNum that
Execute is committing. A similar stripe at the top of Fetch1 shows that
stage's expected streamSeqNum and the stripe at the top of Fetch2 shows its
issuing predictionSeqNum.

The scoreboard shows the number of instructions in flight which will commit a
result to the register in the position shown. The scoreboard contains slots
for each integer and floating point register.

The Execute::inFlightInsts queue shows all the instructions in flight in
Execute with the oldest instruction (the next instruction to be committed) to
the right.

'Stage activity' shows the signalled activity (as E/1) for each stage (with
CPU miscellaneous activity to the left).

'Activity' shows a count of stage and pipe activity.

\subsection picformat minor.pic format

The minor.pic file (src/minor/minor.pic) describes the layout of the
model's blocks on the visualiser. Its format is described in the supplied
minor.pic file.

*/

}