| # Copyright (c) 2014 ARM Limited |
| # All rights reserved |
| # |
| # The license below extends only to copyright in the software and shall |
| # not be construed as granting a license to any other intellectual |
| # property including but not limited to intellectual property relating |
| # to a hardware implementation of the functionality of the software |
| # licensed hereunder. You may use the software subject to the license |
| # terms below provided that you ensure that this notice is replicated |
| # unmodified and in its entirety in all distributions of the software, |
| # modified or unmodified, in source code or in binary form. |
| # |
| # Redistribution and use in source and binary forms, with or without |
| # modification, are permitted provided that the following conditions are |
| # met: redistributions of source code must retain the above copyright |
| # notice, this list of conditions and the following disclaimer; |
| # redistributions in binary form must reproduce the above copyright |
| # notice, this list of conditions and the following disclaimer in the |
| # documentation and/or other materials provided with the distribution; |
| # neither the name of the copyright holders nor the names of its |
| # contributors may be used to endorse or promote products derived from |
| # this software without specific prior written permission. |
| # |
| # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS |
| # "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT |
| # LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR |
| # A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT |
| # OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, |
| # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT |
| # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, |
| # DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY |
| # THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT |
| # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE |
| # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
| # |
| # Authors: Andrew Bardsley |
| |
| namespace Minor |
| { |
| |
| /*! |
| |
| \page minor Inside the Minor CPU model |
| |
| \tableofcontents |
| |
| This document contains a description of the structure and function of the |
| Minor gem5 in-order processor model. It is recommended reading for anyone who |
| wants to understand Minor's internal organisation, design decisions, C++ |
| implementation and Python configuration. A familiarity with gem5 and some of |
| its internal structures is assumed. This document is meant to be read |
| alongside the Minor source code and to explain its general structure without |
| being too slavish about naming every function and data type. |
| |
| \section whatis What is Minor? |
| |
| Minor is an in-order processor model with a fixed pipeline but configurable |
| data structures and execute behaviour. It is intended to be used to model |
| processors with strict in-order execution behaviour and allows visualisation |
| of an instruction's position in the pipeline through the |
| MinorTrace/minorview.py format/tool. The intention is to provide a framework |
| for micro-architecturally correlating the model with a particular, chosen |
| processor with similar capabilities. |
| |
| \section philo Design philosophy |
| |
| \subsection mt Multithreading |
| |
| The model isn't currently capable of multithreading but there are THREAD |
| comments in key places where stage data needs to be arrayed to support |
| multithreading. |
| |
| \subsection structs Data structures |
| |
| Decorating data structures with large amounts of life-cycle information is |
| avoided. Only instructions (MinorDynInst) contain a significant amount of
| data whose values are not set at construction.
| |
| All internal structures have fixed sizes on construction. Data held in queues |
| and FIFOs (MinorBuffer, FUPipeline) should have a BubbleIF interface to |
| allow a distinct 'bubble'/no data value option for each type. |
| |
| Inter-stage 'struct' data is packaged in structures which are passed by value. |
| Only MinorDynInst, the line data in ForwardLineData and the memory-interfacing |
| objects Fetch1::FetchRequest and LSQ::LSQRequest are '::new' allocated while |
| running the model. |
| |
| \section model Model structure |
| |
| Objects of class MinorCPU are provided by the model to gem5. MinorCPU
| implements the interfaces of BaseCPU (cpu.hh) and can provide data and
| instruction interfaces for connection to a cache system. The model is
| configured in a similar way to other gem5 models through Python. That |
| configuration is passed on to MinorCPU::pipeline (of class Pipeline) which |
| actually implements the processor pipeline. |
| |
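| As a rough illustration (a fragment rather than a complete script, and
| assuming a System called 'system' with a memory bus already configured in
| the usual gem5 style), a MinorCPU is instantiated and connected like any
| other CPU model:
|
| \verbatim
| # Configuration fragment; 'system' and 'system.membus' are assumed to be
| # set up elsewhere in the script.
| from m5.objects import MinorCPU
|
| system.cpu = MinorCPU()
|
| # The instruction and data interfaces are exposed as the usual gem5 CPU
| # ports and can be connected to caches or straight to a memory bus.
| system.cpu.icache_port = system.membus.slave
| system.cpu.dcache_port = system.membus.slave
| \endverbatim
|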
| The hierarchy of major unit ownership from MinorCPU down looks like this: |
| |
| <ul> |
| <li>MinorCPU</li> |
| <ul> |
| <li>Pipeline - container for the pipeline, owns the cyclic 'tick' |
| event mechanism and the idling (cycle skipping) mechanism.</li> |
| <ul> |
|         <li>Fetch1 - instruction fetch unit responsible for fetching cache
|         lines (or parts of lines) from the I-cache interface</li>
| <ul> |
| <li>Fetch1::IcachePort - interface to the I-cache from |
| Fetch1</li> |
| </ul> |
| <li>Fetch2 - line to instruction decomposition</li> |
| <li>Decode - instruction to micro-op decomposition</li> |
| <li>Execute - instruction execution and data memory |
| interface</li> |
| <ul> |
| <li>LSQ - load store queue for memory ref. instructions</li> |
| <li>LSQ::DcachePort - interface to the D-cache from |
| Execute</li> |
| </ul> |
| </ul> |
| </ul> |
| </ul> |
| |
| \section keystruct Key data structures |
| |
| \subsection ids Instruction and line identity: InstId (dyn_inst.hh) |
| |
| An InstId contains the sequence numbers and thread numbers that describe the |
| life cycle and instruction stream affiliations of individual fetched cache |
| lines and instructions. |
| |
| An InstId is printed in one of the following forms: |
| |
| - T/S.P/L - for fetched cache lines |
| - T/S.P/L/F - for instructions before Decode |
| - T/S.P/L/F.E - for instructions from Decode onwards |
| |
| for example: |
| |
| - 0/10.12/5/6.7 |
| |
| InstId's fields are: |
| |
| <table> |
| <tr> |
| <td><b>Field</b></td> |
| <td><b>Symbol</b></td> |
| <td><b>Generated by</b></td> |
| <td><b>Checked by</b></td> |
| <td><b>Function</b></td> |
| </tr> |
| |
| <tr> |
| <td>InstId::threadId</td> |
| <td>T</td> |
| <td>Fetch1</td> |
| <td>Everywhere the thread number is needed</td> |
| <td>Thread number (currently always 0).</td> |
| </tr> |
| |
| <tr> |
| <td>InstId::streamSeqNum</td> |
| <td>S</td> |
| <td>Execute</td> |
| <td>Fetch1, Fetch2, Execute (to discard lines/insts)</td> |
| <td>Stream sequence number as chosen by Execute. Stream |
| sequence numbers change after changes of PC (branches, exceptions) in |
| Execute and are used to separate pre and post branch instruction |
| streams.</td> |
| </tr> |
| |
| <tr> |
| <td>InstId::predictionSeqNum</td> |
| <td>P</td> |
| <td>Fetch2</td> |
| <td>Fetch2 (while discarding lines after prediction)</td> |
| <td>Prediction sequence numbers represent branch prediction decisions. |
| This is used by Fetch2 to mark lines/instructions according to the last |
| followed branch prediction made by Fetch2. Fetch2 can signal to Fetch1 |
| that it should change its fetch address and mark lines with a new |
| prediction sequence number (which it will only do if the stream sequence |
| number Fetch1 expects matches that of the request). </td> </tr> |
| |
| <tr> |
| <td>InstId::lineSeqNum</td> |
| <td>L</td> |
| <td>Fetch1</td> |
| <td>(Just for debugging)</td> |
| <td>Line fetch sequence number of this cache line or the line |
| this instruction was extracted from. |
| </td> |
| </tr> |
| |
| <tr> |
| <td>InstId::fetchSeqNum</td> |
| <td>F</td> |
| <td>Fetch2</td> |
| <td>Fetch2 (as the inst. sequence number for branches)</td> |
| <td>Instruction fetch order assigned by Fetch2 when lines |
| are decomposed into instructions. |
| </td> |
| </tr> |
| |
| <tr> |
| <td>InstId::execSeqNum</td> |
| <td>E</td> |
| <td>Decode</td> |
| <td>Execute (to check instruction identity in queues/FUs/LSQ)</td> |
| <td>Instruction order after micro-op decomposition.</td> |
| </tr> |
| |
| </table> |
| |
| The sequence number fields are all independent of each other and although, for |
| instance, InstId::execSeqNum for an instruction will always be >= |
| InstId::fetchSeqNum, the comparison is not useful. |
| |
| The originating stage of each sequence number field keeps a counter for that |
| field which can be incremented in order to generate new, unique numbers. |
| |
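| As an aid when reading these identifiers in MinorTrace output, the following
| hypothetical Python helper (not part of the model) splits a printed InstId
| back into its named fields:
|
| \verbatim
| def parse_inst_id(text):
|     """Split a printed InstId (e.g. '0/10.12/5/6.7') into named fields."""
|     parts = text.split('/')
|     stream, prediction = (int(x) for x in parts[1].split('.'))
|     fields = {'threadId': int(parts[0]),
|               'streamSeqNum': stream,
|               'predictionSeqNum': prediction,
|               'lineSeqNum': int(parts[2])}
|     if len(parts) == 4:                  # instruction rather than line form
|         if '.' in parts[3]:              # T/S.P/L/F.E (from Decode onwards)
|             fetch, execute = (int(x) for x in parts[3].split('.'))
|             fields['fetchSeqNum'] = fetch
|             fields['execSeqNum'] = execute
|         else:                            # T/S.P/L/F (before Decode)
|             fields['fetchSeqNum'] = int(parts[3])
|     return fields
| \endverbatim
|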
| \subsection insts Instructions: MinorDynInst (dyn_inst.hh) |
| |
| MinorDynInst represents an instruction's progression through the pipeline. An |
| instruction can be three things: |
| |
| <table> |
| <tr> |
| <td><b>Thing</b></td> |
| <td><b>Predicate</b></td> |
| <td><b>Explanation</b></td> |
| </tr> |
| <tr> |
| <td>A bubble</td> |
| <td>MinorDynInst::isBubble()</td> |
| <td>no instruction at all, just a space-filler</td> |
| </tr> |
| <tr> |
| <td>A fault</td> |
| <td>MinorDynInst::isFault()</td> |
| <td>a fault to pass down the pipeline in an instruction's clothing</td> |
| </tr> |
| <tr> |
| <td>A decoded instruction</td> |
| <td>MinorDynInst::isInst()</td> |
| <td>instructions are actually passed to the gem5 decoder in Fetch2 and so |
| are created fully decoded. MinorDynInst::staticInst is the decoded |
| instruction form.</td> |
| </tr> |
| </table> |
| |
| Instructions are reference counted using the gem5 RefCountingPtr |
| (base/refcnt.hh) wrapper. They therefore usually appear as MinorDynInstPtr in |
| code. Note that as RefCountingPtr initialises as nullptr rather than an |
| object that supports BubbleIF::isBubble, passing raw MinorDynInstPtrs to |
| Queue%s and other similar structures from stage.hh without boxing is |
| dangerous. |
| |
| \subsection fld ForwardLineData (pipe_data.hh) |
| |
| ForwardLineData is used to pass cache lines from Fetch1 to Fetch2. Like |
| MinorDynInst%s, they can be bubbles (ForwardLineData::isBubble()), |
| fault-carrying or can contain a line (partial line) fetched by Fetch1. The |
| data carried by ForwardLineData is owned by a Packet object returned from |
| memory and is explicitly memory managed and so must be deleted once processed
| (by Fetch2 deleting the Packet). |
| |
| \subsection fid ForwardInstData (pipe_data.hh) |
| |
| ForwardInstData can contain up to ForwardInstData::width() instructions in its |
| ForwardInstData::insts vector. This structure is used to carry instructions |
| between Fetch2, Decode and Execute and to store input buffer vectors in Decode |
| and Execute. |
| |
| \subsection fr Fetch1::FetchRequest (fetch1.hh) |
| |
| FetchRequests represent I-cache line fetch requests. They are used in the
| memory queues of Fetch1 and are pushed into/popped from Packet::senderState |
| while traversing the memory system. |
| |
| FetchRequests contain a memory system Request (mem/request.hh) for that fetch |
| access, a packet (Packet, mem/packet.hh), if the request gets to memory, and a |
| fault field that can be populated with a TLB-sourced prefetch fault (if any). |
| |
| \subsection lsqr LSQ::LSQRequest (execute.hh) |
| |
| LSQRequests are similar to FetchRequests but for D-cache accesses. They carry |
| the instruction associated with a memory access. |
| |
| \section pipeline The pipeline |
| |
| \verbatim |
| ------------------------------------------------------------------------------ |
| Key: |
| |
| [] : inter-stage MinorBuffer
| ,--. |
| | | : pipeline stage |
| `--' |
| ---> : forward communication |
| <--- : backward communication |
| |
| rv : reservation information for input buffers |
| |
| ,------. ,------. ,------. ,-------. |
| (from --[]-v->|Fetch1|-[]->|Fetch2|-[]->|Decode|-[]->|Execute|--> (to Fetch1 |
| Execute) | | |<-[]-| |<-rv-| |<-rv-| | & Fetch2) |
| | `------'<-rv-| | | | | | |
| `-------------->| | | | | | |
| `------' `------' `-------' |
| ------------------------------------------------------------------------------ |
| \endverbatim |
| |
| The four pipeline stages are connected together by MinorBuffer FIFO |
| (stage.hh, derived ultimately from TimeBuffer) structures which allow |
| inter-stage delays to be modelled. There are MinorBuffer%s between adjacent
| stages in the forward direction (for example: passing lines from Fetch1 to |
| Fetch2) and, between Fetch2 and Fetch1, a buffer in the backwards direction |
| carrying branch predictions. |
| |
| Stages Fetch2, Decode and Execute have input buffers which, each cycle, can |
| accept input data from the previous stage and can hold that data if the stage |
| is not ready to process it. Input buffers store data in the same form as it |
| is received and so Decode and Execute's input buffers contain the output |
| instruction vector (ForwardInstData (pipe_data.hh)) from their previous stages |
| with the instructions and bubbles in the same positions as a single buffer |
| entry. |
| |
| Stage input buffers provide a Reservable (stage.hh) interface to their |
| previous stages, to allow slots to be reserved in their input buffers, and |
| communicate their input buffer occupancy backwards to allow the previous stage |
| to plan whether it should make an output in a given cycle. |
| |
| \subsection events Event handling: MinorActivityRecorder (activity.hh, \
| pipeline.hh)
| |
| Minor is essentially a cycle-callable model with some ability to skip cycles |
| based on pipeline activity. External events are mostly received by callbacks |
| (e.g. Fetch1::IcachePort::recvTimingResp) and cause the pipeline to be woken |
| up to service advancing request queues. |
| |
| Ticked (sim/ticked.hh) is a base class bringing together an evaluate |
| member function and a provided SimObject. It provides a Ticked::start/stop |
| interface to start and pause clock events from being periodically issued. |
| Pipeline is a derived class of Ticked. |
| |
| During evaluate calls, stages can signal that they still have work to do in |
| the next cycle by calling either MinorCPU::activityRecorder->activity() (for |
| non-callable related activity) or MinorCPU::wakeupOnEvent(<stageId>) (for |
| stage callback-related 'wakeup' activity). |
| |
| Pipeline::evaluate contains calls to evaluate for each unit and a test for |
| pipeline idling which can turn off the clock tick if no unit has signalled
| that it may become active next cycle. |
| |
| Within Pipeline (pipeline.hh), the stages are evaluated in reverse order (and |
| so will ::evaluate in reverse order) and their backwards data can be |
| read immediately after being written in each cycle allowing output decisions |
| to be 'perfect' (allowing synchronous stalling of the whole pipeline). Branch |
| predictions from Fetch2 to Fetch1 can also be transported in 0 cycles making |
| fetch1ToFetch2BackwardDelay the only configurable delay which can be set as |
| low as 0 cycles. |
| |
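| For example, that backward delay can be set from the Python configuration
| (fetch1ToFetch2BackwardDelay is the parameter named above; the other
| inter-stage delays are assumed to follow a similar naming scheme in
| MinorCPU.py):
|
| \verbatim
| from m5.objects import MinorCPU
|
| cpu = MinorCPU()
| # Deliver Fetch2's branch predictions to Fetch1 in the same cycle.
| cpu.fetch1ToFetch2BackwardDelay = 0
| \endverbatim
|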
| The MinorCPU::activateContext and MinorCPU::suspendContext interface can be |
| called to start and pause threads (threads in the MT sense) and to start and |
| pause the pipeline. Executing instructions can call this interface |
| (indirectly through the ThreadContext) to idle the CPU/their threads. |
| |
| \subsection stages Each pipeline stage |
| |
| In general, the behaviour of a stage (each cycle) is: |
| |
| \verbatim |
| evaluate: |
| push input to inputBuffer |
| setup references to input/output data slots |
| |
| do 'every cycle' 'step' tasks |
| |
| if there is input and there is space in the next stage: |
| process and generate a new output |
| maybe re-activate the stage |
| |
| send backwards data |
| |
| if the stage generated output to the following FIFO: |
| signal pipe activity |
| |
| if the stage has more processable input and space in the next stage: |
| re-activate the stage for the next cycle |
| |
| commit the push to the inputBuffer if that data hasn't all been used |
| \endverbatim |
| |
| The Execute stage differs from this model as its forward output (branch) data |
| is unconditionally sent to Fetch1 and Fetch2. To allow this behaviour, Fetch1 |
| and Fetch2 must be unconditionally receptive to that data. |
| |
| \subsection fetch1 Fetch1 stage |
| |
| Fetch1 is responsible for fetching cache lines or partial cache lines from the |
| I-cache and passing them on to Fetch2 to be decomposed into instructions. It |
| can receive 'change of stream' indications from both Execute and Fetch2 to |
| signal that it should change its internal fetch address and tag newly fetched |
| lines with new stream or prediction sequence numbers. When both Execute and |
| Fetch2 signal changes of stream at the same time, Fetch1 takes Execute's |
| change. |
| |
| Every line issued by Fetch1 will bear a unique line sequence number which can |
| be used for debugging stream changes. |
| |
| When fetching from the I-cache, Fetch1 will ask for data from the current |
| fetch address (Fetch1::pc) up to the end of the 'data snap' size set in the |
| parameter fetch1LineSnapWidth. Subsequent autonomous line fetches will fetch |
| whole lines at a snap boundary and of size fetch1LineWidth. |
| |
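| A configuration sketch using the two parameters just mentioned (the byte
| values are purely illustrative; a value of 0 is normally taken to mean 'use
| the cache line size', see MinorCPU.py):
|
| \verbatim
| from m5.objects import MinorCPU
|
| cpu = MinorCPU()
| cpu.fetch1LineSnapWidth = 32  # first fetch runs up to a 32 byte snap boundary
| cpu.fetch1LineWidth = 64      # later fetches are whole 64 byte lines
| \endverbatim
|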
| Fetch1 will only initiate a memory fetch if it can reserve space in Fetch2's
| input buffer. That input buffer serves as the fetch queue/LFL for the system.
| |
| Fetch1 contains two queues: requests and transfers to handle the stages of |
| translating the address of a line fetch (via the TLB) and accommodating the |
| request/response of fetches to/from memory. |
| |
| Fetch requests from Fetch1 are pushed into the requests queue as newly |
| allocated FetchRequest objects once they have been sent to the ITLB with a |
| call to itb->translateTiming. |
| |
| A response from the TLB moves the request from the requests queue to the |
| transfers queue. If there is more than one entry in each queue, it is |
| possible to get a TLB response for a request which is not at the head of the
| requests queue. In that case, the TLB response is marked up as a state change
| to Translated in the request object, and advancing the request to transfers
| (and the memory system) is left to calls to Fetch1::stepQueues, which is called
| in the cycle following the receipt of any event.
| |
| Fetch1::tryToSendToTransfers is responsible for moving requests between the |
| two queues and issuing requests to memory. Failed TLB lookups (prefetch |
| aborts) continue to occupy space in the queues until they are recovered at the |
| head of transfers. |
| |
| Responses from memory change the request object state to Complete and |
| Fetch1::evaluate can pick up response data, package it in the ForwardLineData |
| object, and forward it to Fetch2%'s input buffer. |
| |
| As space is always reserved in Fetch2::inputBuffer, setting the input buffer's |
| size to 1 results in non-prefetching behaviour. |
| |
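| Assuming the buffer depth is exposed as the parameter fetch2InputBufferSize
| (by analogy with executeInputBufferSize; check MinorCPU.py for the exact
| name), non-prefetching behaviour would be configured as:
|
| \verbatim
| from m5.objects import MinorCPU
|
| cpu = MinorCPU()
| cpu.fetch2InputBufferSize = 1  # parameter name assumed, see MinorCPU.py
| \endverbatim
|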
| When a change of stream occurs, translated requests queue members and |
| completed transfers queue members can be unconditionally discarded to make way |
| for new transfers. |
| |
| \subsection fetch2 Fetch2 stage |
| |
| Fetch2 receives a line from Fetch1 into its input buffer. The data in the |
| head line in that buffer is iterated over and separated into individual |
| instructions which are packed into a vector of instructions which can be |
| passed to Decode. Packing instructions can be aborted early if a fault is |
| found in either the input line as a whole or a decomposed instruction. |
| |
| \subsubsection bp Branch prediction |
| |
| Fetch2 contains the branch prediction mechanism. This is a wrapper around the |
| branch predictor interface provided by gem5 (cpu/pred/...). |
| |
| Branches are predicted for any control instructions found. If prediction is |
| attempted for an instruction, the MinorDynInst::triedToPredict flag is set on |
| that instruction. |
| |
| When a branch is predicted to be taken, the MinorDynInst::predictedTaken flag
| is set and MinorDynInst::predictedTarget is set to the predicted target PC
| value.
| The predicted branch instruction is then packed into Fetch2%'s output vector, |
| the prediction sequence number is incremented, and the branch is communicated |
| to Fetch1. |
| |
| After signalling a prediction, Fetch2 will discard its input buffer contents |
| and will reject any new lines which have the same stream sequence number as |
| that branch but have a different prediction sequence number. This allows |
| following sequentially fetched lines to be rejected without ignoring new lines |
| generated by a change of stream indicated from a 'real' branch from Execute |
| (which will have a new stream sequence number). |
| |
| The program counter value provided to Fetch2 by Fetch1 packets is only updated |
| when there is a change of stream. Fetch2::havePC indicates whether the PC |
| will be picked up from the next processed input line. Fetch2::havePC is |
| necessary to allow line-wrapping instructions to be tracked through decode. |
| |
| Branches (and instructions predicted to branch) which are processed by Execute |
| will generate BranchData (pipe_data.hh) data explaining the outcome of the |
| branch which is sent forwards to Fetch1 and Fetch2. Fetch1 uses this data to |
| change stream (and update its stream sequence number and address for new |
| lines). Fetch2 uses it to update the branch predictor. Minor does not |
| communicate branch data to the branch predictor for instructions which are |
| discarded on the way to commit. |
| |
| BranchData::BranchReason (pipe_data.hh) encodes the possible branch scenarios: |
| |
| <table> |
| <tr> |
| <td>Branch enum val.</td> |
| <td>In Execute</td> |
| <td>Fetch1 reaction</td> |
| <td>Fetch2 reaction</td> |
| </tr> |
| <tr> |
| <td>NoBranch</td> |
| <td>(output bubble data)</td> |
| <td>-</td> |
| <td>-</td> |
| </tr> |
| <tr> |
| <td>CorrectlyPredictedBranch</td> |
| <td>Predicted, taken</td> |
| <td>-</td> |
| <td>Update BP as taken branch</td> |
| </tr> |
| <tr> |
| <td>UnpredictedBranch</td> |
| <td>Not predicted, taken and was taken</td> |
| <td>New stream</td> |
| <td>Update BP as taken branch</td> |
| </tr> |
| <tr> |
| <td>BadlyPredictedBranch</td> |
| <td>Predicted, not taken</td> |
| <td>New stream to restore to old inst. source</td> |
| <td>Update BP as not taken branch</td> |
| </tr> |
| <tr> |
| <td>BadlyPredictedBranchTarget</td> |
| <td>Predicted, taken, but to a different target than predicted one</td> |
| <td>New stream</td> |
| <td>Update BTB to new target</td> |
| </tr> |
| <tr> |
| <td>SuspendThread</td> |
| <td>Hint to suspend fetching</td> |
| <td>Suspend fetch for this thread (branch to next inst. as wakeup |
| fetch addr)</td> |
| <td>-</td> |
| </tr> |
| <tr> |
| <td>Interrupt</td> |
| <td>Interrupt detected</td> |
| <td>New stream</td> |
| <td>-</td> |
| </tr> |
| </table> |
| |
| The parameter decodeInputWidth sets the number of instructions which can be |
| packed into the output per cycle. If the parameter fetch2CycleInput is true, |
| Fetch2 can try to take instructions from more than one entry in its input
| buffer per cycle. |
| |
| \subsection decode Decode stage |
| |
| Decode takes a vector of instructions from Fetch2 (via its input buffer) and |
| decomposes those instructions into micro-ops (if necessary) and packs them |
| into its output instruction vector. |
| |
| The parameter executeInputWidth sets the number of instructions which can be |
| packed into the output per cycle. If the parameter decodeCycleInput is true, |
| Decode can try to take instructions from more than one entry in its input |
| buffer per cycle. |
| |
| \subsection execute Execute stage |
| |
| Execute provides all the instruction execution and memory access mechanisms. |
| An instruction's passage through Execute can take multiple cycles with its
| precise timing modelled by a functional unit pipeline FIFO. |
| |
| A vector of instructions (possibly including fault 'instructions') is provided |
| to Execute by Decode and can be queued in the Execute input buffer before |
| being issued. Setting the parameter executeCycleInput allows execute to |
| examine more than one input buffer entry (more than one instruction vector). |
| The number of instructions in the input vector can be set with |
| executeInputWidth and the depth of the input buffer can be set with parameter |
| executeInputBufferSize. |
| |
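| A sketch of those three parameters in a Python configuration (the values are
| illustrative only):
|
| \verbatim
| from m5.objects import MinorCPU
|
| cpu = MinorCPU()
| cpu.executeInputWidth = 2        # instructions per input vector from Decode
| cpu.executeCycleInput = True     # may examine >1 input buffer entry per cycle
| cpu.executeInputBufferSize = 7   # depth of Execute's input buffer
| \endverbatim
|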
| \subsubsection fus Functional units |
| |
| The Execute stage contains pipelines for each functional unit comprising the |
| computational core of the CPU. Functional units are configured via the |
| executeFuncUnits parameter. Each functional unit has a number of instruction
| classes it supports, a stated delay between instruction issues, a delay from
| instruction issue to (possible) commit, and an optional timing annotation
| capable of more complicated timing.
| |
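| As a sketch of how a pool might be described (the MinorFU/MinorFUPool
| classes, the opClasses/opLat/issueLat parameters and the minorMakeOpClassSet
| helper are assumed to match their definitions in MinorCPU.py; treat this as
| illustrative rather than definitive):
|
| \verbatim
| from m5.objects import MinorCPU, MinorFU, MinorFUPool, minorMakeOpClassSet
|
| class SlowIntFU(MinorFU):
|     # One integer ALU pipeline: a new instruction can be accepted every
|     # cycle, and its result is available three cycles after issue.
|     opClasses = minorMakeOpClassSet(['IntAlu'])
|     issueLat = 1
|     opLat = 3
|
| class MyFUPool(MinorFUPool):
|     funcUnits = [SlowIntFU()]
|
| cpu = MinorCPU()
| cpu.executeFuncUnits = MyFUPool()
| \endverbatim
|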
| Each active cycle, Execute::evaluate performs this action: |
| |
| \verbatim |
| Execute::evaluate: |
| push input to inputBuffer |
| setup references to input/output data slots and branch output slot |
| |
| step D-cache interface queues (similar to Fetch1) |
| |
| if interrupt posted: |
| take interrupt (signalling branch to Fetch1/Fetch2) |
| else |
| commit instructions |
| issue new instructions |
| |
| advance functional unit pipelines |
| |
| reactivate Execute if the unit is still active |
| |
| commit the push to the inputBuffer if that data hasn't all been used |
| \endverbatim |
| |
| \subsubsection fifos Functional unit FIFOs |
| |
| Functional units are implemented as SelfStallingPipelines (stage.hh). These |
| are TimeBuffer FIFOs with two distinct 'push' and 'pop' wires. They respond |
| to SelfStallingPipeline::advance in the same way as TimeBuffers <b>unless</b> |
| there is data at the far, 'pop', end of the FIFO. A 'stalled' flag is |
| provided for signalling stalling and to allow a stall to be cleared. The |
| intention is to provide a pipeline for each functional unit which will never |
| advance an instruction out of that pipeline until it has been processed and |
| the pipeline is explicitly unstalled. |
| |
| The actions 'issue', 'commit', and 'advance' act on the functional units. |
| |
| \subsubsection issue Issue |
| |
| Issuing instructions involves iterating over both the input buffer |
| instructions and the heads of the functional units to try and issue |
| instructions in order. The number of instructions which can be issued each |
| cycle is limited by the parameter executeIssueLimit, how executeCycleInput is |
| set, the availability of pipeline space and the policy used to choose a |
| pipeline in which the instruction can be issued. |
| |
| At present, the only issue policy is strict round-robin visiting of each |
| pipeline with the given instructions in sequence. For greater flexibility, |
| better (and more specific) policies will need to be possible.
| |
| Memory operation instructions traverse their functional units to perform their |
| EA calculations. On 'commit', the ExecContext::initiateAcc execution phase is |
| performed and any memory access is issued (via ExecContext::{read,write}Mem
| calling LSQ::pushRequest) to the LSQ. |
| |
| Note that faults are issued as if they are instructions and can (currently) be |
| issued to *any* functional unit. |
| |
| Every issued instruction is also pushed into the Execute::inFlightInsts queue. |
| Memory ref. instructions are also pushed into the Execute::inFUMemInsts queue.
| |
| \subsubsection commit Commit |
| |
| Instructions are committed by examining the head of the Execute::inFlightInsts |
| queue (which is decorated with the functional unit number to which the |
| instruction was issued). Instructions which can then be found in their |
| functional units are executed and popped from Execute::inFlightInsts. |
| |
| Memory operation instructions are committed into the memory queues (as |
| described above) and exit their functional unit pipeline but are not popped |
| from the Execute::inFlightInsts queue. The Execute::inFUMemInsts queue |
| provides ordering to memory operations as they pass through the functional |
| units (maintaining issue order). On entering the LSQ, instructions are popped |
| from Execute::inFUMemInsts. |
| |
| If the parameter executeAllowEarlyMemoryIssue is set, memory operations can be |
| sent from their FU to the LSQ before reaching the head of |
| Execute::inFlightInsts but after their dependencies are met. |
| MinorDynInst::instToWaitFor is marked up with the latest dependent instruction |
| execSeqNum required to be committed for a memory operation to progress to the |
| LSQ. |
| |
| Once a memory response is available (by testing the head of |
| Execute::inFlightInsts against LSQ::findResponse), commit will process that |
| response (ExecContext::completeAcc) and pop the instruction from |
| Execute::inFlightInsts. |
| |
| Any branch, fault or interrupt will cause a stream sequence number change and |
| signal a branch to Fetch1/Fetch2. Only instructions with the current stream |
| sequence number will be issued and/or committed. |
| |
| \subsubsection advance Advance |
| |
| All non-stalled pipelines are advanced and may, thereafter, become stalled.
| Potential activity in the next cycle is signalled if there are any |
| instructions remaining in any pipeline. |
| |
| \subsubsection sb Scoreboard |
| |
| The scoreboard (Scoreboard) is used to control instruction issue. It contains |
| a count of the number of in flight instructions which will write each general |
| purpose CPU integer or float register. Instructions will only be issued where |
| the scoreboard contains a count of 0 instructions which will write to one of |
| the instruction's source registers.
| |
| Once an instruction is issued, the scoreboard counts for each of its
| destination registers will be incremented.
| |
| The estimated delivery time of the instruction's result is marked up in the |
| scoreboard by adding the length of the issued-to FU to the current time. The |
| timings parameter on each FU provides a list of additional rules for |
| calculating the delivery time. These are documented in the parameter comments |
| in MinorCPU.py. |
| |
| On commit (for memory operations, on memory response commit), the scoreboard
| counters for the instruction's destination registers are decremented.
| |
| \subsubsection ifi Execute::inFlightInsts |
| |
| The Execute::inFlightInsts queue will always contain all instructions in |
| flight in Execute in the correct issue order. Execute::issue is the only |
| process which will push an instruction into the queue. Execute::commit is the |
| only process that can pop an instruction. |
| |
| \subsubsection lsq LSQ |
| |
| The LSQ can support multiple outstanding transactions to memory in a number of |
| conservative cases. |
| |
| There are three queues to contain requests: requests, transfers and the store |
| buffer. The requests and transfers queues operate in a similar manner to the
| queues in Fetch1. The store buffer is used to decouple the delay of |
| completing store operations from following loads. |
| |
| Requests are issued to the DTLB as their instructions leave their functional |
| unit. At the head of requests, cacheable load requests can be sent to memory |
| and on to the transfers queue. Cacheable stores will be passed to transfers |
| unprocessed and progress that queue maintaining order with other transactions. |
| |
| The conditions in LSQ::tryToSendToTransfers dictate when requests can |
| be sent to memory. |
| |
| All uncacheable transactions, split transactions and locked transactions are |
| processed in order at the head of requests. Additionally, store results |
| residing in the store buffer can have their data forwarded to cacheable loads |
| (removing the need to perform a read from memory) but no cacheable load can be |
| issued to the transfers queue until that queue's stores have drained into the
| store buffer. |
| |
| At the end of transfers, requests which are LSQ::LSQRequest::Complete (are |
| faulting, are cacheable stores, or have been sent to memory and received a |
| response) can be picked off by Execute, committed (ExecContext::completeAcc)
| and, for stores, sent to the store buffer.
| |
| Barrier instructions do not prevent cacheable loads from progressing to memory |
| but do cause a stream change which will discard that load. Stores will not be |
| committed to the store buffer if they are in the shadow of the barrier but |
| before the new instruction stream has arrived at Execute. As all other memory |
| transactions are delayed at the end of the requests queue until they are at |
| the head of Execute::inFlightInsts, they will be discarded by any barrier |
| stream change. |
| |
| After commit, LSQ::BarrierDataRequest requests are inserted into the |
| store buffer to track each barrier until all preceding memory transactions |
| have drained from the store buffer. No further memory transactions will be |
| issued from the ends of FUs until after the barrier has drained. |
| |
| \subsubsection drain Draining |
| |
| Draining is mostly handled by the Execute stage. When initiated by calling |
| MinorCPU::drain, Pipeline::evaluate checks the draining status of each unit |
| each cycle and keeps the pipeline active until draining is complete. It is |
| Pipeline that signals the completion of draining. Execute is triggered by |
| MinorCPU::drain and starts stepping through its Execute::DrainState state |
| machine, starting from state Execute::NotDraining, in this order: |
| |
| <table> |
| <tr> |
| <td><b>State</b></td> |
| <td><b>Meaning</b></td> |
| </tr> |
| <tr> |
| <td>Execute::NotDraining</td> |
| <td>Not trying to drain, normal execution</td> |
| </tr> |
| <tr> |
| <td>Execute::DrainCurrentInst</td> |
| <td>Draining micro-ops to complete inst.</td> |
| </tr> |
| <tr> |
| <td>Execute::DrainHaltFetch</td> |
| <td>Halt fetching instructions</td> |
| </tr> |
| <tr> |
| <td>Execute::DrainAllInsts</td> |
| <td>Discarding all instructions presented</td> |
| </tr> |
| </table> |
| |
| When complete, a drained Execute unit will be in the Execute::DrainAllInsts |
| state where it will continue to discard instructions but has no knowledge of |
| the drained state of the rest of the model. |
| |
| \section debug Debug options |
| |
| The model provides a number of debug flags which can be passed to gem5 with |
| the --debug-flags option. |
| |
| The available flags are: |
| |
| <table> |
| <tr> |
| <td><b>Debug flag</b></td> |
| <td><b>Unit which will generate debugging output</b></td> |
| </tr> |
| <tr> |
| <td>Activity</td> |
| <td>Debug ActivityMonitor actions</td> |
| </tr> |
| <tr> |
| <td>Branch</td> |
| <td>Fetch2 and Execute branch prediction decisions</td> |
| </tr> |
| <tr> |
| <td>MinorCPU</td> |
| <td>CPU global actions such as wakeup/thread suspension</td> |
| </tr> |
| <tr> |
| <td>Decode</td> |
| <td>Decode</td> |
| </tr> |
| <tr> |
| <td>MinorExec</td> |
| <td>Execute behaviour</td> |
| </tr> |
| <tr> |
| <td>Fetch</td> |
| <td>Fetch1 and Fetch2</td> |
| </tr> |
| <tr> |
| <td>MinorInterrupt</td> |
| <td>Execute interrupt handling</td> |
| </tr> |
| <tr> |
| <td>MinorMem</td> |
| <td>Execute memory interactions</td> |
| </tr> |
| <tr> |
| <td>MinorScoreboard</td> |
| <td>Execute scoreboard activity</td> |
| </tr> |
| <tr> |
| <td>MinorTrace</td> |
| <td>Generate MinorTrace cyclic state trace output (see below)</td> |
| </tr> |
| <tr> |
| <td>MinorTiming</td> |
| <td>MinorTiming instruction timing modification operations</td> |
| </tr> |
| </table> |
| |
| The group flag Minor enables all of the flags beginning with Minor. |
| |
| \section trace MinorTrace and minorview.py |
| |
| The debug flag MinorTrace causes cycle-by-cycle state data to be printed which |
| can then be processed and viewed by the minorview.py tool. This output is |
| very verbose and so it is recommended it only be used for small examples. |
| |
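| For example, assuming a gem5.opt binary and a configuration script that
| instantiates a MinorCPU, a trace can be generated and then viewed with
| something like (paths are illustrative):
|
| \verbatim
| gem5.opt --debug-flags=MinorTrace --debug-file=minor.trace <config script>
| util/minorview.py m5out/minor.trace
| \endverbatim
|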
| \subsection traceformat MinorTrace format |
| |
| There are three types of line outputted by MinorTrace: |
| |
| \subsubsection state MinorTrace - Ticked unit cycle state |
| |
| For example: |
| |
| \verbatim |
| 110000: system.cpu.dcachePort: MinorTrace: state=MemoryRunning in_tlb_mem=0/0 |
| \endverbatim |
| |
| For each time step, the MinorTrace flag will cause one MinorTrace line to be |
| printed for every named element in the model. |
| |
| \subsubsection traceunit MinorInst - summaries of instructions issued by \ |
| Decode |
| |
| For example: |
| |
| \verbatim |
| 140000: system.cpu.execute: MinorInst: id=0/1.1/1/1.1 addr=0x5c \ |
| inst=" mov r0, #0" class=IntAlu |
| \endverbatim |
| |
| MinorInst lines are currently only generated for instructions which are |
| committed. |
| |
| \subsubsection tracefetch1 MinorLine - summaries of line fetches issued by \ |
| Fetch1 |
| |
| For example: |
| |
| \verbatim |
| 92000: system.cpu.icachePort: MinorLine: id=0/1.1/1 size=36 \ |
| vaddr=0x5c paddr=0x5c |
| \endverbatim |
| |
| \subsection minorview minorview.py |
| |
| Minorview (util/minorview.py) can be used to visualise the data created by |
| MinorTrace. |
| |
| \verbatim |
| usage: minorview.py [-h] [--picture picture-file] [--prefix name] |
| [--start-time time] [--end-time time] [--mini-views] |
| event-file |
| |
| Minor visualiser |
| |
| positional arguments: |
| event-file |
| |
| optional arguments: |
| -h, --help show this help message and exit |
| --picture picture-file |
| markup file containing blob information (default: |
| <minorview-path>/minor.pic) |
| --prefix name name prefix in trace for CPU to be visualised |
| (default: system.cpu) |
| --start-time time time of first event to load from file |
| --end-time time time of last event to load from file |
| --mini-views show tiny views of the next 10 time steps |
| \endverbatim |
| |
| Raw debugging output can be passed to minorview.py as the event-file. It will
| pick out the MinorTrace lines. Other lines in which units in the simulation
| are named (such as system.cpu.dcachePort in the above example) will appear as
| 'comments' when those units are clicked on in the visualiser.
| |
| Clicking on a unit which contains instructions or lines will bring up a speech |
| bubble giving extra information derived from the MinorInst/MinorLine lines. |
| |
| --start-time and --end-time allow only sections of debug files to be loaded. |
| |
| --prefix allows the name prefix of the CPU to be inspected to be supplied. |
| This defaults to 'system.cpu'. |
| |
| In the visualiser, the buttons Start, End, Back, Forward, Play and Stop can be
| used to control the displayed simulation time. |
| |
| The diagonally striped coloured blocks show the InstId of the
| instruction or line they represent. Note that lines in Fetch1 and f1ToF2.F |
| only show the id fields of a line and that instructions in Fetch2, f2ToD, and |
| decode.inputBuffer do not yet have execute sequence numbers. The T/S.P/L/F.E |
| buttons can be used to toggle parts of InstId on and off to make it easier to |
| understand the display. Useful combinations are: |
| |
| <table> |
| <tr> |
| <td><b>Combination</b></td> |
| <td><b>Reason</b></td> |
| </tr> |
| <tr> |
| <td>E</td> |
| <td>just show the final execute sequence number</td> |
| </tr> |
| <tr> |
| <td>F/E</td> |
| <td>show the instruction-related numbers</td> |
| </tr> |
| <tr> |
| <td>S/P</td> |
| <td>show just the stream-related numbers (watch the stream sequence |
| change with branches and not change with predicted branches)</td> |
| </tr> |
| <tr> |
| <td>S/E</td> |
| <td>show instructions and their stream</td> |
| </tr> |
| </table> |
| |
| The key to the right shows all the displayable colours (some of the colour |
| choices are quite bad!): |
| |
| <table> |
| <tr> |
| <td><b>Symbol</b></td> |
| <td><b>Meaning</b></td> |
| </tr> |
| <tr> |
| <td>U</td> |
| <td>Unknown data</td> |
| </tr> |
| <tr> |
| <td>B</td> |
| <td>Blocked stage</td> |
| </tr> |
| <tr> |
| <td>-</td> |
| <td>Bubble</td> |
| </tr> |
| <tr> |
| <td>E</td> |
| <td>Empty queue slot</td> |
| </tr> |
| <tr> |
| <td>R</td> |
| <td>Reserved queue slot</td> |
| </tr> |
| <tr> |
| <td>F</td> |
| <td>Fault</td> |
| </tr> |
| <tr> |
| <td>r</td> |
| <td>Read (used as the leftmost stripe on data in the dcachePort)</td> |
| </tr> |
| <tr> |
| <td>w</td> |
| <td>Write " "</td> |
| </tr> |
| <tr> |
| <td>0 to 9</td> |
| <td>last decimal digit of the corresponding data</td> |
| </tr> |
| </table> |
| |
| \verbatim |
| |
| ,---------------. .--------------. *U |
| | |=|->|=|->|=| | ||=|||->||->|| | *- <- Fetch queues/LSQ |
| `---------------' `--------------' *R |
| === ====== *w <- Activity/Stage activity |
| ,--------------. *1 |
| ,--. ,. ,. | ============ | *3 <- Scoreboard |
| | |-\[]-\||-\[]-\||-\[]-\| ============ | *5 <- Execute::inFlightInsts |
| | | :[] :||-/[]-/||-/[]-/| -. -------- | *7 |
| | |-/[]-/|| ^ || | | --------- | *9 |
| | | || | || | | ------ | |
| []->| | ->|| | || | | ---- | |
| | |<-[]<-||<-+-<-||<-[]<-| | ------ |->[] <- Execute to Fetch1, |
| '--` `' ^ `' | -' ------ | Fetch2 branch data |
| ---. | ---. `--------------' |
| ---' | ---' ^ ^ |
| | ^ | `------------ Execute |
| MinorBuffer ----' input `-------------------- Execute input buffer |
| buffer |
| \endverbatim |
| |
| Stages show the colours of the instructions currently being |
| generated/processed. |
| |
| Forward FIFOs between stages show the data being pushed into them at the |
| current tick (to the left), the data in transit, and the data available at |
| their outputs (to the right). |
| |
| The backwards FIFO between Fetch2 and Fetch1 shows branch prediction data. |
| |
| In general, all displayed data is correct at the end of a cycle's activity at |
| the time indicated but before the inter-stage FIFOs are ticked. Each FIFO |
| has, therefore, an extra slot to show the asserted new input data, and all the
| data currently within the FIFO. |
| |
| Input buffers for each stage are shown below the corresponding stage and show |
| the contents of those buffers as horizontal strips. Strips marked as reserved |
| (cyan by default) are reserved to be filled by the previous stage. An input |
| buffer with all reserved or occupied slots will, therefore, block the previous |
| stage from generating output. |
| |
| Fetch queues and LSQ show the lines/instructions in the queues of each |
| interface and show the number of lines/instructions in TLB and memory in the |
| two striped colours of the top of their frames. |
| |
| Inside Execute, the horizontal bars represent the individual FU pipelines. |
| The vertical bar to the left is the input buffer and the bar to the right, the |
| instructions committed this cycle. The background of Execute shows |
| instructions which are being committed this cycle in their original FU |
| pipeline positions. |
| |
| The strip at the top of the Execute block shows the current streamSeqNum that |
| Execute is committing. A similar stripe at the top of Fetch1 shows that |
| stage's expected streamSeqNum and the stripe at the top of Fetch2 shows its |
| issuing predictionSeqNum. |
| |
| The scoreboard shows the number of instructions in flight which will commit a |
| result to the register in the position shown. The scoreboard contains slots |
| for each integer and floating point register. |
| |
| The Execute::inFlightInsts queue shows all the instructions in flight in |
| Execute with the oldest instruction (the next instruction to be committed) to |
| the right. |
| |
| 'Stage activity' shows the signalled activity (as E/1) for each stage (with |
| CPU miscellaneous activity to the left).
|
| 'Activity' shows a count of stage and pipe activity.
| |
| \subsection picformat minor.pic format |
| |
| The minor.pic file (src/minor/minor.pic) describes the layout of the |
| model's blocks on the visualiser. Its format is described in the supplied
| minor.pic file. |
| |
| */ |
| |
| } |