| __ __ ____ _ _____ ____ _ |
| | \/ | ___| _ \ / \|_ _| | __ ) ___| |_ __ _ |
| | |\/| |/ __| |_) / _ \ | | | _ \ / _ \ __|/ _` | |
| | | | | (__| __/ ___ \| | | |_) | __/ |_| (_| | |
| |_| |_|\___|_| /_/ \_\_| |____/ \___|\__|\__,_| |
| |
| McPAT: Multicore Power, Area, and Timing |
| Current version 0.8Beta |
| =============================== |
| |
| McPAT is an architectural modeling tool for chip multiprocessors (CMPs). |
| The main focus of McPAT is accurate power and area |
| modeling, and a target clock rate is used as a design constraint. |
| McPAT performs an automatic, extensive search to find optimal designs |
| that satisfy the target clock frequency. |
| |
| For complete documentation of McPAT, please refer to the McPAT 1.0 |
| technical report and the following paper, |
| "McPAT: An Integrated Power, Area, and Timing Modeling |
| Framework for Multicore and Manycore Architectures", |
| which appeared in MICRO 2009. Please cite the paper if you use |
| McPAT in your work. The bibtex entry is provided below for your convenience. |
| |
| @inproceedings{mcpat:micro, |
| author = {Sheng Li and Jung Ho Ahn and Richard D. Strong and Jay B. Brockman and Dean M. Tullsen and Norman P. Jouppi}, |
| title = "{McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures}", |
| booktitle = {MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture}, |
| year = {2009}, |
| pages = {469--480}, |
| } |
| |
| McPAT is currently in its beta release. |
| List of features of the beta release |
| =============================== |
| The following features are supported by the tool. |
| |
| * Power, area, and timing models for CMPs with: |
| Inorder cores, both single-threaded and multithreaded |
| OOO cores, both single-threaded and multithreaded |
| Shared/coherent caches with directory hardware: |
| including directory cache, shadowed tag directory |
| and static bank mapped tag directory |
| Network-on-Chip |
| On-chip memory controllers |
| |
| * Internal models are based on real modern processors: |
| Inorder models are based on Sun Niagara family |
| OOO models are based on Intel P6 for reservation |
| station based OOO cores, and on Intel Netburst and |
| Alpha 21264 for physical register file based OOO cores. |
| |
| * Leakage power modeling considers both sub-threshold leakage |
| and gate leakage power. The impact of operating temperature |
| on both kinds of leakage power is considered. Longer channel devices, |
| which can reduce leakage significantly with a modest performance |
| penalty, are also modeled. |
| |
| * McPAT supports an automatic, extensive search to find optimal designs |
| that satisfy the target clock frequency. The timing constraints |
| include both throughput and latency. |
| |
| * Interconnect model with different delay, power, and area |
| properties, as well as both the aggressive and conservative |
| interconnect projections on wire technologies. |
| |
| * All process-specific values used by McPAT are obtained |
| from ITRS. McPAT currently supports the 90nm, 65nm, 45nm, |
| 32nm, and 22nm technology nodes. At the 32nm and 22nm nodes, SOI |
| and DG devices are used. After 45nm, Hi-K metal gates are used. |
| |
| How to use the tool? |
| ==================== |
| |
| McPAT takes input parameters from an XML-based interface, |
| then computes the area and peak power of the target processor. |
| Please note that the peak power is the absolute worst-case power, |
| which could be even higher than TDP. |
| |
| 1. Steps to run McPAT: |
| -> define the target processor using inorder.xml or OOO.xml |
| -> run the "mcpat" binary: |
| ./mcpat -infile <*.xml> -print_level <level of detailed output> |
| ./mcpat -h (or mcpat --help) will show the quick help message. |
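| |
| A minimal Python sketch of the invocation above, for scripted runs. It |
| only assembles flags named in this README (-infile, -print_level, |
| -opt_for_clk). The helper names, the default binary path "./mcpat", and |
| passing "0" to switch -opt_for_clk off are assumptions for illustration. |
| |
```python
import subprocess
from pathlib import Path


def build_mcpat_command(xml_path, print_level, opt_for_clk=None, binary="./mcpat"):
    """Return the argument list for one McPAT run.

    Hypothetical helper: it only uses flags named in the README.
    """
    cmd = [binary, "-infile", str(xml_path), "-print_level", str(print_level)]
    if opt_for_clk is not None:
        # Assumption: "1" turns the strict timing optimization on, "0" off.
        cmd += ["-opt_for_clk", "1" if opt_for_clk else "0"]
    return cmd


def run_mcpat(xml_path, print_level, **kwargs):
    """Run McPAT and return its text output; raises if the run fails."""
    if not Path(xml_path).is_file():
        raise FileNotFoundError(xml_path)
    cmd = build_mcpat_command(xml_path, print_level, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```
| |
| For example, run_mcpat("Niagara1.xml", 5) would run the enclosed Niagara1 |
| sample with a print level of 5, provided the binary has been built. |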
| |
| Rather than being hardwired to certain simulators, McPAT |
| uses an XML-based interface to enable easy integration |
| with various performance simulators. Our collaborator, |
| Richard Strong, at the University of California, San Diego, |
| designed an experimental parser for the M5 simulator that aims to |
| streamline the integration of McPAT and M5. Please check the M5 |
| repository for the latest version of the parser. |
| |
| 2. Optimize: |
| McPAT will try its best to satisfy the target clock rate. |
| When it cannot find a valid solution, it gives out warnings, |
| while still giving the solution that is closest to the timing |
| constraints and calculating power based on it. The optimization |
| will lead to larger power/area numbers for higher target clock |
| rates. McPAT also provides the option "-opt_for_clk" to turn on |
| ("-opt_for_clk 1") and off this strict optimization for the |
| timing constraint. When it is off, McPAT always optimizes each |
| component for ED^2P without worrying about meeting the |
| target clock frequency. Turning it off reduces the computation |
| time, which suits situations where the target clock rate |
| is conservative. |
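| |
| The selection behavior described above can be sketched in Python as |
| follows. This is an illustration, not McPAT's internal search: the |
| candidate format, the function names, and the choice of the lowest |
| ED^2P among designs that meet timing are assumptions. ED^2P is taken |
| to be energy times delay squared. |
| |
```python
def ed2p(energy, delay):
    """Energy-delay-squared product, E * D^2 (lower is better)."""
    return energy * delay ** 2


def pick_design(candidates, target_clock_hz, opt_for_clk=True):
    """Pick one candidate; each is a dict with 'energy' (J) and 'delay' (s).

    Returns (design, met_timing). Hypothetical helper for illustration.
    """
    if not candidates:
        raise ValueError("no candidate designs")
    target_delay = 1.0 / target_clock_hz

    if not opt_for_clk:
        # Timing optimization off: minimize ED^2P and ignore the target.
        best = min(candidates, key=lambda c: ed2p(c["energy"], c["delay"]))
        return best, best["delay"] <= target_delay

    feasible = [c for c in candidates if c["delay"] <= target_delay]
    if feasible:
        # Assumption: among designs that meet timing, prefer lowest ED^2P.
        best = min(feasible, key=lambda c: ed2p(c["energy"], c["delay"]))
        return best, True

    # No design meets timing: fall back to the closest one (and warn).
    closest = min(candidates, key=lambda c: c["delay"] - target_delay)
    return closest, False
```
| |
| A caller would treat met_timing == False as the warning case and still |
| use the returned design for the power calculation. |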
| |
| 3. The output: |
| McPAT outputs results in a hierarchical manner. Increasing |
| the "-print_level" will show detailed results inside each |
| component. For each component, the major parts are shown, and the associated |
| pipeline registers/control logic are added to the total area/power of each |
| component. In general, McPAT does not model the area/overhead of the pad |
| frame used in a processor die. |
| |
| 4. How to use the XML interface for McPAT |
| 4.1 Set up the parameters |
| Parameters of target designs need to be set in the *.xml file for |
| entries tagged as "param". McPAT has very detailed parameter settings. |
| Please remove a structure parameter from the file if you want |
| to use its default value. Otherwise, the parameters in the XML file |
| will override the default values. |
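| |
| A small Python sketch of editing "param" entries with the standard |
| xml.etree.ElementTree module. It assumes each entry has the form |
| <param name="..." value="..."/>, as in the homogeneous_cores example in |
| this README. The helper names are hypothetical. |
| |
```python
import xml.etree.ElementTree as ET


def set_param(root, name, value):
    """Set the value of every <param> called `name`; return the count."""
    changed = 0
    for param in root.iter("param"):
        if param.get("name") == name:
            param.set("value", str(value))
            changed += 1
    return changed


def remove_param(root, name):
    """Delete every <param> called `name` so that the default applies."""
    removed = 0
    # Snapshot the elements first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "param" and child.get("name") == name:
                parent.remove(child)
                removed += 1
    return removed
```
| |
| Typical use would be tree = ET.parse("inorder.xml"), a call such as |
| set_param(tree.getroot(), "homogeneous_cores", 1), and then |
| tree.write("my_design.xml"). The output file name is an example only. |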
| |
| 4.2 Pass the statistics |
| There are two options to get the correct stats: a) the performance |
| simulator can capture all the stats in detail and pass them to McPAT; |
| b) the performance simulator can capture only partial stats and pass |
| them to McPAT, and McPAT then reasons about the complete stats using |
| the partial information and the configuration. Therefore, there is |
| some overlap among the stats. |
| |
| 4.3 Interface XML file structures (PLEASE READ!) |
| The XML is hierarchical from the processor level to the micro-architecture |
| level. McPAT supports both heterogeneous and homogeneous manycore processors. |
| |
| 1). For a heterogeneous processor setup, each component (core, NoC, cache, |
| etc.) must have its own instantiations (core0, core1, ..., coreN). |
| Each instantiation will have different parameters as well as its own stats. |
| Thus, the XML file must have multiple instantiations of each type of |
| heterogeneous component, and the corresponding hetero flags must be set |
| in the XML file. The stats in the XML should then be the stats of a single |
| instantiation (e.g. one core), and the reported runtime dynamic power is |
| likewise that of a single instantiation. Since the stats of each |
| instantiation may differ, the output lists every instantiation (e.g. every |
| core) with its own dynamic power, and the total power is simply their sum. |
| |
| 2). For homogeneous processors, the same method as for heterogeneous ones |
| can also be used by treating all homogeneous instantiations as heterogeneous. |
| However, the preferred approach is to use a single representative for all |
| identical components (e.g. core0 to represent all cores) and set the |
| processor to have homogeneous components |
| (e.g. <param name="homogeneous_cores" value="1"/>). Thus, the XML file has |
| only one instantiation to represent all others with the same architectural |
| parameters. The corresponding homo flags must be set in the XML file. The |
| stats in the XML should then be the aggregated stats, i.e. the sum over all |
| instantiations (e.g. the aggregated stats of all cores). In the final |
| results, McPAT will report only a single instantiation of each type of |
| component, and the reported runtime dynamic power is the sum over all |
| instantiations of the same type. This approach runs faster and uses much |
| less memory. |
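| |
| For the homogeneous setup, the aggregation of per-core stats can be |
| sketched in Python as below. The stat names ("loads", "stores", |
| "branches") are placeholders rather than McPAT stat names, and the |
| sketch assumes every entry is an additive event count. |
| |
```python
from collections import Counter


def aggregate_stats(per_core_stats):
    """Sum per-core activity counters into one aggregated dict.

    per_core_stats: iterable of dicts mapping stat name -> event count.
    Assumption: every stat is an event count that is meaningful to add.
    """
    total = Counter()
    for stats in per_core_stats:
        total.update(stats)  # Counter.update adds counts per key
    return dict(total)
```
| |
| The aggregated values would then be written into the stats of the single |
| representative instantiation (e.g. core0). |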
| |
| 5. Guide for integrating McPAT into performance simulators and bypassing the XML interface |
| The detailed work flow of McPAT has two phases: the initialization phase and |
| the computation phase. To start the initialization phase, a user specifies |
| static configurations, including parameters at all three levels: the |
| architectural, circuit, and technology levels. During the initialization |
| phase, McPAT generates the internal chip representation from the |
| configurations set by the user. |
| The computation phase is invoked by McPAT itself or by the performance |
| simulator during simulation to generate runtime power numbers. Before calling |
| McPAT to compute runtime power numbers, the performance simulator needs to |
| pass the statistics, namely the activity factors of each individual |
| component, to McPAT via the XML interface. |
| The initialization phase is very time-consuming, since it repeats many |
| times until valid configurations are found or the possible configurations are |
| exhausted. To reduce the overhead, a user can let the simulator call McPAT |
| directly for the computation phase and call the initialization phase only |
| once, at the beginning of simulation. In this case, the XML interface file |
| is bypassed. Please refer to processor.cc to see how the two phases are called. |
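| |
| The "initialize once, compute many times" pattern can be sketched in |
| Python as follows. Every name here (PowerModel, compute, the |
| "energy_per_access_j" key) is hypothetical and does not correspond to |
| McPAT's C++ interface. The power formula is a simple placeholder: |
| energy per event times event count, divided by the interval length. |
| |
```python
class PowerModel:
    """Hypothetical stand-in for a two-phase power model (not McPAT's API)."""

    def __init__(self, config):
        # Initialization phase: expensive in a real tool, so do it once.
        # Placeholder for the internal chip representation.
        self.energy_per_event = {"access": config["energy_per_access_j"]}

    def compute(self, stats, interval_s):
        """Computation phase: cheap, called once per sampling interval."""
        energy = sum(self.energy_per_event[name] * count
                     for name, count in stats.items())
        return energy / interval_s  # average dynamic power in watts


def simulate(config, interval_stats, interval_s):
    """Build the model once, then reuse it for every interval."""
    model = PowerModel(config)
    return [model.compute(stats, interval_s) for stats in interval_stats]
```
| |
| The point of the structure is that the constructor runs once per |
| simulation, while compute() is called repeatedly with fresh statistics. |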
| |
| 6. Sample input files: |
| This package provides sample XML files for validating target processors. Please find the |
| enclosed Niagara1.xml (for the Sun Niagara1 processor), Niagara2.xml (for the Sun Niagara2 |
| processor), Alpha21364.xml (for the Alpha21364 processor), and Xeon.xml (for the Intel |
| Xeon Tulsa processor). |
| |
| Special instructions for using Xeon.xml: |
| McPAT uses ITRS device types including HP, LSTP, and LOP. Although most |
| designs follow ITRS projections, there are designs with special technologies. |
| For example, the 65nm Xeon Tulsa processor uses 1.25V rather than 1.1V |
| for the core voltage domain, which changes the threshold voltage, |
| leakage current density, saturation current, etc., in addition to the |
| supply voltage. We use MASTAR to match the special technology used in the |
| Xeon core domain. Therefore, to generate accurate results for Xeon |
| Tulsa cores, users need to run "make TAR=mcpatXeonCore" and use the generated |
| special executable. The L3 cache and buses must be computed using standard |
| ITRS technology. |
| |
| |
| ==================== |
| McPAT is at an early stage. We are still improving |
| the tool and refining the code. Please check its website |
| for newer versions. If you have any comments, |
| questions, or suggestions, please write to us. |
| |
| Version history and roadmap |
| |
| McPAT Alpha: released Sep. 2009 Experimental release |
| McPAT Beta (0.6): released Nov. 2009 New code base and technology base |
| McPAT Beta (0.7): released May 2010 Added various new models, |
| including long channel devices and bus models, together |
| with bug fixes and extensive code optimization to reduce |
| memory usage. |
| McPAT Beta (0.8): released Aug. 2010 Added various new models, |
| including on-chip 10Gb Ethernet units, PCIe, and flash controllers. |
| Next major release: |
| McPAT 1.0: including advanced power-saving states |
| |
| Future releases may include the modeling of embedded low-power |
| processors as well as vector processors and GPGPUs. |
| |
| |
| Sheng Li |
| sheng.li@hp.com |
| |
| |
| |
| |